Wednesday 2 March 2022

ARMv8-A Memory systems

 Memory management

The ARMv8-A architecture employs a weakly ordered model of memory. This means that the order of memory accesses is not necessarily required to be the same as the program order for load and store operations.
 
During the optimization process, the processor and system elements can reorder memory read operations with respect to each other to improve data throughput. Writes can also be reordered. This means that the required bandwidth between the processor and external memory can be reduced and the long latencies that are associated with such external memory accesses are hidden.

 For such reordering to be possible, memory must be marked with a type that permits these optimizations.

Hardware can reorder reads and writes to Normal memory, subject only to address and data dependencies, explicit memory barrier instructions, and half barriers (load-acquire and store-release). Certain situations require stronger ordering rules. You can provide information to the core about this through the memory type attribute of the translation table entry that describes that memory.

High-performance systems can support techniques such as speculative memory reads, multiple issuing of instructions, or out-of-order execution. These, along with other techniques, offer further possibilities for hardware reordering of memory accesses.

Page table entry

  • Holds the virtual-to-physical address translation, as well as the attributes for that address
  • Some bits are available to the OS, such as dirty and accessed (PTE_YOUNG / PTE_OLD in Linux)
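To make the entry layout concrete, here is a minimal sketch that decodes a few fields of a descriptor. It assumes a stage 1, level 3 page descriptor with the 4KB granule and a 48-bit output address. The names `pte_info` and `pte_decode` are hypothetical helpers for illustration, not part of any real kernel API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: stage 1, level 3 page descriptor, 4KB granule. */
#define PTE_VALID    (1ULL << 0)              /* bit 0: descriptor is valid */
#define PTE_AF       (1ULL << 10)             /* bit 10: Access Flag ("accessed") */
#define PTE_OA_MASK  0x0000FFFFFFFFF000ULL    /* bits [47:12]: output address */

/* Hypothetical helper type holding the decoded fields. */
typedef struct {
    bool     valid;       /* translation may be used */
    bool     accessed;    /* Access Flag is set */
    unsigned attr_index;  /* AttrIndx, bits [4:2]: selects a MAIR_ELx byte */
    uint64_t phys_page;   /* physical address of the 4KB page */
} pte_info;

static pte_info pte_decode(uint64_t desc)
{
    pte_info info;
    info.valid      = (desc & PTE_VALID) != 0;
    info.accessed   = (desc & PTE_AF) != 0;
    info.attr_index = (unsigned)((desc >> 2) & 0x7u);
    info.phys_page  = desc & PTE_OA_MASK;
    return info;
}
```

The memory type is not stored in the entry itself: the entry only carries an index, and the attribute it selects lives in a system register.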

Memory type = normal

    Normal memory is used for all code and for most data regions in memory. Examples of Normal memory include areas of RAM, Flash, or ROM in physical memory. This kind of memory provides the highest processor performance as it is weakly ordered and the compiler can perform more optimizations. The processor can reorder, repeat, and merge accesses to Normal memory.

  • Reordering
  • Merging
  • Speculation 
  • Unaligned accesses
  • Can be either cacheable or non-cacheable

Memory type = device

The Device memory type is used with memory-mapped peripherals and all memory regions where an access might have a side effect. For example, a read to a timer is not repeatable, as it returns different values for each read. A write to a control register can trigger an interrupt. The Device memory type imposes more restrictions on the core.

Speculative data accesses cannot be performed to regions of memory that are marked as Device. Trying to execute code from a region marked as Device is UNPREDICTABLE. 
 
  • Side effects
  • Cannot do speculative access
  • Cannot be executable
  • Attributes
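    • Gathering (G / nG): can accesses be merged?
    • Re-ordering (R / nR): can accesses be reordered?
    • Early write acknowledgement (E / nE): can a write be acknowledged before it reaches the device?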
  • Device type: weaker to stronger (least to most restrictive)
    • GRE -> nGRE -> nGnRE -> nGnRnE
    • Hardware may treat a weaker type as a stronger one, but not the reverse
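As a rough illustration of how these memory types are made visible to the core, the sketch below packs attribute bytes into a MAIR_ELx-style 64-bit value, which translation table entries then select by index. The byte encodings are the architectural ones. The helper names `mair_set` and `mair_is_device` are hypothetical, and the code only computes the value; it does not write the system register.

```c
#include <stdint.h>

/* Attribute byte encodings used in MAIR_ELx. */
enum {
    MAIR_DEVICE_nGnRnE = 0x00,  /* most restrictive Device type */
    MAIR_DEVICE_nGnRE  = 0x04,
    MAIR_DEVICE_nGRE   = 0x08,
    MAIR_DEVICE_GRE    = 0x0C,  /* least restrictive Device type */
    MAIR_NORMAL_NC     = 0x44,  /* Normal, non-cacheable */
    MAIR_NORMAL_WB     = 0xFF   /* Normal, write-back cacheable */
};

/* Hypothetical helper: place one attribute byte in slot 0..7. */
static uint64_t mair_set(uint64_t mair, unsigned index, uint8_t attr)
{
    unsigned shift = (index & 7u) * 8u;
    mair &= ~(0xFFULL << shift);          /* clear the old byte */
    return mair | ((uint64_t)attr << shift);
}

/* Hypothetical helper: Device encodings have an upper nibble of 0b0000. */
static int mair_is_device(uint64_t mair, unsigned index)
{
    uint8_t attr = (uint8_t)(mair >> ((index & 7u) * 8u));
    return (attr & 0xF0u) == 0;
}
```

A typical setup might reserve one slot for Device-nGnRnE (peripherals) and one for Normal write-back memory (RAM), then set AttrIndx in each page table entry accordingly.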

Barriers

The ARM architecture includes barrier instructions to force access ordering and access completion at a specific point. Barriers are used to prevent unsafe optimizations from occurring and to enforce a specific memory ordering. Use of unnecessary barrier instructions can therefore reduce software performance. Consider carefully whether a barrier is necessary in a specific situation, and if so, which is the correct barrier to use.

       ISB  -- Instruction synchronization barrier
       DMB  -- Data memory barrier
       DSB  -- Data synchronization barrier
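In portable C, ordering is usually requested through C11 atomics rather than by writing barrier instructions directly. The following is a minimal sketch of the classic message-passing pattern. The names `payload`, `ready`, `publish` and `try_consume` are made up for this example. On AArch64, compilers typically lower these fences to DMB instructions, but that mapping is an assumption about the toolchain, not something the C standard guarantees.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static uint32_t    payload;        /* ordinary data being handed over */
static atomic_bool ready = false;  /* flag that signals "payload is valid" */

/* Producer: write the data, then set the flag. */
static void publish(uint32_t value)
{
    payload = value;
    /* Release fence: the payload write may not be reordered after the flag write. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Consumer: returns true and fills *out once the flag has been seen. */
static bool try_consume(uint32_t *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    /* Acquire fence: the payload read may not be reordered before the flag read. */
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

Without the two fences, a weakly ordered core would be free to make the flag visible before the payload, so the consumer could read stale data.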



MMU (memory management unit)

  • Software defines the translation tables; the MMU reads them and provides the translation service to the core
  • TLB (translation look-aside buffer) + PTW (page table walker)
    • The TLB of most modern ARM cores also caches intermediate steps of the translation to speed up the process
  • The MMU sits before the cache, so the cache works with physical addresses and is not affected by changes in address translation

Virtual address space

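AArch64 uses 48-bit virtual addresses, and there are two virtual address ranges: an upper one for the kernel (not available in EL2 and EL3) and a lower one for applications. Each range has its own set of translation tables, and both sets are held in memory. A TTBR (translation table base register) points to the base of each set: TTBR1_EL1 for the kernel range and TTBR0_EL1 for the application range.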
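The split between the two ranges can be sketched as a check on the top 16 bits of a 64-bit address. This assumes a 48-bit address size for both ranges and no address tagging (top-byte ignore disabled). The type `va_range` and the function `va_classify` are hypothetical names for illustration.

```c
#include <stdint.h>

typedef enum {
    VA_LOWER_TTBR0,  /* application range, translated via TTBR0_EL1 */
    VA_UPPER_TTBR1,  /* kernel range, translated via TTBR1_EL1 */
    VA_INVALID       /* falls in the unmapped hole between the two ranges */
} va_range;

/* Assumes 48-bit virtual addresses and no tagged addresses. */
static va_range va_classify(uint64_t va)
{
    uint64_t top = va >> 48;   /* bits [63:48] */
    if (top == 0x0000u)
        return VA_LOWER_TTBR0;
    if (top == 0xFFFFu)
        return VA_UPPER_TTBR1;
    return VA_INVALID;
}
```

Addresses whose upper bits are neither all zeros nor all ones belong to neither range, and an access to them faults.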

Translation table (page table)

  • 3 or 4 levels of tables, depending on the page size
  • 3 different page sizes (translation granules)
    • 4KB, 16KB or 64KB
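To show how a table walk consumes an address, the sketch below splits a virtual address into per-level table indices. It assumes the 4KB granule with 48-bit addresses, where each table holds 512 entries (9 index bits per level) and the walk uses four levels, 0 to 3. The names `va_indices` and `va_split_4k` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical result type: one index per table level plus the page offset. */
typedef struct {
    unsigned l0;      /* bits [47:39] */
    unsigned l1;      /* bits [38:30] */
    unsigned l2;      /* bits [29:21] */
    unsigned l3;      /* bits [20:12] */
    unsigned offset;  /* bits [11:0], byte offset inside the 4KB page */
} va_indices;

/* Assumes 4KB granule, 48-bit virtual address. */
static va_indices va_split_4k(uint64_t va)
{
    va_indices ix;
    ix.l0     = (unsigned)((va >> 39) & 0x1FFu);
    ix.l1     = (unsigned)((va >> 30) & 0x1FFu);
    ix.l2     = (unsigned)((va >> 21) & 0x1FFu);
    ix.l3     = (unsigned)((va >> 12) & 0x1FFu);
    ix.offset = (unsigned)(va & 0xFFFu);
    return ix;
}
```

With larger granules the offset field widens and the number of index bits per level changes, so the same idea applies with different shift amounts.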

Translation regimes

  • EL3 secure monitor table
  • EL2 hypervisor table
  • EL1/EL0 go through 2 stages of translation tables for virtualization

Secure physical address spaces

  • Secure vs non-secure
    • Non-secure programs in EL1/EL0 can only access non-secure physical addresses
    • Secure EL1/EL0 programs can access both
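The access rule above can be written down as a small predicate. This is only a model of the rule as stated in these notes, with hypothetical names (`world`, `pa_space`, `access_permitted`); real hardware enforces it in the memory system, not in software.

```c
#include <stdbool.h>

typedef enum { WORLD_NON_SECURE, WORLD_SECURE } world;  /* state of the requester */
typedef enum { PA_NON_SECURE, PA_SECURE } pa_space;     /* target physical address space */

/* Secure software may reach both spaces; non-secure software only the non-secure one. */
static bool access_permitted(world requester, pa_space target)
{
    return requester == WORLD_SECURE || target == PA_NON_SECURE;
}
```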

References:

https://developer.arm.com/documentation/100941/0100/Memory-types (armv8_a_memory_systems_100941_0100_en.pdf)

https://phdbreak99.github.io/blog/arch/2019-03-18-armv8-architecture/

https://stackoverflow.com/questions/65684882/is-memory-reordering-equivalent-to-instruction-reordering
