Thursday, 12 August 2021

Linux memory management paging_init

The previous article introduced the creation of the startup page table, the mapping of the DTB physical address through the fixmap, and the management of physical memory by memblock. At this point physical memory can be allocated through memblock, but the allocated memory cannot be accessed yet because it has no virtual mapping. The memory managed by memblock has to be mapped first.

This article walks through the kernel's page table initialization function paging_init(), based on arm64 and Linux kernel 5.4.0.

void __init paging_init(void)
{
  pgd_t *pgdp = pgd_set_fixmap(__pa_symbol(swapper_pg_dir)); -------1

  map_kernel(pgdp); --------------2
  map_mem(pgdp); ----------------3

  pgd_clear_fixmap();

  cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); -------------4
  init_mm.pgd = swapper_pg_dir;

  memblock_free(__pa_symbol(init_pg_dir),
         __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); -----------------5

  memblock_allow_resize();
}

  1. The physical address of the swapper_pg_dir page table is mapped to the FIX_PGD slot of the fixmap, and swapper_pg_dir is then used as the pgd page table of the kernel. Because page tables are built through virtual addresses, the physical address has to be converted to a virtual address, pgdp. The buddy system is not ready at this point, so the PGD can only be reached through the fixmap slot reserved for it. pgdp is now the virtual address of the FIX_PGD slot, which maps the physical memory of swapper_pg_dir.
  2. Map the segments of the kernel image (.text, .rodata, .init, .data, .bss).
  3. Map the physical memory added by the memblock subsystem into the linear mapping area.
  4. Switch the ttbr1 register to the newly prepared swapper_pg_dir page table. Note that ttbr1 holds a physical address, so cpu_replace_ttbr1() first converts the address of the swapper_pg_dir page table to a physical address.
     /*
      * Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD,
      * avoiding the possibility of conflicting TLB entries being allocated.
      */
     static inline void __nocfi cpu_replace_ttbr1(pgd_t *pgdp)
     {
       typedef void (ttbr_replace_func)(phys_addr_t);
       extern ttbr_replace_func idmap_cpu_replace_ttbr1;
       ttbr_replace_func *replace_phys;

       /* phys_to_ttbr() zeros lower 2 bits of ttbr with 52-bit PA */
       phys_addr_t ttbr1 = phys_to_ttbr(virt_to_phys(pgdp));

       if (system_supports_cnp() && !WARN_ON(pgdp != lm_alias(swapper_pg_dir))) {
         /*
          * cpu_replace_ttbr1() is used when there's a boot CPU
          * up (i.e. cpufeature framework is not up yet) and
          * latter only when we enable CNP via cpufeature's
          * enable() callback.
          * Also we rely on the cpu_hwcap bit being set before
          * calling the enable() function.
          */
         ttbr1 |= TTBR_CNP_BIT;
       }

       replace_phys = (void *)__pa_function(idmap_cpu_replace_ttbr1);

       cpu_install_idmap();
       replace_phys(ttbr1);
       cpu_uninstall_idmap();
     }

     After that, the pgd page table of the init_mm process is also switched from init_pg_dir to swapper_pg_dir.
  5. map_kernel() has already remapped the segments of the kernel image into swapper_pg_dir, so init_pg_dir is no longer needed and the memory it occupies is released.

As shown in the figure below, before and after paging_init() is executed,
the base address of the page table saved by ttbr1 is switched from init_pg_dir to swapper_pg_dir.
The base address of the swapper_pg_dir page table is used as the base address of the PGD page table.




map_kernel:

map_kernel() maps the various segments of the kernel image. For the kernel to run normally, every address it uses has to be mapped. As mentioned above, identity mapping was used in the early stage of boot, but it was only a temporary mapping. The physical memory occupied by PGD/PUD/PMD/PTE is contiguous and needs to be remapped. Before kernel 4.6 the kernel image lived in the linear mapping area, so this step was not required. To support KASLR, the later patch "arm64: move kernel image to base of vmalloc area" moved the kernel image into the vmalloc area.

static void __init map_kernel(pgd_t *pgdp)
{
  static struct vm_struct vmlinux_text, vmlinux_rodata, vmlinux_inittext,
        vmlinux_initdata, vmlinux_data;

  /*
   * External debuggers may need to write directly to the text
   * mapping to install SW breakpoints. Allow this (only) when
   * explicitly requested with rodata=off.
   */
  pgprot_t text_prot = rodata_enabled ? PAGE_KERNEL_ROX : PAGE_KERNEL_EXEC;

  /*
   * Only rodata will be remapped with different permissions later on,
   * all other segments are allowed to use contiguous mappings.
   */
  map_kernel_segment(pgdp, _text, _etext, text_prot, &vmlinux_text, 0,
         VM_NO_GUARD); -----------1
  map_kernel_segment(pgdp, __start_rodata, __inittext_begin, PAGE_KERNEL,
         &vmlinux_rodata, NO_CONT_MAPPINGS, VM_NO_GUARD); -------2
  map_kernel_segment(pgdp, __inittext_begin, __inittext_end, text_prot,
         &vmlinux_inittext, 0, VM_NO_GUARD); ----------------3
  map_kernel_segment(pgdp, __initdata_begin, __initdata_end, PAGE_KERNEL,
         &vmlinux_initdata, 0, VM_NO_GUARD); ------------4
  map_kernel_segment(pgdp, _data, _end, PAGE_KERNEL, &vmlinux_data, 0, 0); -----5

  if (!READ_ONCE(pgd_val(*pgd_offset_raw(pgdp, FIXADDR_START)))) {
    /*
     * The fixmap falls in a separate pgd to the kernel, and doesn't
     * live in the carveout for the swapper_pg_dir. We can simply
     * re-use the existing dir for the fixmap.
     */
    set_pgd(pgd_offset_raw(pgdp, FIXADDR_START),
      READ_ONCE(*pgd_offset_k(FIXADDR_START))); ---------6
  } else if (CONFIG_PGTABLE_LEVELS > 3) {
    pgd_t *bm_pgdp;
    pud_t *bm_pudp;
    /*
     * The fixmap shares its top level pgd entry with the kernel
     * mapping. This can really only occur when we are running
     * with 16k/4 levels, so we can simply reuse the pud level
     * entry instead.
     */
    BUG_ON(!IS_ENABLED(CONFIG_ARM64_16K_PAGES));
    bm_pgdp = pgd_offset_raw(pgdp, FIXADDR_START);
    bm_pudp = pud_set_fixmap_offset(bm_pgdp, FIXADDR_START);
    pud_populate(&init_mm, bm_pudp, lm_alias(bm_pmd));
    pud_clear_fixmap();
  } else {
    BUG();
  }

  kasan_copy_shadow(pgdp);
}

1 to 5. Call map_kernel_segment() in turn to map the text, rodata, init text, init data, and data (including bss) segments. Note that the rodata segment is mapped with the NO_CONT_MAPPINGS flag. There are currently two flags:

#define NO_BLOCK_MAPPINGS BIT(0)
#define NO_CONT_MAPPINGS BIT(1)

NO_BLOCK_MAPPINGS forbids block mappings (mappings of huge pages).
NO_CONT_MAPPINGS forbids setting the contiguous bit on runs of contiguous physical pages.

Why is contiguous mapping restricted for the rodata segment? See this patch: arm64: mm: set the contiguous bit for kernel mappings where appropriate

When contiguous physical pages are mapped, setting the contiguous bit in their page table entries lets a run of entries share a single TLB entry (for contiguous TLB support, see the patch "arm64: Add support for PTE contiguous bit"), which reduces TLB pressure. This comes with a restriction: every entry in a contiguous range must carry the same attributes, and a live contiguous mapping cannot be remapped with different permissions. As the comment in map_kernel() says, only rodata is remapped with different permissions later on: it is mapped with PAGE_KERNEL here and made read-only afterwards, so contiguous mappings cannot be used for it.
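To make the TLB saving concrete, here is a minimal user-space sketch that counts the best-case number of TLB entries needed to cover a range with and without the contiguous hint. It assumes a 4 KB granule, where 16 ptes form one 64 KB contiguous group. The function name tlb_entries_needed() is made up for illustration and is not a kernel API, and real hardware is free to ignore the hint.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE      4096ULL                   /* 4 KB granule (assumed) */
#define CONT_PTES      16ULL                     /* ptes per contiguous group at 4 KB */
#define CONT_PTE_SIZE  (CONT_PTES * PAGE_SIZE)   /* 64 KB */

/*
 * Best-case number of TLB entries needed to cover [start, start + size).
 * With use_cont set, every fully covered, 64 KB aligned group costs one
 * entry instead of sixteen. Hypothetical helper, not a kernel function.
 */
uint64_t tlb_entries_needed(uint64_t start, uint64_t size, int use_cont)
{
    uint64_t end = start + size;
    uint64_t addr = start;
    uint64_t entries = 0;

    assert(start % PAGE_SIZE == 0 && size % PAGE_SIZE == 0);

    while (addr < end) {
        if (use_cont && addr % CONT_PTE_SIZE == 0 &&
            end - addr >= CONT_PTE_SIZE)
            addr += CONT_PTE_SIZE;   /* one entry for the whole group */
        else
            addr += PAGE_SIZE;       /* one entry per page */
        entries++;
    }
    return entries;
}
```

For a 2 MB aligned range this gives 512 entries without the hint and 32 with it.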

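6. FIXADDR_START: the page tables for the fixmap were already set up earlier. If the fixmap falls in a pgd entry separate from the kernel image, that entry is still empty in swapper_pg_dir, so the existing pgd entry is simply copied over and the fixmap tables are reused.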


map_mem:

map_mem() completes the mapping of the physical memory added to memblock.


static void __init map_mem(pgd_t *pgdp)
{
  phys_addr_t kernel_start = __pa_symbol(_text);
  phys_addr_t kernel_end = __pa_symbol(__init_begin);
  struct memblock_region *reg;
  int flags = 0;
 
  if (rodata_full || debug_pagealloc_enabled())
    flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
  /*
  * Take care not to create a writable alias for the
  * read-only text and rodata sections of the kernel image.
  * So temporarily mark them as NOMAP to skip mappings in
  * the following for-loop
  */
  memblock_mark_nomap(kernel_start, kernel_end - kernel_start); ----------------1
 #ifdef CONFIG_KEXEC_CORE
  if (crashk_res.end)
    memblock_mark_nomap(crashk_res.start,
          resource_size(&crashk_res));
 #endif
 
  /* map all the memory banks */
  for_each_memblock(memory, reg) { ----------------------------2
    phys_addr_t start = reg->base;
    phys_addr_t end = start + reg->size;

    if (start >= end)
      break;
    if (memblock_is_nomap(reg))
      continue;

    __map_memblock(pgdp, start, end, PAGE_KERNEL, flags);
  }
 
  /*
  * Map the linear alias of the [_text, __init_begin) interval
  * as non-executable now, and remove the write permission in
  * mark_linear_text_alias_ro() below (which will be called after
  * alternative patching has completed). This makes the contents
  * of the region accessible to subsystems such as hibernate,
  * but protects it from inadvertent modification or execution.
  * Note that contiguous mappings cannot be remapped in this way,
  * so we should avoid them here.
  */
  __map_memblock(pgdp, kernel_start, kernel_end,
         PAGE_KERNEL, NO_CONT_MAPPINGS); --------------------3
  memblock_clear_nomap(kernel_start, kernel_end - kernel_start);------------------4
 
 #ifdef CONFIG_KEXEC_CORE
  /*
  * Use page-level mappings here so that we can shrink the region
  * in page granularity and put back unused memory to buddy system
  * through /sys/kernel/kexec_crash_size interface.
  */
  if (crashk_res.end) {
    __map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
           PAGE_KERNEL,
           NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
    memblock_clear_nomap(crashk_res.start,
           resource_size(&crashk_res));
  }
 #endif
}

1. Temporarily mark the kernel text and rodata range [kernel_start, kernel_end) as MEMBLOCK_NOMAP. Regions carrying this flag are skipped by the loop below.
2. Traverse each region in memblock.memory and map it into the linear mapping area.
3. Map the physical range [kernel_start, kernel_end), which holds the text and rodata segments of the kernel, into the linear mapping area. These two segments were already mapped by map_kernel(), so this is a second mapping (a linear alias) of the same memory. The reason is given in the code comment: other subsystems (such as hibernate) reference the kernel text and rodata through their linear mapping addresses, so the alias has to exist, but it is mapped non-executable and its write permission is removed later.
4. Clear the NOMAP flag on the kernel range again.
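The mark, skip, and clear sequence above can be modelled in user space. The sketch below is a simplification under stated assumptions: struct region, set_nomap(), and bytes_mapped_by_loop() are hypothetical names, and the regions are assumed to be already split at the boundaries of the marked range (the real memblock code splits regions itself when a range is marked).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REGION_NOMAP 0x1u   /* stand-in for MEMBLOCK_NOMAP */

/* Simplified stand-in for struct memblock_region. */
struct region {
    uint64_t base;
    uint64_t size;
    unsigned flags;
};

/* Set or clear NOMAP on every region lying fully inside [base, base + size). */
void set_nomap(struct region *r, size_t n, uint64_t base, uint64_t size, int on)
{
    assert(r != NULL);
    for (size_t i = 0; i < n; i++) {
        if (r[i].base >= base && r[i].base + r[i].size <= base + size) {
            if (on)
                r[i].flags |= REGION_NOMAP;
            else
                r[i].flags &= ~REGION_NOMAP;
        }
    }
}

/* Bytes the main loop would map: regions flagged NOMAP are skipped. */
uint64_t bytes_mapped_by_loop(const struct region *r, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++) {
        if (r[i].flags & REGION_NOMAP)
            continue;            /* same test as memblock_is_nomap() */
        total += r[i].size;
    }
    return total;
}
```

With three regions where the middle one holds the kernel text and rodata, marking it NOMAP makes the loop map only the other two. The middle region is then mapped separately with NO_CONT_MAPPINGS and the flag is cleared.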

 As shown in the figure below, memory.region[0]/[2]/[3] are mapped to the linear mapping region, and memory.region[1] is mapped twice: to the kimage region and to the linear mapping region.

create_pgd_mapping:
 
 All of these page table mappings eventually call the __create_pgd_mapping() function.
 arch/arm64/mm/mmu.c
 
 static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
  unsigned long virt, phys_addr_t size,
  pgprot_t prot,
  phys_addr_t (*pgtable_alloc)(int),
  int flags)
 {
  unsigned long addr, length, end, next;
  pgd_t *pgdp = pgd_offset_raw(pgdir, virt);----------1
 
  /*
  * If the virtual and physical address don't have the same offset
  * within a page, we cannot map the region as the caller expects.
  */
  if (WARN_ON((phys ^ virt) & ~PAGE_MASK))
    return;
 
  phys &= PAGE_MASK; ---------|
  addr = virt & PAGE_MASK; ----------|
  length = PAGE_ALIGN(size + (virt & ~PAGE_MASK));-----------------2 
 
  end = addr + length;
  do {
    next = pgd_addr_end(addr, end);-------------3
    alloc_init_pud(pgdp, addr, next, phys, prot, pgtable_alloc,
           flags); ---------------4
    phys += next - addr;
  } while (pgdp++, addr = next, addr != end);
}

1. Get the pgd entry that covers virt. This is equivalent to pgd_t *pgdp = &pgdir[pgd_index(virt)]. The entries stored in the pgd table hold physical addresses, but pgdir itself is a virtual address, so the entry pointer computed here can be dereferenced directly.
2. The three lines above round the starting physical address and the starting virtual address down to a page boundary, and round the length up to a whole number of pages, so that the range covers every page touched by [virt, virt + size).
3. The address range [addr, end) may span several pgd entries, so the loop advances in steps of next - addr, which is at most PGDIR_SIZE, and calls alloc_init_pud() for each step until the whole range is mapped.
4. alloc_init_pud() initializes the content of the pgd entry and the next-level PUD page table.
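The alignment arithmetic and the pgd stepping can be checked with a small user-space sketch. It assumes 4 KB pages and a 4-level layout, so that one pgd entry covers 512 GB (PGDIR_SHIFT of 39). The helper names mapping_length(), pgd_range_end(), and pgd_entries_visited() are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT   12                          /* 4 KB pages (assumed) */
#define PAGE_SIZE    (1ULL << PAGE_SHIFT)
#define PAGE_MASK    (~(PAGE_SIZE - 1))
#define PGDIR_SHIFT  39                          /* 4 levels, 48-bit VA (assumed) */
#define PGDIR_SIZE   (1ULL << PGDIR_SHIFT)
#define PGDIR_MASK   (~(PGDIR_SIZE - 1))

/* Round a length up to a whole number of pages, like PAGE_ALIGN(). */
static uint64_t page_align(uint64_t x)
{
    return (x + PAGE_SIZE - 1) & PAGE_MASK;
}

/* Page-aligned length that covers [virt, virt + size). */
uint64_t mapping_length(uint64_t virt, uint64_t size)
{
    return page_align(size + (virt & ~PAGE_MASK));
}

/* End of the range covered by the current pgd entry, clamped to end. */
uint64_t pgd_range_end(uint64_t addr, uint64_t end)
{
    uint64_t boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;

    /* The "- 1" keeps the comparison correct if a value wraps to 0. */
    return (boundary - 1 < end - 1) ? boundary : end;
}

/* Number of pgd entries the do/while loop visits for [virt, virt + size). */
unsigned pgd_entries_visited(uint64_t virt, uint64_t size)
{
    uint64_t addr = virt & PAGE_MASK;
    uint64_t end = addr + mapping_length(virt, size);
    unsigned count = 0;

    assert(size > 0);
    do {
        addr = pgd_range_end(addr, end);
        count++;
    } while (addr != end);
    return count;
}
```

A 16-byte range that starts 8 bytes before a page boundary needs two pages, and a two-page range that straddles a 512 GB boundary visits two pgd entries.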

alloc_init_pud:

static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
     phys_addr_t phys, pgprot_t prot,
     phys_addr_t (*pgtable_alloc)(int),
     int flags)
{
  unsigned long next;
  pud_t *pudp;
  pgd_t pgd = READ_ONCE(*pgdp);
 
  if (pgd_none(pgd)) { --------------------------1
    phys_addr_t pud_phys;
    BUG_ON(!pgtable_alloc);
    pud_phys = pgtable_alloc(PUD_SHIFT);
    __pgd_populate(pgdp, pud_phys, PUD_TYPE_TABLE);
    pgd = READ_ONCE(*pgdp);
  }
  BUG_ON(pgd_bad(pgd));
 
  pudp = pud_set_fixmap_offset(pgdp, addr); -------------------2
  do {
    pud_t old_pud = READ_ONCE(*pudp);

    next = pud_addr_end(addr, end); ---------------3

    /*
     * For 4K granule only, attempt to put down a 1GB block
     */
    if (use_1G_block(addr, next, phys) &&
        (flags & NO_BLOCK_MAPPINGS) == 0) {
      pud_set_huge(pudp, phys, prot); ---------------4

      /*
       * After the PUD entry has been populated once, we
       * only allow updates to the permission attributes.
       */
      BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
            READ_ONCE(pud_val(*pudp))));
    } else {
      alloc_init_cont_pmd(pudp, addr, next, phys, prot,
              pgtable_alloc, flags); ------------------5

      BUG_ON(pud_val(old_pud) != 0 &&
             pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
    }
    phys += next - addr;
  } while (pudp++, addr = next, addr != end);
 
  pud_clear_fixmap();
}
 
1. Check whether the current pgd entry is empty. If it is, the next-level pud page table does not exist yet and has to be allocated dynamically. The memory comes from the pgtable_alloc callback, which is backed by memblock at this stage (one page, that is, 512 entries with 4 KB pages). __pgd_populate() then links the pgd entry to the newly allocated PUD page table.

2. Get the pud entry that corresponds to addr.

3. As with the pgd, [addr, end) may span more than one pud entry, so the loop fills the pud entries in steps of at most PUD_SIZE.

4. Decide whether a 1 GB block can be used for the mapping. If so, a single PUD entry maps the whole 1 GB, and no PMD or PTE page tables are needed.

5. If a block mapping cannot be used, the next level of page table is set up through alloc_init_cont_pmd().
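The 1 GB block decision in step 4 comes down to an alignment test on the virtual start, the virtual end of the step, and the physical address. The following is a user-space sketch of that test, assuming a 4 KB granule. can_use_1g_block() is a renamed stand-in written for this article, not the kernel's own helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT  12                        /* 4 KB granule (assumed) */
#define PUD_SHIFT   30
#define PUD_SIZE    (1ULL << PUD_SHIFT)       /* 1 GB */
#define PUD_MASK    (~(PUD_SIZE - 1))

/*
 * True when [addr, next) -> phys can be covered by one 1 GB block entry.
 * The caller is expected to pass a non-empty range that does not wrap.
 */
bool can_use_1g_block(uint64_t addr, uint64_t next, uint64_t phys)
{
    assert(addr < next);

    /* 1 GB blocks at the PUD level are only attempted with 4 KB pages. */
    if (PAGE_SHIFT != 12)
        return false;

    /* Any low bit set in addr, next, or phys breaks 1 GB alignment. */
    return ((addr | next | phys) & ~PUD_MASK) == 0;
}
```

Even when this test passes, the caller still falls back to the PMD path if NO_BLOCK_MAPPINGS is set in flags.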

alloc_init_cont_pmd() follows the same steps as above, so they are not repeated here.
We jump directly to the mapping of the pte page table.

alloc_init_cont_pte:

static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
  unsigned long end, phys_addr_t phys,
  pgprot_t prot,
  phys_addr_t (*pgtable_alloc)(int),
  int flags)
 {
  unsigned long next;
  pmd_t pmd = READ_ONCE(*pmdp);
 
  BUG_ON(pmd_sect(pmd)); ------------------1
  if (pmd_none(pmd)) { --------------------2
    phys_addr_t pte_phys;
    BUG_ON(!pgtable_alloc);
    pte_phys = pgtable_alloc(PAGE_SHIFT);
    __pmd_populate(pmdp, pte_phys, PMD_TYPE_TABLE);
    pmd = READ_ONCE(*pmdp);
  }
  BUG_ON(pmd_bad(pmd));
 
  do {
    pgprot_t __prot = prot;

    next = pte_cont_addr_end(addr, end); ----------3

    /* use a contiguous mapping if the range is suitably aligned */
    if ((((addr | next | phys) & ~CONT_PTE_MASK) == 0) &&
        (flags & NO_CONT_MAPPINGS) == 0)
      __prot = __pgprot(pgprot_val(prot) | PTE_CONT);--------4

    init_pte(pmdp, addr, next, phys, __prot); ---------5

    phys += next - addr;
  } while (addr = next, addr != end);
}

1. If a section mapping has already been established for this pmd entry, BUG_ON() fires. Section mappings belong to the mappings created during the kernel startup phase and should not exist here.
2. If the pmd entry is empty, the next-level pte page table does not exist yet and has to be allocated dynamically (one page, that is, 512 pte entries with 4 KB pages). __pmd_populate() then links the pmd entry to the new pte page table.
3. Walk the range in steps of CONT_PTE_SIZE, the span covered by one contiguous group of ptes, and set the pte entries group by group.
4. The contiguous flag (PTE_CONT) is set when the virtual start, the virtual end of the step, and the physical address are all aligned to CONT_PTE_SIZE and NO_CONT_MAPPINGS is not set. When contiguous physical pages are mapped, both the virtual mapping of the kernel image and the linear mapping can use contiguous ranges of the following sizes:

  granule size |  cont PTE  |  cont PMD  |
  -------------+------------+------------+
       4 KB    |    64 KB   |   32 MB    |
      16 KB    |     2 MB   |    1 GB*   |
      64 KB    |     2 MB   |   16 GB*   |

5. init_pte() fills in the contents of the pte page table.
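Steps 3 and 4 can be replayed in user space to see which chunks of a range would receive PTE_CONT. This sketch assumes a 4 KB granule (64 KB contiguous groups) and a range that stays inside a single pte table. count_cont_chunks() and cont_end() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE         4096ULL                 /* 4 KB granule (assumed) */
#define CONT_PTE_SIZE     (16 * PAGE_SIZE)        /* 64 KB */
#define CONT_PTE_MASK     (~(CONT_PTE_SIZE - 1))
#define NO_CONT_MAPPINGS  (1 << 1)                /* same value as in the article */

/* Next CONT_PTE_SIZE boundary after addr, clamped to end. */
static uint64_t cont_end(uint64_t addr, uint64_t end)
{
    uint64_t boundary = (addr + CONT_PTE_SIZE) & CONT_PTE_MASK;

    return (boundary - 1 < end - 1) ? boundary : end;
}

/* Count the chunks of [addr, end) that would be mapped with PTE_CONT. */
unsigned count_cont_chunks(uint64_t addr, uint64_t end, uint64_t phys, int flags)
{
    unsigned cont = 0;

    assert(addr < end && addr % PAGE_SIZE == 0 && end % PAGE_SIZE == 0);
    do {
        uint64_t next = cont_end(addr, end);

        /* All three values must be 64 KB aligned, and the flag must allow it. */
        if (((addr | next | phys) & ~CONT_PTE_MASK) == 0 &&
            (flags & NO_CONT_MAPPINGS) == 0)
            cont++;

        phys += next - addr;
        addr = next;
    } while (addr != end);
    return cont;
}
```

A 128 KB range that is 64 KB aligned on both the virtual and the physical side yields two contiguous chunks. Shifting the range by one page leaves only the fully covered group in the middle, and passing NO_CONT_MAPPINGS yields none.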

init_pte():

static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
       phys_addr_t phys, pgprot_t prot)
{
  pte_t *ptep;
 
  ptep = pte_set_fixmap_offset(pmdp, addr); ---------------1
  do {
    pte_t old_pte = READ_ONCE(*ptep);

    set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot)); -----------2

    /*
     * After the PTE entry has been populated once, we
     * only allow updates to the permission attributes.
     */
    BUG_ON(!pgattr_change_is_safe(pte_val(old_pte),
          READ_ONCE(pte_val(*ptep))));

    phys += PAGE_SIZE;
  } while (ptep++, addr += PAGE_SIZE, addr != end);
 
  pte_clear_fixmap();
}

1. Map the pte table through the fixmap and find the pte entry that corresponds to the virtual address addr.
2. Loop over the range and set each pte entry. The physical address is converted to a physical page frame number (pfn) through __phys_to_pfn(), and the pfn is then combined with prot (the access attributes of the page, such as read and write permission) to form the value of the PTE entry. That combined value is written into the corresponding pte slot.
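The composition of a pte value can be shown with plain integers. The sketch below assumes 4 KB pages and a physical address that fits in the ordinary output address field, so the pfn is simply shifted back into place and ORed with the attribute bits. make_pte() and fill_ptes() are illustrative names, and the attribute value used in the usage note is an arbitrary demo value, not the real PAGE_KERNEL.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12                      /* 4 KB pages (assumed) */
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

typedef uint64_t pteval_t;

/* Frame number shifted back into place, ORed with the attribute bits. */
pteval_t make_pte(uint64_t phys, pteval_t prot)
{
    uint64_t pfn = phys >> PAGE_SHIFT;     /* drops the in-page offset */

    return (pfn << PAGE_SHIFT) | prot;
}

/*
 * Fill consecutive slots of a pte table for [addr, end), one page per
 * iteration, and return how many slots were written.
 */
unsigned fill_ptes(pteval_t *table, unsigned first_index,
                   uint64_t addr, uint64_t end,
                   uint64_t phys, pteval_t prot)
{
    unsigned n = 0;

    assert(addr < end && (addr | end) % PAGE_SIZE == 0);
    do {
        table[first_index + n] = make_pte(phys, prot);
        n++;
        phys += PAGE_SIZE;
        addr += PAGE_SIZE;
    } while (addr != end);
    return n;
}
```

For example, make_pte(0x40001234, 0x403) drops the in-page offset 0x234 and produces 0x40001403.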
 
The calling process of the overall function is shown in the figure below:


Note:

The code analysis above shows that although the entries of the pgd/pud/pmd/pte tables all store physical addresses, the code that computes and fills those entries works entirely with virtual addresses.

What paging_init() builds is the structure that translates a virtual address to a physical address. To reach the content at physical address phys through the virtual address virt_addr, the walk goes as follows:
1. Get the physical address of the swapper_pg_dir page table from ttbr1, the register that holds the kernel page table base, and convert it to the virtual address of the pgd page table.
2. Calculate the pgd entry that corresponds to virt_addr (the address of the pgd page table plus the offset derived from virt_addr). The PGD entry stores the physical address of the PUD page table, which is then converted to the virtual address of the PUD page table base.
3. The PUD and PMD levels follow the same process.
4. Finally, the virtual address of the PTE page table is obtained from the PMD entry, and the pte entry that corresponds to virt_addr is calculated. The pte entry yields the address of the physical page frame that contains phys.
5. Add the in-page offset derived from virt_addr to get the physical address that corresponds to virt_addr.
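The five steps above can be sketched as a software walk over simulated tables. This is a toy model under stated assumptions: 4 KB pages, four levels, 9 index bits per level, a single valid bit per entry (real descriptors carry more type and attribute bits), and a small array standing in for physical memory. All names (phys_mem, table_va(), table_index(), demo_map(), walk()) are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT    12
#define PTRS_PER_TBL  512
#define ADDR_MASK     0x0000fffffffff000ULL  /* address bits of an entry */
#define ENTRY_VALID   1ULL                   /* simplified: one valid bit */
#define NFRAMES       8

/* Simulated physical memory: table frame i sits at "physical" i << 12. */
static uint64_t phys_mem[NFRAMES][PTRS_PER_TBL];

/* Stand-in for the phys-to-virt step: make a table address usable. */
static uint64_t *table_va(uint64_t phys)
{
    assert((phys >> PAGE_SHIFT) < NFRAMES);
    return phys_mem[phys >> PAGE_SHIFT];
}

/* Index into the table at a level (0 = pgd, 1 = pud, 2 = pmd, 3 = pte). */
unsigned table_index(uint64_t va, int level)
{
    int shift = PAGE_SHIFT + 9 * (3 - level);

    return (unsigned)((va >> shift) & (PTRS_PER_TBL - 1));
}

/* Install one 4 KB mapping va -> pa, using frames 1..3 as pud, pmd, pte. */
void demo_map(uint64_t pgd_phys, uint64_t va, uint64_t pa)
{
    uint64_t tbl = pgd_phys;

    for (int level = 0; level < 3; level++) {
        uint64_t next = (uint64_t)(level + 1) << PAGE_SHIFT;

        table_va(tbl)[table_index(va, level)] = next | ENTRY_VALID;
        tbl = next;
    }
    table_va(tbl)[table_index(va, 3)] = (pa & ADDR_MASK) | ENTRY_VALID;
}

/* Walk from the root table; return the physical address, or 0 if unmapped. */
uint64_t walk(uint64_t pgd_phys, uint64_t va)
{
    uint64_t tbl = pgd_phys;

    for (int level = 0; level < 4; level++) {
        uint64_t entry = table_va(tbl)[table_index(va, level)];

        if (!(entry & ENTRY_VALID))
            return 0;
        tbl = entry & ADDR_MASK;   /* next table, or the final page frame */
    }
    return tbl | (va & ((1ULL << PAGE_SHIFT) - 1));
}
```

After demo_map(0, 0xffff000012345000, 0x80045000), walking 0xffff000012345678 from the root at physical address 0 returns 0x80045678: the page frame from the pte entry plus the in-page offset 0x678.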





