
Linux Physical Memory Management


Recently, while researching how to do cold-page migration for applications inside the kernel (migrating cold pages from RAM to NVM), I needed to understand how Linux manages physical memory, including:

- how Linux learns which physical memory exists and is usable (and its type);
- how Linux divides memory into zones (DMA, NORMAL, HIGHMEM, etc.);
- how NUMA nodes are created;
- how these three relate to each other.

The analysis is based on the Linux 4.17.19 source code; the experiments run in Ubuntu Server 16.04 inside qemu-kvm. The environment was set up as described in 《安装qemu-kvm以及配置桥接网络》 (installing qemu-kvm and configuring bridged networking).

After installing the 4.17.19 kernel, start qemu with the following command, giving the VM 512 MB of memory:

qemu-system-x86_64 -m 512 -enable-kvm -smp 4 -vga std ubuntu.img

After booting into the system, run dmesg to view the kernel log:

[ 0.000000] Linux version 4.17.19 (zjs@ubuntu) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)) #15 SMP Fri Oct 5 16:37:09 CST 2018
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.17.19 root=UUID=895bf2a8-6111-4d7b-92dd-e76ab8b8265a ro
[ 0.000000] x86/fpu: x87 FPU will use FXSAVE
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000001fffdfff] usable
[ 0.000000] BIOS-e820: [mem 0x000000001fffe000-0x000000001fffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] SMBIOS 2.4 present.
[ 0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[ 0.000000] e820: last_pfn = 0x1fffe max_arch_pfn = 0x400000000
[ 0.000000] MTRR default type: write-back
[ 0.000000] MTRR fixed ranges enabled:
[ 0.000000]   00000-9FFFF write-back
[ 0.000000]   A0000-BFFFF uncachable
[ 0.000000]   C0000-FFFFF write-protect
[ 0.000000] MTRR variable ranges enabled:
[ 0.000000]   0 base 0080000000 mask FF80000000 uncachable
[ 0.000000]   1 disabled
[ 0.000000]   2 disabled
[ 0.000000]   3 disabled
[ 0.000000]   4 disabled
[ 0.000000]   5 disabled
[ 0.000000]   6 disabled
[ 0.000000]   7 disabled
[ 0.000000] x86/PAT: PAT not supported by CPU.
[ 0.000000] x86/PAT: Configuration [0-7]: WB WT UC- UC WB WT UC- UC
[ 0.000000] found SMP MP-table at [mem 0x000f0ae0-0x000f0aef] mapped at [ (ptrval)]
[ 0.000000] Scanning 1 areas for low memory corruption
[ 0.000000] Base memory trampoline at [ (ptrval)] 99000 size 24576
[ 0.000000] BRK [0x04901000, 0x04901fff] PGTABLE
[ 0.000000] BRK [0x04902000, 0x04902fff] PGTABLE
[ 0.000000] BRK [0x04903000, 0x04903fff] PGTABLE
[ 0.000000] BRK [0x04904000, 0x04904fff] PGTABLE
[ 0.000000] BRK [0x04905000, 0x04905fff] PGTABLE
[ 0.000000] RAMDISK: [mem 0x1ea9d000-0x1f24bfff]
[ 0.000000] ACPI: Early table checksum verification disabled
[ 0.000000] ACPI: RSDP 0x00000000000F08D0 000014 (v00 BOCHS )
[ 0.000000] ACPI: RSDT 0x000000001FFFFCFC 000034 (v01 BOCHS BXPCRSDT 00000001 BXPC 00000001)
[ 0.000000] ACPI: FACP 0x000000001FFFF1C0 000074 (v01 BOCHS BXPCFACP 00000001 BXPC 00000001)
[ 0.000000] ACPI: DSDT 0x000000001FFFE040 001180 (v01 BOCHS BXPCDSDT 00000001 BXPC 00000001)
[ 0.000000] ACPI: FACS 0x000000001FFFE000 000040
[ 0.000000] ACPI: SSDT 0x000000001FFFF234 000A00 (v01 BOCHS BXPCSSDT 00000001 BXPC 00000001)
[ 0.000000] ACPI: APIC 0x000000001FFFFC34 000090 (v01 BOCHS BXPCAPIC 00000001 BXPC 00000001)
[ 0.000000] ACPI: HPET 0x000000001FFFFCC4 000038 (v01 BOCHS BXPCHPET 00000001 BXPC 00000001)
[ 0.000000] ACPI: Local APIC address 0xfee00000
[ 0.000000] No NUMA configuration found
[ 0.000000] Faking a node at [mem 0x0000000000000000-0x000000001fffdfff]
[ 0.000000] NODE_DATA(0) allocated [mem 0x1fffa000-0x1fffdfff]
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.000000] Zone ranges:
[ 0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.000000]   DMA32    [mem 0x0000000001000000-0x000000001fffdfff]
[ 0.000000]   Normal   empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.000000]   node   0: [mem 0x0000000000100000-0x000000001fffdfff]
[ 0.000000] Reserved but unavailable: 100 pages
[ 0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000001fffdfff]
[ 0.000000] On node 0 totalpages: 130972
[ 0.000000]   DMA zone: 64 pages used for memmap
[ 0.000000]   DMA zone: 21 pages reserved
[ 0.000000]   DMA zone: 3998 pages, LIFO batch:0
[ 0.000000]   DMA32 zone: 1984 pages used for memmap
[ 0.000000]   DMA32 zone: 126974 pages, LIFO batch:31

===========================Stage 1: obtaining the physical memory layout from the BIOS=========================

The first relevant thing in the dmesg output is the block starting with "e820: BIOS-provided physical RAM map:", quoted above. Clearly, this memory layout list comes from the BIOS. Why is it called e820? As the Wikipedia entry explains, the name refers to the BIOS call being used: issuing interrupt 0x15 with 0xE820 in the AX register returns the physical memory layout from the BIOS, one address-range descriptor per call.
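For reference, each successful call fills in a 20-byte address-range descriptor (the kernel later asserts this size with BUILD_BUG_ON(sizeof(struct boot_e820_entry) != 20)). A sketch of that layout, using a made-up struct name purely for illustration:

#include <stdint.h>

/* Sketch of the 20-byte descriptor returned by INT 0x15, AX=0xE820.
 * Illustrative only; the kernel's own definitions appear later. */
struct e820_bios_desc {                 /* hypothetical name */
	uint64_t base;                  /* start of the range (physical address) */
	uint64_t length;                /* length of the range in bytes */
	uint32_t type;                  /* 1 = usable RAM, 2 = reserved,
	                                   3 = ACPI reclaimable, 4 = ACPI NVS, ... */
} __attribute__((packed));              /* 8 + 8 + 4 = 20 bytes */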

Normally, the real-mode assembly entry code calls main() in arch/x86/boot/main.c, and main() calls detect_memory(). The code looks like this:

void main(void)
{
	/* First, copy the boot header into the "zeropage" */
	copy_boot_params();

	/* Initialize the early-boot console */
	console_init();
	if (cmdline_find_option_bool("debug"))
		puts("early console in setup code\n");

	/* End of heap check */
	init_heap();

	/* Make sure we have all the proper CPU support */
	if (validate_cpu()) {
		puts("Unable to boot - please use a kernel appropriate "
		     "for your CPU.\n");
		die();
	}

	/* Tell the BIOS what CPU mode we intend to run in. */
	set_bios_mode();

	/* Detect memory layout */
	detect_memory();

	/* Set keyboard repeat rate (why?) and query the lock flags */
	keyboard_init();

	/* Query Intel SpeedStep (IST) information */
	query_ist();

	/* Query APM information */
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
	query_apm_bios();
#endif

	/* Query EDD information */
#if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)
	query_edd();
#endif

	/* Set the video mode */
	set_video();

	/* Do the last things and invoke protected mode */
	go_to_protected_mode();
}

detect_memory() is defined in arch/x86/boot/memory.c:

int detect_memory(void)
{
	int err = -1;

	if (detect_memory_e820() > 0)
		err = 0;

	if (!detect_memory_e801())
		err = 0;

	if (!detect_memory_88())
		err = 0;

	return err;
}

The logic is simple: first try to obtain the memory layout from the BIOS with the e820 interface; if that fails, try e801, and then the old 88 interface. In practice e820 is almost always enough, so let's look at the implementation of detect_memory_e820():

static int detect_memory_e820(void)
{
	int count = 0;
	struct biosregs ireg, oreg;
	struct boot_e820_entry *desc = boot_params.e820_table;
	static struct boot_e820_entry buf; /* static so it is zeroed */

	initregs(&ireg);
	ireg.ax  = 0xe820;
	ireg.cx  = sizeof buf;
	ireg.edx = SMAP;
	ireg.di  = (size_t)&buf;

	do {
		intcall(0x15, &ireg, &oreg);
		ireg.ebx = oreg.ebx; /* for next iteration... */

		/* BIOSes which terminate the chain with CF = 1 as opposed
		   to %ebx = 0 don't always report the SMAP signature on
		   the final, failing, probe. */
		if (oreg.eflags & X86_EFLAGS_CF)
			break;

		/* Some BIOSes stop returning SMAP in the middle of
		   the search loop.  We don't know exactly how the BIOS
		   screwed up the map at that point, we might have a
		   partial map, the full map, or complete garbage, so
		   just return failure. */
		if (oreg.eax != SMAP) {
			count = 0;
			break;
		}

		*desc++ = buf;
		count++;
	} while (ireg.ebx && count < ARRAY_SIZE(boot_params.e820_table));

	return boot_params.e820_entries = count;
}

Again the logic is simple: each INT 0x15 call asks the BIOS for one entry, and the finished table lands in boot_params.e820_table. OK, the story of talking to the BIOS ends here~ Next, let's see how the kernel processes the boot_params.e820_table it obtained from the BIOS.
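To make the hand-off concrete, here is a minimal standalone sketch (ordinary user-space C, not kernel code; the struct simply mirrors the addr/size/type fields) that walks such a table and adds up the usable RAM, seeded with the first two "usable" ranges from the dmesg output above:

#include <stdint.h>
#include <stdio.h>

/* Mirrors the shape of a boot_params.e820_table entry (standalone copy). */
struct entry { uint64_t addr, size; uint32_t type; };

static uint64_t total_usable(const struct entry *tbl, unsigned n)
{
	uint64_t sum = 0;
	for (unsigned i = 0; i < n; i++)
		if (tbl[i].type == 1)           /* type 1 == usable RAM */
			sum += tbl[i].size;
	return sum;
}

int main(void)
{
	struct entry tbl[] = {
		{ 0x0000000000000000ULL, 0x9fc00ULL,                   1 },
		{ 0x0000000000100000ULL, 0x1fffe000ULL - 0x100000ULL,  1 },
	};
	printf("usable: %llu bytes\n",
	       (unsigned long long)total_usable(tbl, 2));
	return 0;
}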

=======================Stage 2: converting boot_params.e820_table into the kernel's e820_table===================

In arch/x86/kernel/x86_init.c, a function pointer selects the memory-setup function:

/*
 * The platform setup functions are preset with the default functions
 * for standard PC hardware.
 */
struct x86_init_ops x86_init __initdata = {

	.resources = {
		.probe_roms		= probe_roms,
		.reserve_resources	= reserve_standard_io_resources,
		.memory_setup		= e820__memory_setup_default,
	},

	// code omitted
};

Then, in arch/x86/kernel/setup.c, e820__memory_setup() is called. That function is defined in arch/x86/kernel/e820.c:

void __init e820__memory_setup(void)
{
	char *who;

	/* This is a firmware interface ABI - make sure we don't break it: */
	BUILD_BUG_ON(sizeof(struct boot_e820_entry) != 20);

	who = x86_init.resources.memory_setup();

	memcpy(e820_table_kexec, e820_table, sizeof(*e820_table_kexec));
	memcpy(e820_table_firmware, e820_table, sizeof(*e820_table_firmware));

	pr_info("e820: BIOS-provided physical RAM map:\n");
	e820__print_table(who);
}

Here x86_init.resources.memory_setup points exactly at e820__memory_setup_default(). The most important job of e820__memory_setup_default() is converting boot_params.e820_table into the kernel's e820_table. Afterwards, e820__memory_setup() copies the kernel's e820_table into two further copies, e820_table_kexec and e820_table_firmware, and finally prints the messages we saw in dmesg.

The conversion is needed because the entries of boot_params.e820_table have type struct boot_e820_entry, while the entries of the kernel's e820_table have type struct e820_entry, whose type field is an enum e820_type. The kernel-side types are defined in arch/x86/include/asm/e820/types.h:

enum e820_type {
	E820_TYPE_RAM		= 1,
	E820_TYPE_RESERVED	= 2,
	E820_TYPE_ACPI		= 3,
	E820_TYPE_NVS		= 4,
	E820_TYPE_UNUSABLE	= 5,
	E820_TYPE_PMEM		= 7,
	E820_TYPE_PRAM		= 12,
	E820_TYPE_RESERVED_KERN	= 128,
};

struct e820_entry {
	u64			addr;
	u64			size;
	enum e820_type		type;
} __attribute__((packed));

struct e820_table {
	__u32 nr_entries;
	struct e820_entry entries[E820_MAX_ENTRIES];
};

arch/x86/kernel/e820.c defines a set of functions for adding, removing, updating and querying entries of the e820_table. For example, the function that appends one entry to the table:

/*
 * Add a memory region to the kernel E820 map.
 */
static void __init __e820__range_add(struct e820_table *table, u64 start, u64 size, enum e820_type type)
{
	int x = table->nr_entries;

	if (x >= ARRAY_SIZE(table->entries)) {
		pr_err("e820: too many entries; ignoring [mem %#010llx-%#010llx]\n",
		       start, start + size - 1);
		return;
	}

	table->entries[x].addr = start;
	table->entries[x].size = size;
	table->entries[x].type = type;
	table->nr_entries++;
}

From this function it is obvious that e820_table is just a simple array. There is a corresponding removal operation:

/* Remove a range of memory from the E820 table: */
u64 __init e820__range_remove(u64 start, u64 size, enum e820_type old_type, bool check_type)
{
	// code omitted
}

There are also operations that sort and de-duplicate the table:

int __init e820__update_table(struct e820_table *table)
{
	// code omitted
}
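To get a feel for what "sorting and de-duplicating" means here, the following is a simplified standalone sketch: sort by start address, then merge touching ranges of the same type. The real e820__update_table() is considerably more careful (it also resolves overlapping ranges of different types); this is only an illustration:

#include <stdint.h>
#include <stdlib.h>

struct range { uint64_t addr, size; uint32_t type; };

static int cmp_addr(const void *a, const void *b)
{
	const struct range *x = a, *y = b;
	if (x->addr < y->addr) return -1;
	if (x->addr > y->addr) return 1;
	return 0;
}

/* Sort by start address and merge touching ranges of the same type.
 * Returns the new number of entries. */
static unsigned sanitize(struct range *tbl, unsigned n)
{
	unsigned i, out = 0;

	if (n == 0)
		return 0;
	qsort(tbl, n, sizeof(*tbl), cmp_addr);

	for (i = 1; i < n; i++) {
		struct range *prev = &tbl[out], *cur = &tbl[i];

		if (cur->type == prev->type &&
		    cur->addr <= prev->addr + prev->size) {
			uint64_t end = cur->addr + cur->size;
			if (end > prev->addr + prev->size)
				prev->size = end - prev->addr;   /* extend */
		} else {
			tbl[++out] = *cur;                       /* keep separate */
		}
	}
	return out + 1;
}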

The conversion code itself is very simple:

static int __init __append_e820_table(struct boot_e820_entry *entries, u32 nr_entries)
{
	struct boot_e820_entry *entry = entries;

	while (nr_entries) {
		u64 start = entry->addr;
		u64 size = entry->size;
		u64 end = start + size - 1;
		u32 type = entry->type;

		/* Ignore the entry on 64-bit overflow: */
		if (start > end && likely(size))
			return -1;

		e820__range_add(start, size, type);

		entry++;
		nr_entries--;
	}
	return 0;
}

As you can see, a boot_e820_entry and an e820_entry carry essentially the same information... Presumably the kernel takes this seemingly redundant extra step so that the boot-protocol layout (a firmware ABI) and the kernel-internal representation can change independently of each other in the future.
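For comparison, the boot-protocol side looks roughly like this (struct boot_e820_entry, from arch/x86/include/uapi/asm/bootparam.h in 4.17); the only visible difference is that type is a raw __u32 rather than enum e820_type:

/* The E820 memory region entry of the boot protocol ABI (sketched from 4.17): */
struct boot_e820_entry {
	__u64 addr;
	__u64 size;
	__u32 type;
} __attribute__((packed));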

And the implementation of e820__memory_setup_default() is essentially a call to the function above:

/*
 * Pass the firmware (bootloader) E820 map to the kernel and process it:
 */
char *__init e820__memory_setup_default(void)
{
	char *who = "BIOS-e820";

	/*
	 * Try to copy the BIOS-supplied E820-map.
	 *
	 * Otherwise fake a memory map; one section from 0k->640k,
	 * the next section from 1mb->appropriate_mem_k
	 */
	if (append_e820_table(boot_params.e820_table, boot_params.e820_entries) < 0) {
		u64 mem_size;

		/* Compare results from other methods and take the one that gives more RAM: */
		if (boot_params.alt_mem_k < boot_params.screen_info.ext_mem_k) {
			mem_size = boot_params.screen_info.ext_mem_k;
			who = "BIOS-88";
		} else {
			mem_size = boot_params.alt_mem_k;
			who = "BIOS-e801";
		}

		e820_table->nr_entries = 0;
		e820__range_add(0, LOWMEMSIZE(), E820_TYPE_RAM);
		e820__range_add(HIGH_MEMORY, mem_size << 10, E820_TYPE_RAM);
	}

	/* We just appended a lot of ranges, sanitize the table: */
	e820__update_table(e820_table);

	return who;
}

=========================Stage 3: creating NUMA nodes=========================

Once the e820 information has been fed into memblock, NUMA initialization runs via numa_init(). In our qemu VM there is no real NUMA hardware, so (as the "No NUMA configuration found" / "Faking a node" lines in dmesg show) numa_init() falls back to dummy_numa_init(), which marks node 0 as present with node_set(0, numa_nodes_parsed) and covers all of RAM with numa_add_memblk(0, 0, PFN_PHYS(max_pfn)). So it is quite clear: bit i of numa_nodes_parsed being set means that node i is available.
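numa_nodes_parsed is a nodemask_t, i.e. a bitmap with one bit per possible node, manipulated with the kernel's nodemask helpers. A small illustrative snippet (a hypothetical helper, not code from the kernel) showing how such a mask is read:

#include <linux/init.h>
#include <linux/nodemask.h>
#include <linux/printk.h>
#include <asm/numa.h>

/* Hypothetical helper, for illustration only: list the nodes that the
 * firmware (or dummy_numa_init()) reported in numa_nodes_parsed. */
static void __init dump_parsed_nodes(void)
{
	int nid;

	for_each_node_mask(nid, numa_nodes_parsed)      /* iterate set bits */
		pr_info("NUMA: node %d reported\n", nid);

	if (!node_isset(1, numa_nodes_parsed))
		pr_info("NUMA: node 1 not present\n");
}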

So what does numa_add_memblk() actually do?

static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
				     struct numa_meminfo *mi)
{
	/* ignore zero length blks */
	if (start == end)
		return 0;

	/* whine about and ignore invalid blks */
	if (start > end || nid < 0 || nid >= MAX_NUMNODES) {
		pr_warning("NUMA: Warning: invalid memblk node %d [mem %#010Lx-%#010Lx]\n",
			   nid, start, end - 1);
		return 0;
	}

	if (mi->nr_blks >= NR_NODE_MEMBLKS) {
		pr_err("NUMA: too many memblk ranges\n");
		return -EINVAL;
	}

	mi->blk[mi->nr_blks].start = start;
	mi->blk[mi->nr_blks].end = end;
	mi->blk[mi->nr_blks].nid = nid;
	mi->nr_blks++;
	return 0;
}

/**
 * numa_add_memblk - Add one numa_memblk to numa_meminfo
 * @nid: NUMA node ID of the new memblk
 * @start: Start address of the new memblk
 * @end: End address of the new memblk
 *
 * Add a new memblk to the default numa_meminfo.
 *
 * RETURNS:
 * 0 on success, -errno on failure.
 */
int __init numa_add_memblk(int nid, u64 start, u64 end)
{
	return numa_add_memblk_to(nid, start, end, &numa_meminfo);
}
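For reference, the numa_meminfo being filled here is declared in arch/x86/mm/numa_internal.h roughly as follows (quoted from memory, so treat it as a sketch):

struct numa_memblk {
	u64			start;
	u64			end;
	int			nid;	/* NUMA node this range belongs to */
};

struct numa_meminfo {
	int			nr_blks;
	struct numa_memblk	blk[NR_NODE_MEMBLKS];
};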

So numa_meminfo is just a global list whose entries are (start address, end address, node id) triples, describing which memory block belongs to which NUMA node. Once dummy_numa_init() has filled in numa_meminfo, numa_init() calls numa_register_memblks(&numa_meminfo), which associates every memblock with its NUMA node id according to numa_meminfo and then creates each NUMA node. Its logic is as follows:

static int __init numa_register_memblks(struct numa_meminfo *mi)
{
	unsigned long uninitialized_var(pfn_align);
	int i, nid;

	/* Account for nodes with cpus and no memory */
	node_possible_map = numa_nodes_parsed;
	numa_nodemask_from_meminfo(&node_possible_map, mi);
	if (WARN_ON(nodes_empty(node_possible_map)))
		return -EINVAL;

	for (i = 0; i < mi->nr_blks; i++) {
		struct numa_memblk *mb = &mi->blk[i];
		memblock_set_node(mb->start, mb->end - mb->start,
				  &memblock.memory, mb->nid);
	}

	/*
	 * At very early time, the kernel have to use some memory such as
	 * loading the kernel image. We cannot prevent this anyway. So any
	 * node the kernel resides in should be un-hotpluggable.
	 *
	 * And when we come here, alloc node data won't fail.
	 */
	numa_clear_kernel_node_hotplug();

	/*
	 * If sections array is gonna be used for pfn -> nid mapping, check
	 * whether its granularity is fine enough.
	 */
#ifdef NODE_NOT_IN_PAGE_FLAGS
	pfn_align = node_map_pfn_alignment();
	if (pfn_align && pfn_align < PAGES_PER_SECTION) {
		printk(KERN_WARNING "Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
		       PFN_PHYS(pfn_align) >> 20,
		       PFN_PHYS(PAGES_PER_SECTION) >> 20);
		return -EINVAL;
	}
#endif
	if (!numa_meminfo_cover_memory(mi))
		return -EINVAL;

	/* Finally register nodes. */
	for_each_node_mask(nid, node_possible_map) {
		u64 start = PFN_PHYS(max_pfn);
		u64 end = 0;

		for (i = 0; i < mi->nr_blks; i++) {
			if (nid != mi->blk[i].nid)
				continue;
			start = min(mi->blk[i].start, start);
			end = max(mi->blk[i].end, end);
		}

		if (start >= end)
			continue;

		/*
		 * Don't confuse VM with a node that doesn't have the
		 * minimum amount of memory:
		 */
		if (end && (end - start) < NODE_MIN_SIZE)
			continue;

		alloc_node_data(nid);
	}

	/* Dump memblock with node info and return. */
	memblock_dump_all();

	return 0;
}

After alloc_node_data(), each node has its pg_data_t (what NODE_DATA(nid) points to), the per-node structure on which the zones and the buddy allocator are built later in boot~
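As a quick illustration of what becomes reachable per node, here is a hypothetical debugging helper (not from the kernel source) that walks NODE_DATA() for every online node; note that the PFN spans it prints are filled in later during zone setup, not by alloc_node_data() itself:

#include <linux/mmzone.h>
#include <linux/nodemask.h>
#include <linux/printk.h>

/* Hypothetical helper: print the PFN range recorded in each online
 * node's pg_data_t (the structure alloc_node_data() allocates). */
static void dump_node_spans(void)
{
	int nid;

	for_each_online_node(nid) {
		pg_data_t *pgdat = NODE_DATA(nid);

		pr_info("node %d: pfn %lu - %lu\n", nid,
			pgdat->node_start_pfn,
			pgdat->node_start_pfn + pgdat->node_spanned_pages);
	}
}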

At the very end of numa_init(), numa_init_array() is called; it associates each CPU core with a NUMA node. The code is simple:

/*
 * There are unfortunately some poorly designed mainboards around that
 * only connect memory to a single CPU. This breaks the 1:1 cpu->node
 * mapping. To avoid this fill in the mapping for all possible CPUs,
 * as the number of CPUs is not known yet. We round robin the existing
 * nodes.
 */
static void __init numa_init_array(void)
{
	int rr, i;

	rr = first_node(node_online_map);
	for (i = 0; i < nr_cpu_ids; i++) {
		if (early_cpu_to_node(i) != NUMA_NO_NODE)
			continue;
		numa_set_node(i, rr);
		rr = next_node_in(rr, node_online_map);
	}
}

With two online nodes, for example, CPU core 0 gets associated with NUMA node 0, core 1 with node 1, core 2 with node 0, core 3 with node 1, and so on, round-robin.

OK, that's the whole story!

=================================A final small experiment====================================

Now that we know where and how Linux creates NUMA nodes, let me try to "fabricate" an extra NUMA node.

First, look at the current NUMA topology (e.g. with numactl --hardware):

There is only one node, all four CPU cores hang off it, and node 0 has 481 MB of memory.

Change dummy_numa_init() to something like the following; the idea is to split the last 32 MB of RAM off into a second, fake node (the exact split shown below is illustrative):

/**
 * dummy_numa_init - Fallback dummy NUMA init
 *
 * Used if there's no underlying NUMA architecture, NUMA initialization
 * fails, or NUMA is disabled on the command line.
 *
 * Must online at least one node and add memory blocks that cover all
 * allowed memory. This function must not fail.
 */
static int __init dummy_numa_init(void)
{
	printk(KERN_INFO "%s\n",
	       numa_off ? "NUMA turned off" : "No NUMA configuration found");
	printk(KERN_INFO "Faking a node at [mem %#018Lx-%#018Lx]\n",
	       0LLU, PFN_PHYS(max_pfn) - 1);

	//node_set(0, numa_nodes_parsed);
	//numa_add_memblk(0, 0, PFN_PHYS(max_pfn));

	/* Node 0 gets everything except the last 32 MB... */
	node_set(0, numa_nodes_parsed);
	numa_add_memblk(0, 0, PFN_PHYS(max_pfn) - (32 << 20));
	/* ...and a fake node 1 covers the last 32 MB (illustrative split). */
	node_set(1, numa_nodes_parsed);
	numa_add_memblk(1, PFN_PHYS(max_pfn) - (32 << 20), PFN_PHYS(max_pfn));

	return 0;
}

