1. The document discusses Linux kernel page reclamation.
2. Direct reclaim is when the caller performs reclamation directly, while daemon reclaim uses kswapd processes.
3. Daemon reclaim involves kswapd processes waking up and using kswapd_shrink_zone() to reclaim pages until all zones are above the high watermark. This helps balance memory usage across zones.
3. What's Page Frame
- page frame = A page-sized, page-aligned piece of RAM
- struct page = A one-to-one kernel structure for each page frame
- mem_map
  - A single array of struct page entries that covers all RAM the kernel manages.
  - But in a CONFIG_SPARSEMEM environment:
    - There is no single mem_map.
    - Instead, there is a list of 2MB-sized arrays of struct page entries.
    - You must use __pfn_to_page(), __page_to_pfn(), or their wrappers.
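The difference between the two lookups can be sketched in user space. This is a minimal model, not kernel code: struct page is reduced to a stub, and NR_FRAMES, FRAMES_PER_SECTION, section_table and the function names are hypothetical names and sizes chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct page. */
struct page { unsigned long flags; };

/* Flat model: one array covers every page frame. */
#define NR_FRAMES 1024UL
struct page flat_mem_map[NR_FRAMES];

struct page *flat_pfn_to_page(unsigned long pfn)
{
    assert(pfn < NR_FRAMES);
    return &flat_mem_map[pfn];
}

unsigned long flat_page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - flat_mem_map);
}

/* Sparse model: frames are grouped into sections, each with its own
 * array of struct page. The section size is illustrative only. */
#define FRAMES_PER_SECTION 256UL
#define NR_SECTIONS (NR_FRAMES / FRAMES_PER_SECTION)

struct page section_maps[NR_SECTIONS][FRAMES_PER_SECTION];

/* A NULL entry here would model a hole in physical memory. */
struct page *section_table[NR_SECTIONS] = {
    section_maps[0], section_maps[1], section_maps[2], section_maps[3],
};

struct page *sparse_pfn_to_page(unsigned long pfn)
{
    unsigned long nr = pfn / FRAMES_PER_SECTION;

    assert(nr < NR_SECTIONS && section_table[nr] != NULL);
    return &section_table[nr][pfn % FRAMES_PER_SECTION];
}
```

In the sparse model, plain pointer arithmetic on a single base address no longer works, which is why callers have to go through a conversion helper.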
4. What's NUMA
- NUMA (Non-Uniform Memory Access)
  - The system is composed of nodes.
  - Each node is defined by a set of CPUs and one physical memory range.
  - Memory access latency differs depending on the source and destination nodes.
- NUMA configuration
  - ACPI provides the NUMA configuration:
    - SRAT (Static Resource Affinity Table)
      - Tells which CPUs and memory ranges belong to which NUMA node.
    - SLIT (System Locality Information Table)
      - Tells how far one NUMA node is from another.
5. What's Memory Zone
- Physical memory is separated by address range:
  - ZONE_DMA: <16MB
  - ZONE_DMA32: <4GB
  - ZONE_NORMAL: the rest
  - ZONE_MOVABLE: none by default.
    - This is used to define a hot-removable physical memory range.
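The address ranges above can be sketched as a simple classifier. This is an illustrative user-space model using the limits from the slide. The function names and NR_ZONES_SKETCH are hypothetical, and ZONE_MOVABLE is left out because it is empty by default.

```c
#include <assert.h>
#include <stdint.h>

enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, NR_ZONES_SKETCH };

#define DMA_LIMIT   (UINT64_C(16) << 20)  /* 16MB */
#define DMA32_LIMIT (UINT64_C(4) << 30)   /* 4GB */

/* Map a physical address to the zone whose range contains it. */
enum zone_type zone_for_phys_addr(uint64_t paddr)
{
    if (paddr < DMA_LIMIT)
        return ZONE_DMA;
    if (paddr < DMA32_LIMIT)
        return ZONE_DMA32;
    return ZONE_NORMAL;
}

/* Human-readable zone name, for debugging output. */
const char *zone_name(enum zone_type z)
{
    static const char *const names[NR_ZONES_SKETCH] = {
        "DMA", "DMA32", "Normal",
    };

    assert((unsigned int)z < NR_ZONES_SKETCH);
    return names[z];
}
```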
6. Memory node, zone
[Figure: physical addresses split into Range1 and Range2; CPU1 to CPU4; NUMA node1 and NUMA node2, each with its own struct pglist_data]
struct pglist_data {
    struct zone node_zones[MAX_NR_ZONES];
    ...
};
- Every pglist_data has a struct zone for each zone type (DMA through MOVABLE), although some of them may be empty.
7. Memory Allocation
1. First, the kernel checks the thresholds of each zone (threshold = watermark and dirty ratio).
  - If every zone fails the check, the kernel enters the page reclaim path (= today's topic).
2. If some zone passes, the kernel allocates a page from that zone's buddy system.
  - An order-0 page is allocated from the per-cpu cache.
  - A higher-order page is taken from the per-order lists of pages.
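The two steps above can be sketched as a zone scan. This is a simplified model under stated assumptions: struct zone_sketch and pick_zone() are hypothetical, only the watermark part of the threshold is modeled (the dirty-ratio check is omitted), and a result of -1 stands for "enter the page reclaim path".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified zone: only the fields this sketch needs. */
struct zone_sketch {
    unsigned long free_pages;
    unsigned long watermark;  /* free pages must not drop below this */
};

/* Return the index of the first zone that can hand out 2^order pages
 * without dropping below its watermark, or -1 if every zone fails. */
int pick_zone(const struct zone_sketch *zones, size_t nr_zones,
              unsigned int order)
{
    unsigned long want = 1UL << order;

    assert(zones != NULL);
    for (size_t i = 0; i < nr_zones; i++) {
        if (zones[i].free_pages >= want &&
            zones[i].free_pages - want >= zones[i].watermark)
            return (int)i;
    }
    return -1;  /* caller would fall into page reclaim */
}
```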
8. Memory Deallocation
- The page is returned to the buddy system.
  - An order-0 page is returned to the per-cpu cache via free_hot_cold_page().
    - Cold page: a page estimated not to be in the CPU cache
      - It is linked to the tail of the LRU list of the per-cpu cache.
    - Hot page: a page estimated to be in the CPU cache
      - It is linked to the head of the LRU list of the per-cpu cache.
  - A higher-order page is returned directly to the per-order lists of pages.
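The head/tail placement can be sketched with a small list. The types and function names here are hypothetical, and allocating from the head (so that the most recently freed hot page is reused first) is an assumption of this sketch, not something stated on the slide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page_sketch {
    unsigned long pfn;
    struct page_sketch *next;
};

struct pcp_list {
    struct page_sketch *head;
    struct page_sketch *tail;
    unsigned int count;
};

/* Free an order-0 page: hot pages go to the head, cold pages to the tail. */
void pcp_free_page(struct pcp_list *pcp, struct page_sketch *page, bool cold)
{
    assert(pcp != NULL && page != NULL);
    page->next = NULL;
    if (pcp->head == NULL) {
        pcp->head = pcp->tail = page;
    } else if (cold) {
        pcp->tail->next = page;
        pcp->tail = page;
    } else {
        page->next = pcp->head;
        pcp->head = page;
    }
    pcp->count++;
}

/* Allocate from the head (assumption: prefer cache-hot pages). */
struct page_sketch *pcp_alloc_page(struct pcp_list *pcp)
{
    struct page_sketch *page;

    assert(pcp != NULL);
    page = pcp->head;
    if (page == NULL)
        return NULL;
    pcp->head = page->next;
    if (pcp->head == NULL)
        pcp->tail = NULL;
    page->next = NULL;
    pcp->count--;
    return page;
}
```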
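9. Buddy System
[Figure: the per-cpu cache holds 4k (order-0) pages, ordered from HOT to COLD, and serves order-0 allocation and deallocation. The per-zone buddy system keeps one free list per order: order0 (4k), order1 (8k), ..., order10 (4m).]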
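The per-order layout relies on a simple piece of arithmetic: a free block of a given order and its buddy differ only in one bit of the page frame number. The sketch below shows that arithmetic in user space; the function names and MAX_ORDER_SKETCH are hypothetical, and the 4k page size in the usage is taken from the figure.

```c
#include <assert.h>

#define MAX_ORDER_SKETCH 11  /* orders 0..10, as in the figure */

/* PFN of the buddy of the block that starts at pfn. */
unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
{
    assert(order < MAX_ORDER_SKETCH);
    /* A block of 2^order pages must be aligned to its own size. */
    assert((pfn & ((1UL << order) - 1)) == 0);
    return pfn ^ (1UL << order);
}

/* Start PFN of the order+1 block formed when a block merges with its buddy. */
unsigned long merged_pfn(unsigned long pfn, unsigned int order)
{
    assert(order < MAX_ORDER_SKETCH);
    return pfn & ~(1UL << order);
}

/* Size in bytes of a block of the given order. */
unsigned long block_bytes(unsigned int order, unsigned long page_size)
{
    assert(order < MAX_ORDER_SKETCH);
    return page_size << order;
}
```

With 4k pages, block_bytes(10, 4096) gives the 4m block size shown for order10.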
11. Overview of the page allocation flow
- __alloc_pages_nodemask (the basic page allocation function)
  - get_page_from_freelist (1st: local zones, low wmark) → get_page_from_freelist (2nd: all zones)
  - __alloc_pages_slowpath
    1. wake_all_kswapds (wake up the kswapd threads)
    2. get_page_from_freelist (3rd: all zones, min wmark)
    3. if {__GFP,PF}_MEMALLOC → __alloc_pages_high_priority
    4. __alloc_pages_direct_compact (asynchronous)
    5. __alloc_pages_direct_reclaim (reclaims pages directly in the current context)
    6. if not did_some_progress → __alloc_pages_may_oom
    7. Retry (go to 2.), or __alloc_pages_direct_compact (synchronous)
15. do_try_to_free_pages()
- Core function for page reclaim, called from 3 different paths:
  - try_to_free_pages() → Global reclaim path via __alloc_pages_nodemask()
  - try_to_free_mem_cgroup_pages() → Per-memcg reclaim path
    - Right before per-memcg slab allocation
    - Right before per-memcg file page allocation
    - Right before per-memcg anon page allocation
    - Right before per-memcg swapin allocation
  - shrink_all_memory() → Hibernation path
- Arguments: (1) struct zonelist *zonelist (2) struct scan_control *sc
16. struct scan_control
struct scan_control {
    unsigned long nr_scanned;
    unsigned long nr_reclaimed;
    unsigned long nr_to_reclaim;
    ...
    int swappiness; // 0..100
    ...
    struct mem_cgroup *target_mem_cgroup;
    ...
    nodemask_t *nodemask;
};
20. shrink_list()
- Calls shrink_{active or inactive}_list. However, an active list is shrunk only when it is larger than its paired inactive list.
  1. if an ACTIVE list is specified:
    - if size of lru(ACTIVE) > size of lru(INACTIVE):
      - shrink_active_list
  2. else:
    - shrink_inactive_list
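The decision above can be written out as a small function. This sketch follows the slide's simplified size comparison only; the enum and function names are hypothetical, and the actual shrinking is replaced by a returned action code.

```c
#include <assert.h>

enum lru_kind_sketch { LRU_INACTIVE, LRU_ACTIVE };
enum shrink_action { SHRINK_NOTHING, SHRINK_ACTIVE, SHRINK_INACTIVE };

/* Decide which list to shrink for the requested LRU kind. */
enum shrink_action shrink_list_decision(enum lru_kind_sketch lru,
                                        unsigned long nr_active,
                                        unsigned long nr_inactive)
{
    assert(lru == LRU_INACTIVE || lru == LRU_ACTIVE);

    if (lru == LRU_ACTIVE) {
        /* The active list is touched only when it outgrows its pair. */
        if (nr_active > nr_inactive)
            return SHRINK_ACTIVE;
        return SHRINK_NOTHING;
    }
    return SHRINK_INACTIVE;
}
```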
21. shrink_{active,inactive}_list
- shrink_active_list()
  1. Traverse the pages in an active list.
  2. Find inactive pages in the list and move them to an inactive list.
- shrink_inactive_list()
  - foreach page:
    1. page_mapped(page) => try_to_unmap(page)
    2. if PageDirty(page) => pageout(page)
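The per-page loop of shrink_inactive_list() can be sketched as follows. This is a toy model: the structures and function name are hypothetical, and try_to_unmap() and pageout() are replaced by clearing a flag and bumping a counter, with no failure handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page state: just the two properties the loop checks. */
struct lru_page_sketch {
    bool mapped;
    bool dirty;
};

struct reclaim_stats {
    unsigned int unmapped;
    unsigned int paged_out;
};

/* Walk an inactive list, unmapping mapped pages and writing out dirty ones. */
void shrink_inactive_sketch(struct lru_page_sketch *pages, unsigned int nr,
                            struct reclaim_stats *stats)
{
    assert(pages != NULL && stats != NULL);
    for (unsigned int i = 0; i < nr; i++) {
        if (pages[i].mapped) {
            pages[i].mapped = false;  /* stands in for try_to_unmap(page) */
            stats->unmapped++;
        }
        if (pages[i].dirty) {
            pages[i].dirty = false;   /* stands in for pageout(page) */
            stats->paged_out++;
        }
    }
}
```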
23. try_to_unmap()
- Unmaps a specified page from all corresponding mappings
  1. Set up struct rmap_walk_control.
  2. rmap_walk_{file, anon, or ksm}
    - An rmap walk iterates over the VMAs that map the page and unmaps the page from each of them.
      A. file: traverse the address_space::i_mmap tree
      B. anon: traverse the anon_vma tree
      C. ksm: traverse all merged anon_vma trees
        - each operation is similar to that for anon
28. kswapd
- Processing overview
  1. Wake up
  2. balance_pgdat()
  3. Sleep
- balance_pgdat()
  - Works until all zones of the pgdat are at or above the high watermark.
  - Reclaim function: kswapd_shrink_zone()
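The balance_pgdat() loop can be sketched in user space. This is a simplified model, not the kernel's logic: struct kzone_sketch, the reclaimable counter, the batch size, and the "give up when no progress is made" exit are all assumptions introduced for the sketch, and shrink_zone_sketch() merely stands in for kswapd_shrink_zone().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical zone model for this sketch. */
struct kzone_sketch {
    unsigned long free_pages;
    unsigned long high_wmark;
    unsigned long reclaimable;  /* pages that could still be freed */
};

/* Stand-in for kswapd_shrink_zone(): frees up to batch pages. */
static unsigned long shrink_zone_sketch(struct kzone_sketch *z,
                                        unsigned long batch)
{
    unsigned long n = z->reclaimable < batch ? z->reclaimable : batch;

    z->reclaimable -= n;
    z->free_pages += n;
    return n;
}

/* Reclaim until every zone is at or above its high watermark.
 * Returns false if some zone cannot be balanced. */
bool balance_pgdat_sketch(struct kzone_sketch *zones, size_t nr_zones,
                          unsigned long batch)
{
    assert(zones != NULL && batch > 0);
    for (;;) {
        bool balanced = true;
        unsigned long progress = 0;

        for (size_t i = 0; i < nr_zones; i++) {
            if (zones[i].free_pages >= zones[i].high_wmark)
                continue;
            balanced = false;
            progress += shrink_zone_sketch(&zones[i], batch);
        }
        if (balanced)
            return true;   /* kswapd can go back to sleep */
        if (progress == 0)
            return false;  /* nothing left to reclaim */
    }
}
```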