Linux Kernel
Page Reclaim
耳弥囘@siburu
2014/7/27 (Sun)
1. Recap of the Previous Session
What's a Page Frame
• page frame = a page-sized, page-aligned piece of RAM
• struct page = a one-to-one structure in the kernel for each page frame
• mem_map
  • A single array of struct page that covers all the RAM the kernel manages.
• In a CONFIG_SPARSEMEM environment, however:
  • There is no single mem_map.
  • Instead, there is a list of 2MB-sized arrays of struct page.
  • You must use __pfn_to_page(), __page_to_pfn(), or their wrappers.
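The difference between a single mem_map and a sectioned layout can be sketched in user space as below. This is only a model under stated assumptions: `struct page_model`, `SECTION_SHIFT`, `sparse_add_section()` and the other names are hypothetical, the section size is a small demo value, and the kernel's own interfaces remain `__pfn_to_page()` and `__page_to_pfn()`.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct page; it remembers its section number. */
struct page_model { unsigned long section; };

#define NR_FRAMES          64UL
#define SECTION_SHIFT      4UL                      /* demo value: 16 frames per section */
#define FRAMES_PER_SECTION (1UL << SECTION_SHIFT)
#define NR_SECTIONS        (NR_FRAMES >> SECTION_SHIFT)

/* Flat model: one array covers every page frame. */
static struct page_model flat_mem_map[NR_FRAMES];

struct page_model *flat_pfn_to_page(unsigned long pfn)
{
    return &flat_mem_map[pfn];
}

unsigned long flat_page_to_pfn(const struct page_model *page)
{
    return (unsigned long)(page - flat_mem_map);
}

/* Sparse model: each present section owns a separately allocated array. */
static struct page_model *section_map[NR_SECTIONS];

int sparse_add_section(unsigned long section)
{
    struct page_model *map;

    if (section >= NR_SECTIONS)
        return -1;
    if (section_map[section])
        return 0;                                   /* already present */
    map = calloc(FRAMES_PER_SECTION, sizeof(*map));
    if (!map)
        return -1;
    for (unsigned long i = 0; i < FRAMES_PER_SECTION; i++)
        map[i].section = section;
    section_map[section] = map;
    return 0;
}

/* Returns NULL when the frame lies in a section that is not present. */
struct page_model *sparse_pfn_to_page(unsigned long pfn)
{
    unsigned long section = pfn >> SECTION_SHIFT;

    if (section >= NR_SECTIONS || !section_map[section])
        return NULL;
    return &section_map[section][pfn & (FRAMES_PER_SECTION - 1)];
}

unsigned long sparse_page_to_pfn(const struct page_model *page)
{
    unsigned long section = page->section;

    return (section << SECTION_SHIFT) +
           (unsigned long)(page - section_map[section]);
}
```

In the sparse model a pointer difference alone cannot recover the frame number, because the arrays are not contiguous; the model therefore stores the section number in each entry.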
What's NUMA
• NUMA (Non-Uniform Memory Access)
  • The system is composed of nodes.
  • Each node is defined by a set of CPUs and one physical memory range.
  • Memory access latency depends on the source and destination nodes.
• NUMA configuration
  • ACPI provides the NUMA configuration:
    • SRAT (System Resource Affinity Table)
      • Tells which CPUs and which memory ranges belong to which NUMA node.
    • SLIT (System Locality Distance Information Table)
      • Tells how far one NUMA node is from another.
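To make the role of the SLIT concrete, here is a minimal sketch of a distance lookup. The three-node table and the function names are hypothetical; the only convention taken from ACPI is that a node's distance to itself is 10 and larger values mean "farther".

```c
#define NR_NODES       3
#define LOCAL_DISTANCE 10   /* SLIT convention: distance from a node to itself */

/* Hypothetical distance table for a three-node machine. */
static const unsigned char slit_model[NR_NODES][NR_NODES] = {
    { LOCAL_DISTANCE, 20, 30 },
    { 20, LOCAL_DISTANCE, 20 },
    { 30, 20, LOCAL_DISTANCE },
};

int node_distance_model(int from, int to)
{
    return slit_model[from][to];
}

/* Returns the closest node other than `from`, or -1 if there is none.
 * On a tie, the lowest-numbered node wins. */
int nearest_remote_node(int from)
{
    int best = -1;

    for (int n = 0; n < NR_NODES; n++) {
        if (n == from)
            continue;
        if (best < 0 || slit_model[from][n] < slit_model[from][best])
            best = n;
    }
    return best;
}
```

An allocator that cannot satisfy a request on the local node could use such a lookup to choose the next node to try.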
What's a Memory Zone
• Physical memory is divided by address range:
  • ZONE_DMA: <16MB
  • ZONE_DMA32: <4GB
  • ZONE_NORMAL: the rest
  • ZONE_MOVABLE: none by default
    • Used to define a hot-removable physical memory range.
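The address boundaries listed above can be written as a small classifier. This is a sketch that assumes exactly the boundaries on the slide (16MB and 4GB); the enum and function names are hypothetical, and ZONE_MOVABLE is left out because it is not defined by a fixed address boundary.

```c
#include <stdint.h>

enum zone_model { ZONE_DMA_M, ZONE_DMA32_M, ZONE_NORMAL_M };

/* Map a physical address to the zone whose range contains it. */
enum zone_model zone_for_phys_addr(uint64_t paddr)
{
    if (paddr < (UINT64_C(16) << 20))   /* below 16MB */
        return ZONE_DMA_M;
    if (paddr < (UINT64_C(4) << 30))    /* below 4GB */
        return ZONE_DMA32_M;
    return ZONE_NORMAL_M;               /* the rest */
}
```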
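Memory node, zone
[Figure: two NUMA nodes (NUMA node1, NUMA node2). CPU1 to CPU4 are divided between the nodes, and the physical address space is divided into Range1 and Range2, one range per node. Each node is described by its own struct pglist_data.]
struct pglist_data {
    struct zone node_zones[MAX_NR_ZONES];
    ...
};
• Every pglist_data has a zone structure for each ZONE (DMA through MOVABLE), although some of them may be empty.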
Memory Allocation
1. First, the allocator checks the thresholds of each zone
   (threshold = watermark and dirty ratio).
   • If every zone fails the check, the kernel enters the page reclaim path (= today's topic).
2. If some zone passes, a page is allocated from that zone's buddy system.
   • An order-0 page is allocated from the per-CPU cache.
   • A higher-order page is taken from the per-order free lists.
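The threshold check in step 1 can be sketched as a walk over a zone list. This is a simplified model, not the kernel's check: it looks only at a single watermark per zone, ignores the dirty ratio, and all names (`struct zone_model`, `pick_zone()`) are hypothetical.

```c
#include <stddef.h>

struct zone_model {
    const char *name;
    unsigned long free_pages;
    unsigned long watermark;
};

/* Returns the first zone that would still be above its watermark after
 * giving out 2^order pages, or NULL when every zone fails the check
 * (the case in which the caller would fall into page reclaim). */
struct zone_model *pick_zone(struct zone_model *zonelist, size_t nr_zones,
                             unsigned int order)
{
    unsigned long want = 1UL << order;

    for (size_t i = 0; i < nr_zones; i++) {
        struct zone_model *z = &zonelist[i];

        if (z->free_pages >= want && z->free_pages - want > z->watermark)
            return z;
    }
    return NULL;
}
```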
Memory Deallocation
• The page is returned to the buddy system.
  • An order-0 page is returned to the per-CPU cache via free_hot_cold_page().
    • Cold page: a page estimated not to be in the CPU cache
      • It is linked to the tail of the per-CPU cache's LRU list.
    • Hot page: a page estimated to be in the CPU cache
      • It is linked to the head of the per-CPU cache's LRU list.
  • A higher-order page is returned directly to the per-order free lists.
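The hot/cold placement can be illustrated with a small list model. It is a sketch under assumptions: the list is singly linked for brevity, allocation always takes from the head (so hot pages are reused first), and the names `pcp_list`, `pcp_free_page()` and `pcp_alloc_page()` are hypothetical.

```c
#include <stddef.h>

struct pcp_page {
    int id;
    struct pcp_page *next;
};

struct pcp_list {
    struct pcp_page *head;   /* hot end */
    struct pcp_page *tail;   /* cold end */
};

/* Free an order-0 page: hot pages go to the head, cold pages to the tail. */
void pcp_free_page(struct pcp_list *pcp, struct pcp_page *page, int cold)
{
    page->next = NULL;
    if (!pcp->head) {
        pcp->head = pcp->tail = page;
    } else if (cold) {
        pcp->tail->next = page;
        pcp->tail = page;
    } else {
        page->next = pcp->head;
        pcp->head = page;
    }
}

/* Allocate from the hot end; returns NULL when the cache is empty. */
struct pcp_page *pcp_alloc_page(struct pcp_list *pcp)
{
    struct pcp_page *page = pcp->head;

    if (!page)
        return NULL;
    pcp->head = page->next;
    if (!pcp->head)
        pcp->tail = NULL;
    page->next = NULL;
    return page;
}
```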
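Buddy System
[Figure: the per-zone buddy system keeps one free list per order: order 0 (4k blocks), order 1 (8k blocks), and so on up to order 10 (4m blocks). Order-0 allocation and deallocation go through the per-CPU cache, a list of 4k pages with a HOT end and a COLD end.]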
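The arithmetic behind the per-order lists is compact enough to show directly. The helper names below are hypothetical; the technique itself (a block's buddy differs from it only in bit `order` of the page frame number) is the standard buddy-allocator calculation.

```c
/* Page frame number of the buddy of the 2^order block starting at pfn. */
unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);
}

/* Start of the merged 2^(order+1) block formed by a block and its buddy. */
unsigned long merged_pfn(unsigned long pfn, unsigned int order)
{
    return pfn & ~(1UL << order);
}

/* Size in bytes of one block of the given order. */
unsigned long block_bytes(unsigned int order, unsigned long page_size)
{
    return page_size << order;
}
```

With 4k pages, `block_bytes(10, 4096)` gives 4MB, which matches the largest block in the figure.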
2. Page Reclaim
2.1 Direct reclaim
2.2 Daemon reclaim
Overview of the Page Allocation Flow
• __alloc_pages_nodemask (the basic page allocation function)
  • get_page_from_freelist (1st: local zones, low wmark) → get_page_from_freelist (2nd: all zones)
  • __alloc_pages_slowpath
    1. wake_all_kswapds (wake up the kswapd threads)
    2. get_page_from_freelist (3rd: all zones, min wmark)
    3. if {__GFP,PF}_MEMALLOC → __alloc_pages_high_priority
    4. __alloc_pages_direct_compact (asynchronous)
    5. __alloc_pages_direct_reclaim (direct page reclaim in the current context)
    6. if not did_some_progress → __alloc_pages_may_oom
    7. Retry (go to 2.) or __alloc_pages_direct_compact (synchronous)
2.1 Direct Reclaim
(reclaim performed by the task that requested the page allocation)
__alloc_pages_direct_reclaim()
• __perform_reclaim
  • current->flags |= PF_MEMALLOC
    • So that the emergency reserves can be used if a page allocation becomes necessary in the course of page reclaim.
  • try_to_free_pages
    • throttle_direct_reclaim
      • if !pfmemalloc_watermark_ok → wait until kswapd makes it ok
    • do_try_to_free_pages
  • current->flags &= ~PF_MEMALLOC
• get_page_from_freelist
• drain_all_pages
• get_page_from_freelist
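The way PF_MEMALLOC brackets the reclaim call can be sketched as follows. Everything here is a user-space model: `struct task_model`, the flag value, and the `demo_reclaim()` callback are hypothetical and stand in for `current`, the real flag, and `try_to_free_pages()`.

```c
#define PF_MEMALLOC_MODEL 0x0800u   /* demo bit; stands in for PF_MEMALLOC */

struct task_model { unsigned int flags; };

/* Set the flag for the duration of reclaim so that nested allocations made
 * by the reclaim code may dip into the reserves, then clear it again. */
unsigned long perform_reclaim_model(struct task_model *task,
        unsigned long (*reclaim)(const struct task_model *))
{
    unsigned long progress;

    task->flags |= PF_MEMALLOC_MODEL;
    progress = reclaim(task);
    task->flags &= ~PF_MEMALLOC_MODEL;
    return progress;
}

/* Demo callback: reports progress only if it runs with the flag set. */
unsigned long demo_reclaim(const struct task_model *task)
{
    return (task->flags & PF_MEMALLOC_MODEL) ? 32 : 0;
}
```

The sketch assumes the caller never enters with the flag already set; a task that already has it set would not start a nested direct reclaim in the first place.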
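pfmemalloc_watermark_ok()
• ARGS
  • pgdat (type: struct pglist_data)
• RETURN
  • type: bool
  • node's free_pages > 0.5 * node's min_wmark
• DESC
  • Works per node (not per zone): compares the amount of free pages with half of the min watermark and returns OK if it is above.
  • If it is below, returns false and also wakes up the kswapd of that node.
  • On a node that is short of memory, direct reclaim is given up and the work is left to kswapd; this is the function that makes that decision.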
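The check described above can be modelled in a few lines. This is a sketch that assumes the node-wide figures are simple sums over the node's zones; `struct node_model`, `kswapd_woken` and the function name are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct zone_model {
    unsigned long free_pages;
    unsigned long min_wmark;
};

struct node_model {
    struct zone_model zones[3];
    size_t nr_zones;
    bool kswapd_woken;          /* stands in for waking the node's kswapd */
};

/* True if the node's free pages exceed half of its summed min watermarks.
 * On failure the node's kswapd is (notionally) woken up. */
bool pfmemalloc_watermark_ok_model(struct node_model *node)
{
    unsigned long free_pages = 0, reserve = 0;
    bool ok;

    for (size_t i = 0; i < node->nr_zones; i++) {
        free_pages += node->zones[i].free_pages;
        reserve += node->zones[i].min_wmark;
    }
    ok = free_pages > reserve / 2;
    if (!ok)
        node->kswapd_woken = true;
    return ok;
}
```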
do_try_to_free_pages()
• Core function for page reclaim, called from three different paths:
  • try_to_free_pages() → global reclaim path via __alloc_pages_nodemask()
  • try_to_free_mem_cgroup_pages() → per-memcg reclaim path
    • Right before per-memcg slab allocation
    • Right before per-memcg file page allocation
    • Right before per-memcg anon page allocation
    • Right before per-memcg swap-in allocation
  • shrink_all_memory() → hibernation path
• Arguments: (1) struct zonelist *zonelist (2) struct scan_control *sc
struct scan_control
struct scan_control {
    unsigned long nr_scanned;
    unsigned long nr_reclaimed;
    unsigned long nr_to_reclaim;
    ...
    int swappiness; // 0..100
    ...
    struct mem_cgroup *target_mem_cgroup;
    ...
    nodemask_t *nodemask;
};
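Processing in do_try_to_free_pages
• A loop over the following two steps:
  • shrink_zones()
    • Described later.
  • wakeup_flusher_threads()
    • Called whenever shrink_zones has scanned at least 1.5 times the reclaim target (scan_control::nr_to_reclaim).
    • Asks all block devices (bdi) to write back at most as many pages as were scanned.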
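The 1.5x trigger can be written out as below. This is a sketch: the struct is a cut-down, hypothetical version of scan_control, and the strict comparison at the boundary is an assumption of the sketch rather than something the slide states.

```c
#include <stdbool.h>

/* Cut-down, hypothetical version of struct scan_control. */
struct scan_control_model {
    unsigned long nr_scanned;
    unsigned long nr_to_reclaim;
};

/* True once the pages scanned so far exceed 1.5 times the reclaim target. */
bool should_wake_flushers(const struct scan_control_model *sc)
{
    unsigned long threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;

    return sc->nr_scanned > threshold;
}

/* Upper bound on the writeback request: as many pages as were scanned. */
unsigned long writeback_request(const struct scan_control_model *sc)
{
    return sc->nr_scanned;
}
```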
shrink_zones()
1. for_each_zone_zonelist_nodemask:
   1. mem_cgroup_soft_limit_reclaim
      • while mem_cgroup_largest_soft_limit_node:
        • mem_cgroup_soft_reclaim
      • Before moving on to shrink_zone, this finishes page reclaim for the memcgs that use this zone and exceed their limit.
   2. shrink_zone
      • foreach mem_cgroup_iter:
        • shrink_lruvec
      • In the global reclaim case, this iteration walks the memcgs starting from the root memcg.
2. shrink_slab
   • Slab will be covered in a later session.
shrink_lruvec()
• Per-zone page freer
1. get_scan_count
   • Decides the target number of pages to reclaim.
2. while the target has not been reached:
   • shrink_list(LRU_INACTIVE_ANON)
   • shrink_list(LRU_ACTIVE_ANON)
   • shrink_list(LRU_INACTIVE_FILE)
   • shrink_list(LRU_ACTIVE_FILE)
3. if inactive anonymous memory alone is insufficient:
   • shrink_active_list
shrink_list()
• Calls shrink_{active or inactive}_list; however, an active list is shrunk only when it is larger than its paired inactive list.
1. if an active list is specified:
   • if size of lru(ACTIVE) > size of lru(INACTIVE):
     • shrink_active_list
2. else:
   • shrink_inactive_list
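The decision made by shrink_list() can be sketched as a pure function. This model follows the slide's rule (shrink an active list only when it is larger than its paired inactive list); the enum values, the array of list sizes, and the function names are hypothetical.

```c
/* Each active list directly follows its paired inactive list. */
enum lru_model { INACTIVE_ANON, ACTIVE_ANON, INACTIVE_FILE, ACTIVE_FILE, NR_LRU };
enum shrink_action { SHRINK_NOTHING, SHRINK_ACTIVE, SHRINK_INACTIVE };

int is_active_lru_model(enum lru_model lru)
{
    return lru == ACTIVE_ANON || lru == ACTIVE_FILE;
}

/* Decide what to do for one list, given the current size of every list. */
enum shrink_action shrink_list_model(enum lru_model lru,
                                     const unsigned long size[NR_LRU])
{
    if (is_active_lru_model(lru)) {
        enum lru_model inactive = (enum lru_model)(lru - 1);

        return size[lru] > size[inactive] ? SHRINK_ACTIVE : SHRINK_NOTHING;
    }
    return SHRINK_INACTIVE;
}
```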
shrink_{active,inactive}_list
• shrink_active_list()
  1. Traverse the pages in an active list.
  2. Find the inactive pages in the list and move them to an inactive list.
• shrink_inactive_list()
  • foreach page:
    1. if page_mapped(page) => try_to_unmap(page)
    2. if PageDirty(page) => pageout(page)
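What Is an Inactive Page?
• When laptop_mode is off
  • The specified number of pages are simply taken from the tail of the active LRU list as inactive pages.
• When laptop_mode is on
  • The specified number of clean pages are taken from the tail of the active LRU list as inactive pages.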
try_to_unmap()
• Unmaps a specified page from all of its mappings
  1. Set up struct rmap_walk_control.
  2. rmap_walk_{file, anon, or ksm}
     • An rmap walk iterates over the VMAs and unmaps the page from each of them.
       A. file: traverse the address_space::i_mmap tree
       B. anon: traverse the anon_vma tree
       C. ksm: traverse all merged anon_vma trees
          • Each operation is similar to the anon case.
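The shape of an rmap walk, a control structure carrying a per-VMA callback that is applied to every VMA reachable from the page, can be sketched like this. It is a model only: the tree is flattened into an array, and all `_model` names and fields are hypothetical simplifications of the kernel structures.

```c
#include <stdbool.h>
#include <stddef.h>

struct vma_model { bool maps_page; };   /* true if this VMA maps the page */

/* Stands in for the address_space or anon_vma that leads to the VMAs. */
struct mapping_model {
    struct vma_model *vmas;
    size_t nr_vmas;
};

struct rmap_walk_control_model {
    /* Called once per VMA; returning false stops the walk early. */
    bool (*rmap_one)(struct vma_model *vma, void *arg);
    void *arg;
};

void rmap_walk_model(struct mapping_model *m,
                     struct rmap_walk_control_model *rwc)
{
    for (size_t i = 0; i < m->nr_vmas; i++)
        if (!rwc->rmap_one(&m->vmas[i], rwc->arg))
            break;
}

/* Per-VMA callback: drop the mapping if present and count it. */
bool unmap_one_model(struct vma_model *vma, void *arg)
{
    size_t *unmapped = arg;

    if (vma->maps_page) {
        vma->maps_page = false;
        (*unmapped)++;
    }
    return true;
}

/* Step 1: set up the control structure. Step 2: walk. */
size_t try_to_unmap_model(struct mapping_model *m)
{
    size_t unmapped = 0;
    struct rmap_walk_control_model rwc = {
        .rmap_one = unmap_one_model,
        .arg = &unmapped,
    };

    rmap_walk_model(m, &rwc);
    return unmapped;
}
```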
A. rmap_walk_file
[Figure: page → address_space (inode) → i_mmap (type: rb_root) → the vmas in the tree → each vma's page table (pgtbl), from which the page is unmapped.]
B. rmap_walk_anon
[Figure: page → anon_vma → rb_root (type: rb_root) → the vmas in the tree → each vma's page table (pgtbl), from which the page is unmapped.]
C. rmap_walk_ksm
[Figure: page → stable_node → hlist → several anon_vma entries → the vmas of each anon_vma → each vma's page table (pgtbl).]
2.2 Daemon Reclaim
(reclaim performed by kswapd on behalf of the allocating tasks)
kswapd
• Processing overview
  1. Wake up
  2. balance_pgdat()
  3. Sleep
• balance_pgdat()
  • Works until all zones of the pgdat are at or above the high watermark.
  • Reclaim function: kswapd_shrink_zone()
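The "work until every zone is at or above its high watermark" loop can be sketched as follows. This is a toy model: each zone simply has a count of reclaimable pages, `shrink_zone_model()` is a hypothetical stand-in for kswapd_shrink_zone(), and the loop gives up when a pass makes no progress.

```c
#include <stdbool.h>
#include <stddef.h>

struct zone_model {
    unsigned long free_pages;
    unsigned long high_wmark;
    unsigned long reclaimable;   /* pages that reclaim could still free */
};

bool pgdat_balanced_model(const struct zone_model *zones, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (zones[i].free_pages < zones[i].high_wmark)
            return false;
    return true;
}

/* Free up to `batch` pages from one zone; returns how many were freed. */
unsigned long shrink_zone_model(struct zone_model *z, unsigned long batch)
{
    unsigned long got = z->reclaimable < batch ? z->reclaimable : batch;

    z->reclaimable -= got;
    z->free_pages += got;
    return got;
}

/* Returns true once all zones are balanced, false if reclaim ran dry. */
bool balance_pgdat_model(struct zone_model *zones, size_t n,
                         unsigned long batch)
{
    while (!pgdat_balanced_model(zones, n)) {
        unsigned long progress = 0;

        for (size_t i = 0; i < n; i++)
            if (zones[i].free_pages < zones[i].high_wmark)
                progress += shrink_zone_model(&zones[i], batch);
        if (progress == 0)
            return false;
    }
    return true;
}
```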
