Hybrid TLB Coalescing: Improving TLB Translation Coverage under Diverse Fragmented Memory Allocations
Chang Hyun Park, Taekyung Heo, Jungi Jeong, and Jaehyuk Huh

Introduction

Past Proposals: Direct Segments
• Segment-based translation [1]
  • Three values (base, limit, offset) represent a contiguous translation of any size
  • Fully associative lookup for multiple segments (limits the size of the TLB)
• Redundant Memory Mappings (RMM) [6]: 32-entry fully associative TLB
Efficient with a small number of big memory chunks
[Figure: a direct segment between virtual pages and physical pages, labeled Base, Limit, and Offset]
[1] Basu et al., ISCA '13
[6] Karakostas et al., ISCA '15

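The base/limit/offset triple on this slide can be read as a range check followed by an addition. The following is a minimal sketch, assuming page-granular values; `Segment` and `translate` are illustrative names and are not taken from [1] or [6].

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Segment:
    base: int    # first virtual page number covered by the segment
    limit: int   # one past the last virtual page number covered
    offset: int  # physical page number minus virtual page number


def translate(vpn: int, segments: Sequence[Segment]) -> Optional[int]:
    """Return the physical page for vpn, or None if no segment covers it."""
    # Software stand-in for a fully associative lookup: every segment is
    # compared against the virtual page number.
    for seg in segments:
        if seg.base <= vpn < seg.limit:
            return vpn + seg.offset
    return None


segments = [Segment(base=0x100, limit=0x500, offset=0x1000)]
print(translate(0x123, segments))  # 4387 (0x1123)
print(translate(0x600, segments))  # None
```

Because every segment must be compared on each lookup, the cost grows with the number of segments, which is consistent with the slide's remark that fully associative lookup limits TLB size.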
Past Proposals: Summary
• Large pages
  • Affinity for large pages (2MB)
• Cluster TLB
  • Affinity for clustered mappings of up to 8 pages
• Segment translations
  • Affinity for a small number of large chunks (32-entry TLB)
Prior proposals efficiently support specific memory mapping scenarios

Large Contiguity vs. Memory Non-Uniformity
• Conflicting goals of NUMA systems and large pages [2]
  • Memory traffic balance vs. efficient address translation
• Heterogeneous memory worsens non-uniformity [3][4]
Different systems have different memory mapping needs
[Figure: regular pages and a large page placed across eight nodes; hot pages and cold pages]
[2] Baptiste et al., ATC '14
[3] Lee et al., ISCA '15
[4] Agarwal et al., ASPLOS '17

Need for an All-Rounder Solution
• Contiguity distribution varies among workloads
• Also varies within the same workload [7]
[Figure: CDF of process memory, with regions annotated "Well suited for Large pages", "Well suited for Cluster", and "Well suited for ??"]
Can we make a TLB scheme that works well for diverse scenarios?
[7] Kwon et al., OSDI '16

Hybrid TLB Coalescing
We propose a TLB with adjustable coverage
• HW-SW joint effort
• HW offers adjustable TLB coverage
  • Number of TLB entries fixed
  • Coverage of each entry adjustable
• OS decides best TLB coverage
  • Adjusts TLB coverage per process
• OS identifies contiguous chunks
  • Marks them in the process page table
[Figure: Hardware (TLB), Operating System, Page Table]

Anchor
• Anchors are special entries in the page table
• Placed at every alignment of the anchor distance
• Anchor distance is a power of 2 (for encoding efficiency)
• Anchor distance is configurable by the OS
[Figure: page table with anchors shown for anchor distance = 4 and anchor distance = 8]

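Since the anchor distance is a power of 2 and anchors sit at every alignment of that distance, the anchor for a given virtual page can be found with a bit mask. The sketch below shows one way to compute it; the function names are hypothetical, and the masking approach is an assumption about how the alignment would be evaluated, not a detail stated on the slide.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def anchor_vpn(vpn: int, distance: int) -> int:
    """Virtual page number of the anchor whose aligned block contains vpn."""
    if not is_power_of_two(distance):
        raise ValueError("anchor distance must be a power of 2")
    # Clearing the low log2(distance) bits rounds vpn down to the alignment.
    return vpn & ~(distance - 1)


def anchor_offset(vpn: int, distance: int) -> int:
    """Distance in pages from the anchor to vpn."""
    if not is_power_of_two(distance):
        raise ValueError("anchor distance must be a power of 2")
    return vpn & (distance - 1)


print(anchor_vpn(13, 4), anchor_offset(13, 4))  # 12 1
print(anchor_vpn(13, 8), anchor_offset(13, 8))  # 8 5
```

With distance 4, page 13 belongs to the anchor at page 12; with distance 8, the same page belongs to the anchor at page 8.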
Anchor Page Table
• Uses the page table
• An anchor covers up to distance (4) contiguous pages
• Each anchor represents contiguity that begins at the anchor
• OS marks contiguity onto the anchor page table
[Figure: virtual pages mapped to physical pages through anchor mappings and regular mappings; labels 2, 3, 4, 1, 4]

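The marking step can be sketched as a scan from each anchor that counts how many following pages stay physically contiguous. This is an illustrative model only: the page table is a plain `dict` from virtual to physical page number, `mark_anchor_contiguity` is a hypothetical name, and the cap at the anchor distance follows the slide's "up to distance" wording.

```python
from typing import Dict


def mark_anchor_contiguity(page_table: Dict[int, int], distance: int) -> Dict[int, int]:
    """Map each mapped anchor VPN to the length of the contiguous run starting there."""
    contiguity: Dict[int, int] = {}
    for vpn, pfn in page_table.items():
        if vpn % distance != 0:
            continue  # only anchor-aligned entries carry a contiguity value
        run = 1
        # Extend the run while the next virtual page maps to the next physical page.
        while run < distance and page_table.get(vpn + run) == pfn + run:
            run += 1
        contiguity[vpn] = run
    return contiguity


page_table = {0: 100, 1: 101, 2: 102, 3: 200, 4: 300, 5: 301, 6: 302, 7: 303}
print(mark_anchor_contiguity(page_table, 4))  # {0: 3, 4: 4}
```

In the example, the anchor at page 0 covers three pages because page 3 maps elsewhere, while the anchor at page 4 covers the full distance of four pages.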
Anchor TLB
• Integrated into the L2 TLB
  • L1 keeps regular entries
• Caches both regular and anchor page table entries
• Regular and anchor entries are indexed differently
[Figure: Anchor TLB (4 sets) holding anchor entries and regular entries, each shown as Tag | Contiguity; regular entries show X for contiguity]

Anchor TLB Lookup
• On an L1 TLB miss, the anchor TLB looks up
  • Regular TLB first
  • Anchor TLB next
• Offset (2) < anchor entry contiguity (3): HIT, return anchor PFN + offset
• MISS: start page walk
[Figure: Anchor TLB (4 sets) with anchor entries and regular entries, showing the lookup of a virtual page at offset 2 from its anchor]

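The lookup order on this slide can be sketched in a few lines. The model below is a simplification under stated assumptions: the regular and anchor entries are plain dictionaries rather than a set-associative structure, replacement is ignored, and `l2_lookup` is a hypothetical name.

```python
from typing import Dict, Optional, Tuple


def l2_lookup(
    vpn: int,
    regular: Dict[int, int],
    anchors: Dict[int, Tuple[int, int]],
    distance: int,
) -> Optional[int]:
    """Return the PFN on a hit, or None when a page walk would be needed.

    regular maps VPN -> PFN; anchors maps anchor VPN -> (anchor PFN, contiguity).
    distance must be a power of 2.
    """
    # Step 1: regular entry lookup.
    pfn = regular.get(vpn)
    if pfn is not None:
        return pfn
    # Step 2: anchor entry lookup, using the anchor that precedes vpn.
    anchor = vpn & ~(distance - 1)
    entry = anchors.get(anchor)
    if entry is not None:
        anchor_pfn, contiguity = entry
        offset = vpn - anchor
        if offset < contiguity:
            return anchor_pfn + offset  # HIT: anchor PFN + offset
    return None  # MISS: start page walk


regular = {3: 200}
anchors = {0: (100, 3)}
print(l2_lookup(2, regular, anchors, 4))  # 102, offset 2 < contiguity 3
print(l2_lookup(3, regular, anchors, 4))  # 200, regular entry hit
print(l2_lookup(5, regular, anchors, 4))  # None, no entry covers page 5
```

The first call mirrors the slide's example: the page lies at offset 2 from an anchor whose contiguity is 3, so the translation is the anchor PFN plus 2.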
Operating System Responsibilities
• OS periodically selects the process anchor distance
  • Heuristic algorithm to minimize TLB entry count
• OS adjusts the anchor distance
  • Anchor distance based on the selection algorithm
• OS marks mapping contiguity
  • Memory mapping contiguity in the anchor page table entry

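The slides do not spell out the selection heuristic, so the following is only a sketch of what "minimize TLB entry count" could look like under an assumed cost model: for each candidate distance, count one entry per anchor plus one regular entry for every page no anchor covers, then pick the cheapest distance. The function names, the candidate list, and the cost model are all assumptions.

```python
from typing import Dict, Iterable


def entries_needed(page_table: Dict[int, int], distance: int) -> int:
    """Assumed cost: anchor entries plus regular entries for uncovered pages."""
    anchors = 0
    covered = 0
    for vpn, pfn in page_table.items():
        if vpn % distance != 0:
            continue
        run = 1
        while run < distance and page_table.get(vpn + run) == pfn + run:
            run += 1
        anchors += 1
        covered += run  # anchor blocks are disjoint, so runs never overlap
    return anchors + (len(page_table) - covered)


def select_anchor_distance(
    page_table: Dict[int, int], candidates: Iterable[int] = (2, 4, 8, 16)
) -> int:
    """Pick the candidate distance with the lowest assumed entry count."""
    return min(candidates, key=lambda d: entries_needed(page_table, d))


# Four aligned chunks of four pages each, physically scattered.
chunked = {c * 4 + i: 100 + c * 400 + i for c in range(4) for i in range(4)}
print(select_anchor_distance(chunked))  # 4

# Sixteen fully contiguous pages.
contiguous = {v: 1000 + v for v in range(16)}
print(select_anchor_distance(contiguous))  # 16
```

Under this model a mapping made of four-page chunks is best served by distance 4, while a fully contiguous mapping prefers the largest candidate, which matches the idea of adjusting coverage per process.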
Simulation Methodology
• Trace-based TLB simulator (based on Intel Haswell)

TLB Configuration
Scheme                          Structure
Common L1                       4KB: 64 entry, 4 way
                                2MB: 32 entry, 4 way
Baseline L2 / THP               4KB/2MB: 1024 entry, 8 way
Cluster                         Regular (4KB/2MB): 768 entry, 6 way
                                Cluster-8: 320 entry, 5 way
RMM (Multiple segments)         Baseline L2 TLB + RMM: 32 entry, fully-assoc.
Anchor (Selected/Static Ideal)  4KB/2MB/anchor: 1024 entry, 8 way

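As a rough reading of these configurations, the upper bound on memory covered by a TLB is the entry count times the memory each entry can map. The arithmetic below is a sketch, not a result from the evaluation: it assumes every entry is used at its maximum coverage, and for the anchor case it assumes each entry covers a full anchor distance of 4KB pages (distance 8 is chosen only as an example).

```python
KIB = 1024
MIB = KIB * KIB
GIB = KIB * MIB


def max_reach_bytes(entries: int, page_bytes: int, pages_per_entry: int = 1) -> int:
    """Upper bound on bytes covered when every entry maps its maximum."""
    return entries * page_bytes * pages_per_entry


# 1024-entry L2 TLB filled with 4KB entries.
print(max_reach_bytes(1024, 4 * KIB) // MIB)  # 4 (MiB)
# Same TLB filled with 2MB entries.
print(max_reach_bytes(1024, 2 * MIB) // GIB)  # 2 (GiB)
# Same TLB filled with anchor entries at an assumed anchor distance of 8.
print(max_reach_bytes(1024, 4 * KIB, pages_per_entry=8) // MIB)  # 32 (MiB)
```

The entry count stays fixed at 1024 in all three cases; only the coverage per entry changes, which is the knob the anchor distance adjusts.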
Memory Mapping Scenarios
• Two classes of memory mapping scenarios
  • Two real system memory mappings
  • Four synthetic memory mappings

Name    Trace information
demand  Default Linux memory mapping
eager   'Eager' allocation
low     1 to 16 pages (4KB to 64KB)
medium  1 to 512 pages (4KB to 2MB)
high    512 to 64K pages (2MB to 256MB)
max     Maximum contiguity

Evaluation: TLB Misses of demand mapping
[Figure: relative TLB misses (%), axis 0 to 100, for THP, Cluster, RMM, Anchor Selected, and Anchor Ideal]
Anchor TLB adjusted to satisfy small contiguities

Evaluation: TLB Misses of medium mapping
[Figure: relative TLB misses (%), axis 0 to 100, for THP, Cluster, RMM, Anchor Selected, and Anchor Ideal]
Anchor adjusted coverage to provide best TLB reduction

Evaluation: TLB Misses of all mappings
[Figure: relative TLB misses (%), axis 0 to 100, for the demand, eager, low cont., med cont., high cont., and max cont. mappings; series: Baseline, THP, Cluster, RMM, Anchor Selected, Anchor Ideal]