.. SPDX-License-Identifier: GPL-2.0

=======
The TLB
=======

When the kernel unmaps or modifies the attributes of a range of
memory, it has two choices:

 1. Flush the entire TLB with a two-instruction sequence. This is
    a quick operation, but it causes collateral damage: TLB entries
    from areas other than the one we are trying to flush will be
    destroyed and must be refilled later, at some cost.
 2. Use the invlpg instruction to invalidate a single page at a
    time. This could potentially cost many more instructions, but
    it is a much more precise operation, causing no collateral
    damage to other TLB entries.

Which method to use depends on a few things:

 1. The size of the flush being performed. A flush of the entire
    address space is obviously better performed by flushing the
    entire TLB than doing 2^48/PAGE_SIZE individual flushes.
 2. The contents of the TLB. If the TLB is empty, then there will
    be no collateral damage caused by doing the global flush, and
    all of the individual flushes would have been wasted work.
 3. The size of the TLB. The larger the TLB, the more collateral
    damage we do with a full flush. So, the larger the TLB, the
    more attractive an individual flush looks. Data and
    instructions have separate TLBs, as do different page sizes.
 4. The microarchitecture. The TLB has become a multi-level
    cache on modern CPUs, and the global flushes have become more
    expensive relative to single-page flushes.

There is obviously no way the kernel can know all these things,
especially the contents of the TLB during a given flush. The
sizes of the flush will vary greatly depending on the workload as
well. There is essentially no "right" point to choose.

You may be doing too many individual invalidations if you see the
invlpg instruction (or instructions _near_ it) show up high in
profiles. If you believe that individual invalidations are being
called too often, you can lower the tunable::

	/sys/kernel/debug/x86/tlb_single_page_flush_ceiling

This will cause us to do the global flush for more cases.
Lowering it to 0 will disable the use of the individual flushes.
Setting it to 1 is a very conservative setting and it should
never need to be 0 under normal circumstances.

Despite the fact that a single individual flush on x86 is
guaranteed to flush a full 2MB [1]_, hugetlbfs always uses the full
flushes. THP is treated exactly the same as normal memory.

.. [1] A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
   says: "One execution of INVLPG is sufficient even for a page
   with size greater than 4 KBytes."
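Returning to the tunable above: for concreteness, here is one way to
inspect and lower the ceiling at run time. This is a sketch assuming
debugfs is mounted at /sys/kernel/debug and you are root; the value 16
is an arbitrary illustration, not a recommendation::

	# Read the current ceiling: ranges spanning more pages than this
	# are flushed with a full TLB flush instead of per-page invlpg.
	cat /sys/kernel/debug/x86/tlb_single_page_flush_ceiling

	# Lower the ceiling so that more flushes become full flushes:
	echo 16 > /sys/kernel/debug/x86/tlb_single_page_flush_ceiling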
You might see invlpg inside of flush_tlb_mm_range() show up in
profiles, or you can use the trace_tlb_flush() tracepoints to
determine how long the flush operations are taking.

Essentially, you are balancing the cycles you spend doing invlpg
with the cycles that you spend refilling the TLB later.

You can measure how expensive TLB refills are by using
performance counters and 'perf stat', like this::

	perf stat -e cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/ \
		  -e cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/ \
		  -e cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/ \
		  -e cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/ \
		  -e cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/ \
		  -e cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/

That works on an Ivy Bridge-era CPU (i5-3320M). Different CPUs
may have differently-named counters, but they should at least
be there in some form. You can use pmu-tools 'ocperf list'
(https://github.com/andikleen/pmu-tools) to find the right
counters for a given CPU.
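To watch the flushes themselves, the tlb_flush tracepoint can be
enabled through tracefs, or the same events can be sampled with perf.
A minimal sketch, assuming tracefs is mounted at /sys/kernel/tracing
(older kernels expose it under /sys/kernel/debug/tracing)::

	# Stream flush events as they happen:
	echo 1 > /sys/kernel/tracing/events/tlb/tlb_flush/enable
	cat /sys/kernel/tracing/trace_pipe

	# Or record the events system-wide with perf for ten seconds:
	perf record -e tlb:tlb_flush -a -- sleep 10
	perf report

Each event includes the reason for the flush and the number of pages
it covered, which makes it straightforward to see how often the
kernel is choosing full flushes over individual ones.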