.. SPDX-License-Identifier: GPL-2.0

=======
The TLB
=======

When the kernel unmaps or modifies the attributes of a range of
memory, it has two choices:

 1. Flush the entire TLB with a two-instruction sequence.  This is
    a quick operation, but it causes collateral damage: TLB entries
    from areas other than the one we are trying to flush will be
    destroyed and must be refilled later, at some cost.
 2. Use the invlpg instruction to invalidate a single page at a
    time.  This could potentially cost many more instructions, but
    it is a much more precise operation, causing no collateral
    damage to other TLB entries.

Which method to use depends on a few things:

 1. The size of the flush being performed.  A flush of the entire
    address space is obviously better performed by flushing the
    entire TLB than doing 2^48/PAGE_SIZE individual flushes.
 2. The contents of the TLB.  If the TLB is empty, then there will
    be no collateral damage caused by doing the global flush, and
    all of the individual flushes will have been wasted work.
 3. The size of the TLB.  The larger the TLB, the more collateral
    damage we do with a full flush.  So, the larger the TLB, the
    more attractive an individual flush looks.  Data and
    instructions have separate TLBs, as do different page sizes.
 4. The microarchitecture.  The TLB has become a multi-level
    cache on modern CPUs, and the global flushes have become more
    expensive relative to single-page flushes.

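The 2^48/PAGE_SIZE figure from point 1 above is easy to check in a
shell, assuming 4 KiB pages and the 48-bit virtual address space
mentioned there::

	echo $(( (1 << 48) / 4096 ))

That prints 68719476736 (2^36): far too many invalidations to do
one page at a time.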
There is obviously no way the kernel can know all these things,
especially the contents of the TLB during a given flush.  The
sizes of the flushes will vary greatly depending on the workload
as well.  There is essentially no "right" point to choose.

You may be doing too many individual invalidations if you see the
invlpg instruction (or instructions _near_ it) show up high in
profiles.  If you believe that individual invalidations are being
performed too often, you can lower the tunable::

	/sys/kernel/debug/x86/tlb_single_page_flush_ceiling

This will cause us to do the global flush for more cases.
Lowering it to 0 will disable the use of the individual flushes.
Setting it to 1 is very conservative, and it should never need
to be 0 under normal circumstances.

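As a sketch, the ceiling can be inspected and adjusted from a root
shell (the file only exists when debugfs is mounted, typically at
/sys/kernel/debug)::

	# Read the current ceiling: ranges larger than this many
	# pages get a full flush instead of individual invlpg's
	cat /sys/kernel/debug/x86/tlb_single_page_flush_ceiling

	# Lower it so that ranges larger than 8 pages use a full flush
	echo 8 > /sys/kernel/debug/x86/tlb_single_page_flush_ceiling

	# Setting it to 0 disables individual flushes entirely
	echo 0 > /sys/kernel/debug/x86/tlb_single_page_flush_ceiling
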
Despite the fact that a single individual flush on x86 is
guaranteed to flush a full 2MB [1]_, hugetlbfs always uses the full
flushes.  THP is treated exactly the same as normal memory.

You might see invlpg inside of flush_tlb_mm_range() show up in
profiles, or you can use the trace_tlb_flush() tracepoints to
determine how long the flush operations are taking.

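A minimal sketch of using the tracepoint from a root shell, via the
tracing debugfs interface (paths assume tracefs is mounted at the
usual location; adjust if yours differs)::

	# Enable the tlb_flush tracepoint and watch the events,
	# which include a timestamp and the number of pages flushed
	echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable
	cat /sys/kernel/debug/tracing/trace_pipe

	# Turn it back off when done
	echo 0 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable
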
Essentially, you are balancing the cycles you spend doing invlpg
with the cycles that you spend refilling the TLB later.

You can measure how expensive TLB refills are by using
performance counters and 'perf stat', like this::

  perf stat -e
    cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
    cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
    cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
    cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
    cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
    cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/

That works on an IvyBridge-era CPU (i5-3320M).  Different CPUs
may have differently-named counters, but they should at least
be there in some form.  You can use pmu-tools 'ocperf list'
(https://github.com/andikleen/pmu-tools) to find the right
counters for a given CPU.

.. [1] A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
   says: "One execution of INVLPG is sufficient even for a page
   with size greater than 4 KBytes."