The Kernel Concurrency Sanitizer (KCSAN)
========================================

The Kernel Concurrency Sanitizer (KCSAN) is a dynamic race detector that
relies on compile-time instrumentation and uses a watchpoint-based sampling
approach to detect races. KCSAN's primary purpose is to detect `data races`_.

Usage
-----

KCSAN is supported by both GCC and Clang. Both compilers require version 11
or later.

To enable KCSAN, configure the kernel with::

    CONFIG_KCSAN = y

KCSAN provides several other configuration options to customize behaviour (see
the respective help text in ``lib/Kconfig.kcsan`` for more info).

Error reports
~~~~~~~~~~~~~

A typical data race report looks like this::

    ==================================================================
    BUG: KCSAN: data-race in generic_permission / kernfs_refresh_inode

    write to 0xffff8fee4c40700c of 4 bytes by task 175 on cpu 4:
     kernfs_refresh_inode+0x70/0x170
     kernfs_iop_permission+0x4f/0x90
     inode_permission+0x190/0x200
     link_path_walk.part.0+0x503/0x8e0
     path_lookupat.isra.0+0x69/0x4d0
     filename_lookup+0x136/0x280
     user_path_at_empty+0x47/0x60
     vfs_statx+0x9b/0x130
     __do_sys_newlstat+0x50/0xb0
     __x64_sys_newlstat+0x37/0x50
     do_syscall_64+0x85/0x260
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

    read to 0xffff8fee4c40700c of 4 bytes by task 166 on cpu 6:
     generic_permission+0x5b/0x2a0
     kernfs_iop_permission+0x66/0x90
     inode_permission+0x190/0x200
     link_path_walk.part.0+0x503/0x8e0
     path_lookupat.isra.0+0x69/0x4d0
     filename_lookup+0x136/0x280
     user_path_at_empty+0x47/0x60
     do_faccessat+0x11a/0x390
     __x64_sys_access+0x3c/0x50
     do_syscall_64+0x85/0x260
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 6 PID: 166 Comm: systemd-journal Not tainted 5.3.0-rc7+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    ==================================================================

The header of the report provides a short summary of the functions involved in
the race. It is followed by the access types and stack traces of the two
threads involved in the data race.

The other, less common type of data race report looks like this::

    ==================================================================
    BUG: KCSAN: data-race in e1000_clean_rx_irq+0x551/0xb10

    race at unknown origin, with read to 0xffff933db8a2ae6c of 1 bytes by interrupt on cpu 0:
     e1000_clean_rx_irq+0x551/0xb10
     e1000_clean+0x533/0xda0
     net_rx_action+0x329/0x900
     __do_softirq+0xdb/0x2db
     irq_exit+0x9b/0xa0
     do_IRQ+0x9c/0xf0
     ret_from_intr+0x0/0x18
     default_idle+0x3f/0x220
     arch_cpu_idle+0x21/0x30
     do_idle+0x1df/0x230
     cpu_startup_entry+0x14/0x20
     rest_init+0xc5/0xcb
     arch_call_rest_init+0x13/0x2b
     start_kernel+0x6db/0x700

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-rc7+ #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    ==================================================================

This report is generated when it was not possible to determine the other
racing thread, but a race was inferred because the data value of the watched
memory location changed. Such races can occur due to missing instrumentation
or, for example, DMA accesses. These reports are only generated if
``CONFIG_KCSAN_REPORT_RACE_UNKNOWN_ORIGIN=y`` (selected by default).
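
The inference behind such a report can be sketched in plain C. The following is
a minimal userspace model, not KCSAN's implementation: the ``stall`` callback
stands in for the delay after a watchpoint is set up, and
``stall_with_device_write`` is a hypothetical helper that imitates an
uninstrumented writer such as a DMA engine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook that stands in for the stall after a watchpoint is set. */
typedef void (*stall_fn)(void *ctx);

/* Stall during which nothing else touches the watched location. */
static void stall_quiet(void *ctx)
{
    (void)ctx;
}

/* Stall during which an uninstrumented writer (e.g. a device) updates it. */
static void stall_with_device_write(void *ctx)
{
    *(uint32_t *)ctx += 1;
}

/*
 * Snapshot the value, stall, then re-read. A mismatch means something wrote
 * to the location even though no instrumented access was observed.
 */
static bool value_changed_during_stall(const uint32_t *addr, stall_fn stall,
                                       void *ctx)
{
    assert(addr && stall);

    uint32_t before = *(const volatile uint32_t *)addr;
    stall(ctx);
    uint32_t after = *(const volatile uint32_t *)addr;

    return before != after;
}
```

Calling ``value_changed_during_stall(&v, stall_with_device_write, &v)`` returns
true in this model, which corresponds to the "race at unknown origin" case.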

Selective analysis
~~~~~~~~~~~~~~~~~~

It may be desirable to disable data race detection for specific accesses,
functions, compilation units, or entire subsystems. For static blacklisting,
the following options are available:

* KCSAN understands the ``data_race(expr)`` annotation, which tells KCSAN that
  any data races due to accesses in ``expr`` should be ignored and that the
  resulting behaviour when encountering a data race is deemed safe.

* Disabling data race detection for entire functions can be accomplished by
  using the function attribute ``__no_kcsan``::

    __no_kcsan
    void foo(void) {
        ...

  To dynamically limit the functions for which reports are generated, see the
  `DebugFS interface`_ blacklist/whitelist feature.

* To disable data race detection for a particular compilation unit, add to the
  ``Makefile``::

    KCSAN_SANITIZE_file.o := n

* To disable data race detection for all compilation units listed in a
  ``Makefile``, add to the respective ``Makefile``::

    KCSAN_SANITIZE := n
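
As a rough illustration of the first two options, the sketch below shows how
the annotations might appear in code. ``struct stats`` and both functions are
hypothetical. The two ``#define`` lines are userspace stand-ins, assumed here
only so that the fragment compiles outside the kernel; in the kernel the real
definitions come from the kernel headers.

```c
#include <assert.h>

/* Userspace stand-ins (assumption): the kernel provides the real ones. */
#ifndef data_race
#define data_race(expr) (expr)
#endif
#ifndef __no_kcsan
#define __no_kcsan
#endif

/* Hypothetical statistics counter updated concurrently elsewhere. */
struct stats {
    unsigned long packets;
};

/* A stale value is acceptable for diagnostics, so the racy read is marked. */
static unsigned long stats_snapshot(const struct stats *s)
{
    assert(s);
    return data_race(s->packets);
}

/* No access inside this function is checked for data races. */
__no_kcsan
static void stats_reset(struct stats *s)
{
    assert(s);
    s->packets = 0;
}
```

The annotation is narrow in ``stats_snapshot()``, where only one expression is
exempt, and broad in ``stats_reset()``, where the whole function is exempt.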

Furthermore, it is possible to tell KCSAN to show or hide entire classes of
data races, depending on preferences. These can be changed via the following
Kconfig options:

* ``CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY``: If enabled and a conflicting write
  is observed via a watchpoint, but the data value of the memory location was
  observed to remain unchanged, do not report the data race.

* ``CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC``: Assume that plain aligned writes
  up to word size are atomic by default. Assumes that such writes are not
  subject to unsafe compiler optimizations resulting in data races. The option
  causes KCSAN to not report data races due to conflicts where the only plain
  accesses are aligned writes up to word size.

DebugFS interface
~~~~~~~~~~~~~~~~~

The file ``/sys/kernel/debug/kcsan`` provides the following interface:

* Reading ``/sys/kernel/debug/kcsan`` returns various runtime statistics.

* Writing ``on`` or ``off`` to ``/sys/kernel/debug/kcsan`` allows turning KCSAN
  on or off, respectively.

* Writing ``!some_func_name`` to ``/sys/kernel/debug/kcsan`` adds
  ``some_func_name`` to the report filter list, which (by default) blacklists
  reporting data races where either one of the top stack frames is a function
  in the list.

* Writing either ``blacklist`` or ``whitelist`` to ``/sys/kernel/debug/kcsan``
  changes the report filtering behaviour. For example, the blacklist feature
  can be used to silence frequently occurring data races; the whitelist feature
  can help with reproduction and testing of fixes.
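
The filtering decision described in the list above can be modelled as follows.
This is an illustrative userspace sketch rather than the kernel's code:
``struct report_filter`` and ``should_report`` are hypothetical names, and the
whitelist semantics (report only when a listed function is involved) are an
assumption based on the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the report filter list and its mode. */
struct report_filter {
    const char *const *funcs;  /* function names added via "!name" */
    size_t nfuncs;
    bool whitelist;            /* false: blacklist mode (the default) */
};

static bool filter_contains(const struct report_filter *f, const char *func)
{
    for (size_t i = 0; i < f->nfuncs; i++) {
        if (strcmp(f->funcs[i], func) == 0)
            return true;
    }
    return false;
}

/* top_a and top_b are the top stack frames of the two racing accesses. */
static bool should_report(const struct report_filter *f, const char *top_a,
                          const char *top_b)
{
    assert(f && top_a && top_b);

    bool listed = filter_contains(f, top_a) || filter_contains(f, top_b);

    /* Blacklist: suppress listed functions. Whitelist: report only those. */
    return f->whitelist ? listed : !listed;
}
```

With ``kernfs_refresh_inode`` in the list, the first report shown earlier would
be suppressed in blacklist mode and would be the only kind reported in
whitelist mode.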

Tuning performance
~~~~~~~~~~~~~~~~~~

Core parameters that affect KCSAN's overall performance and bug detection
ability are exposed as kernel command-line arguments whose defaults can also be
changed via the corresponding Kconfig options.

* ``kcsan.skip_watch`` (``CONFIG_KCSAN_SKIP_WATCH``): Number of per-CPU memory
  operations to skip before another watchpoint is set up. Setting up
  watchpoints more frequently increases the likelihood of observing races.
  This parameter has the most significant impact on overall system performance
  and race detection ability.

* ``kcsan.udelay_task`` (``CONFIG_KCSAN_UDELAY_TASK``): For tasks, the
  microsecond delay to stall execution after a watchpoint has been set up.
  Larger values widen the window in which a race may be observed.

* ``kcsan.udelay_interrupt`` (``CONFIG_KCSAN_UDELAY_INTERRUPT``): For
  interrupts, the microsecond delay to stall execution after a watchpoint has
  been set up. Interrupts have tighter latency requirements, and their delay
  should generally be smaller than the one chosen for tasks.

They may be tweaked at runtime via ``/sys/module/kcsan/parameters/``.
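
To make the effect of ``kcsan.skip_watch`` concrete, here is a minimal
userspace sketch of skip-based sampling. ``struct sampler`` and
``should_set_watchpoint`` are hypothetical names, and the sketch leaves out
randomization of the skip count; it only illustrates why a smaller value means
more watchpoints and therefore more overhead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU sampling state. */
struct sampler {
    long skip_watch;  /* accesses to skip between watchpoints */
    long remaining;   /* accesses left to skip before the next watchpoint */
};

/* Called once per instrumented access; true means "set up a watchpoint". */
static bool should_set_watchpoint(struct sampler *s)
{
    assert(s && s->skip_watch >= 0);

    if (s->remaining > 0) {
        s->remaining--;
        return false;
    }
    s->remaining = s->skip_watch;
    return true;
}
```

In this model, a ``skip_watch`` of 3 sets up a watchpoint on every fourth
access, while a value of 0 sets one up on every access.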

Data Races
----------

In an execution, two memory accesses form a *data race* if they *conflict*,
they happen concurrently in different threads, and at least one of them is a
*plain access*; they *conflict* if both access the same memory location, and at
least one is a write. For a more thorough discussion and definition, see `"Plain
Accesses and Data Races" in the LKMM`_.

.. _"Plain Accesses and Data Races" in the LKMM: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/explanation.txt#n1922
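
The definition above can be written down as a small predicate. This is a
simplified sketch: ``struct access`` is a hypothetical model type, accesses are
compared by exact address only, and whether two accesses happened concurrently
is supplied by the caller rather than derived.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a single memory access; not a kernel type. */
struct access {
    uintptr_t addr;  /* memory location accessed */
    int thread;      /* identifier of the executing thread */
    bool is_write;   /* true for a write, false for a read */
    bool is_plain;   /* true if unmarked (no READ_ONCE(), atomic_*(), ...) */
};

/* Conflict: same memory location and at least one access is a write. */
static bool accesses_conflict(const struct access *a, const struct access *b)
{
    assert(a && b);
    return a->addr == b->addr && (a->is_write || b->is_write);
}

/*
 * Data race: conflicting, concurrent, in different threads, and at least one
 * plain access. Concurrency is an input to this model, not computed by it.
 */
static bool is_data_race(const struct access *a, const struct access *b,
                         bool concurrent)
{
    return accesses_conflict(a, b) && concurrent &&
           a->thread != b->thread && (a->is_plain || b->is_plain);
}
```

Under this model, a plain write racing with a marked read is still a data race,
whereas a marked write racing with a marked read is not.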

Relationship with the Linux-Kernel Memory Consistency Model (LKMM)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The LKMM defines the propagation and ordering rules of various memory
operations, which gives developers the ability to reason about concurrent code.
Ultimately, this makes it possible to determine the possible executions of
concurrent code, and whether that code is free from data races.

KCSAN is aware of *marked atomic operations* (``READ_ONCE``, ``WRITE_ONCE``,
``atomic_*``, etc.), but is oblivious to any ordering guarantees and simply
assumes that memory barriers are placed correctly. In other words, KCSAN
assumes that as long as a plain access is not observed to race with another
conflicting access, memory operations are correctly ordered.

This means that KCSAN will not report *potential* data races due to missing
memory ordering. Developers should therefore carefully consider the memory
ordering requirements that remain unchecked. If, however, missing memory
ordering (that is observable with a particular compiler and architecture)
leads to an observable data race (e.g. entering a critical section
erroneously), KCSAN would report the resulting data race.

Race Detection Beyond Data Races
--------------------------------

For code with complex concurrency design, race-condition bugs may not always
manifest as data races. Race conditions occur if concurrently executing
operations result in unexpected system behaviour. On the other hand, data races
are defined at the C-language level. The following macros can be used to check
properties of concurrent code where bugs would not manifest as data races.

.. kernel-doc:: include/linux/kcsan-checks.h
    :functions: ASSERT_EXCLUSIVE_WRITER ASSERT_EXCLUSIVE_WRITER_SCOPED
                ASSERT_EXCLUSIVE_ACCESS ASSERT_EXCLUSIVE_ACCESS_SCOPED
                ASSERT_EXCLUSIVE_BITS

Implementation Details
----------------------

KCSAN relies on observing that two accesses happen concurrently. Crucially, we
want to (a) increase the chances of observing races (especially for races that
manifest rarely), and (b) be able to actually observe them. We can accomplish
(a) by injecting various delays, and (b) by using address watchpoints (or
breakpoints).

If we deliberately stall a memory access while a watchpoint for its address is
set up, and then observe the watchpoint firing, two accesses to the same
address have just raced. Using hardware watchpoints, this is the approach taken
in `DataCollider
<http://usenix.org/legacy/events/osdi10/tech/full_papers/Erickson.pdf>`_.
Unlike DataCollider, KCSAN does not use hardware watchpoints, but instead
relies on compiler instrumentation and "soft watchpoints".

In KCSAN, watchpoints are implemented using an efficient encoding that stores
access type, size, and address in a long; the benefits of using "soft
watchpoints" are portability and greater flexibility. KCSAN then relies on the
compiler instrumenting plain accesses. For each instrumented plain access:

1. Check if a matching watchpoint exists; if yes, and at least one access is a
   write, then we encountered a racing access.

2. Periodically, if no matching watchpoint exists, set up a watchpoint and
   stall for a small randomized delay.

3. Also check the data value before the delay, and re-check the data value
   after the delay; if the values mismatch, we infer a race of unknown origin.

To detect data races between plain and marked accesses, KCSAN also annotates
marked accesses, but only to check if a watchpoint exists; i.e. KCSAN never
sets up a watchpoint on marked accesses. By never setting up watchpoints for
marked operations, if all accesses to a variable that is accessed concurrently
are properly marked, KCSAN will never trigger a watchpoint and therefore never
report the accesses.
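
The idea of packing access type, size, and address into a single word, and of
the check in step 1, can be sketched as follows. The bit layout, the macro and
function names, and the use of ``uint64_t`` in place of ``long`` are all
assumptions made for illustration; they are not taken from KCSAN's source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bit 63 = write flag, bits 59..62 = size, rest = address. */
#define WP_WRITE_BIT  (UINT64_C(1) << 63)
#define WP_SIZE_SHIFT 59
#define WP_SIZE_MASK  (UINT64_C(0xF) << WP_SIZE_SHIFT)
#define WP_ADDR_MASK  ((UINT64_C(1) << WP_SIZE_SHIFT) - 1)

/* Pack one access into a single word; 0 is reserved for "no watchpoint". */
static uint64_t wp_encode(uint64_t addr, unsigned size, bool is_write)
{
    assert(size >= 1 && size <= 15);
    return (is_write ? WP_WRITE_BIT : 0) |
           ((uint64_t)size << WP_SIZE_SHIFT) |
           (addr & WP_ADDR_MASK);
}

/* Step 1: does the current access race with the encoded watchpoint? */
static bool wp_matches(uint64_t wp, uint64_t addr, unsigned size,
                       bool is_write)
{
    uint64_t wp_addr = wp & WP_ADDR_MASK;
    uint64_t wp_size = (wp & WP_SIZE_MASK) >> WP_SIZE_SHIFT;
    bool wp_write = (wp & WP_WRITE_BIT) != 0;
    uint64_t cur = addr & WP_ADDR_MASK;

    if (wp == 0)
        return false;              /* empty slot */
    if (!wp_write && !is_write)
        return false;              /* two reads never conflict */

    /* The byte ranges [wp_addr, +wp_size) and [cur, +size) must overlap. */
    return cur < wp_addr + wp_size && wp_addr < cur + size;
}
```

Because a watchpoint is a single word in this model, a slot can be read and
compared without taking a lock, which is in line with the lock-free fast path
this document describes.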

Key Properties
~~~~~~~~~~~~~~

1. **Memory Overhead:** The overall memory overhead is only a few MiB
   depending on configuration. The current implementation uses a small array of
   longs to encode watchpoint information, which is negligible.

2. **Performance Overhead:** KCSAN's runtime aims to be minimal, using an
   efficient watchpoint encoding that does not require acquiring any shared
   locks in the fast-path. For kernel boot on a system with 8 CPUs:

   - 5.0x slow-down with the default KCSAN config;
   - 2.8x slow-down from runtime fast-path overhead only (set very large
     ``KCSAN_SKIP_WATCH`` and unset ``KCSAN_SKIP_WATCH_RANDOMIZE``).

3. **Annotation Overheads:** Minimal annotations are required outside the KCSAN
   runtime. As a result, maintenance overheads are minimal as the kernel
   evolves.

4. **Detects Racy Writes from Devices:** Due to checking data values upon
   setting up watchpoints, racy writes from devices can also be detected.

5. **Memory Ordering:** KCSAN is *not* explicitly aware of the LKMM's ordering
   rules; this may result in missed data races (false negatives).

6. **Analysis Accuracy:** For observed executions, due to using a sampling
   strategy, the analysis is *unsound* (false negatives possible), but aims to
   be complete (no false positives).

Alternatives Considered
-----------------------

An alternative data race detection approach for the kernel can be found in the
`Kernel Thread Sanitizer (KTSAN) <https://github.com/google/ktsan/wiki>`_.
KTSAN is a happens-before data race detector, which explicitly establishes the
happens-before order between memory operations, which can then be used to
determine data races as defined in `Data Races`_.

To build a correct happens-before relation, KTSAN must be aware of all ordering
rules of the LKMM and synchronization primitives. Unfortunately, any omission
leads to large numbers of false positives, which is especially detrimental in
the context of the kernel which includes numerous custom synchronization
mechanisms. To track the happens-before relation, KTSAN's implementation
requires metadata for each memory location (shadow memory), which for each page
corresponds to 4 pages of shadow memory, and can translate into overhead of
tens of GiB on a large system.
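
The shadow memory figure follows from simple arithmetic, sketched below. The
function name and the choice to round up to whole pages are assumptions for
illustration; only the ratio of 4 shadow pages per page comes from the text
above.

```c
#include <assert.h>
#include <stdint.h>

#define SHADOW_PAGES_PER_PAGE 4  /* ratio stated in the text above */

/* Estimate shadow memory for ram_bytes of tracked memory. */
static uint64_t ktsan_shadow_bytes(uint64_t ram_bytes, uint64_t page_size)
{
    assert(page_size > 0);

    uint64_t pages = (ram_bytes + page_size - 1) / page_size;  /* round up */

    return pages * SHADOW_PAGES_PER_PAGE * page_size;
}
```

For instance, 16 GiB of memory with 4 KiB pages gives 64 GiB of shadow memory
in this model, which is consistent with the "tens of GiB" estimate.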