The Kernel Concurrency Sanitizer (KCSAN)
========================================

The Kernel Concurrency Sanitizer (KCSAN) is a dynamic race detector, which
relies on compile-time instrumentation, and uses a watchpoint-based sampling
approach to detect races. KCSAN's primary purpose is to detect `data races`_.

Usage
-----

KCSAN is supported by both GCC and Clang. Both compilers require version 11 or
later.

To enable KCSAN configure the kernel with::

    CONFIG_KCSAN=y

KCSAN provides several other configuration options to customize behaviour (see
the respective help text in ``lib/Kconfig.kcsan`` for more info).

Error reports
~~~~~~~~~~~~~

A typical data race report looks like this::

    ==================================================================
    BUG: KCSAN: data-race in generic_permission / kernfs_refresh_inode

    write to 0xffff8fee4c40700c of 4 bytes by task 175 on cpu 4:
     kernfs_refresh_inode+0x70/0x170
     kernfs_iop_permission+0x4f/0x90
     inode_permission+0x190/0x200
     link_path_walk.part.0+0x503/0x8e0
     path_lookupat.isra.0+0x69/0x4d0
     filename_lookup+0x136/0x280
     user_path_at_empty+0x47/0x60
     vfs_statx+0x9b/0x130
     __do_sys_newlstat+0x50/0xb0
     __x64_sys_newlstat+0x37/0x50
     do_syscall_64+0x85/0x260
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

    read to 0xffff8fee4c40700c of 4 bytes by task 166 on cpu 6:
     generic_permission+0x5b/0x2a0
     kernfs_iop_permission+0x66/0x90
     inode_permission+0x190/0x200
     link_path_walk.part.0+0x503/0x8e0
     path_lookupat.isra.0+0x69/0x4d0
     filename_lookup+0x136/0x280
     user_path_at_empty+0x47/0x60
     do_faccessat+0x11a/0x390
     __x64_sys_access+0x3c/0x50
     do_syscall_64+0x85/0x260
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 6 PID: 166 Comm: systemd-journal Not tainted 5.3.0-rc7+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    ==================================================================

The header of the report provides a short summary of the functions involved in
the race. It is followed by the access types and stack traces of the two
threads involved in the data race.

The other less common type of data race report looks like this::

    ==================================================================
    BUG: KCSAN: data-race in e1000_clean_rx_irq+0x551/0xb10

    race at unknown origin, with read to 0xffff933db8a2ae6c of 1 bytes by interrupt on cpu 0:
     e1000_clean_rx_irq+0x551/0xb10
     e1000_clean+0x533/0xda0
     net_rx_action+0x329/0x900
     __do_softirq+0xdb/0x2db
     irq_exit+0x9b/0xa0
     do_IRQ+0x9c/0xf0
     ret_from_intr+0x0/0x18
     default_idle+0x3f/0x220
     arch_cpu_idle+0x21/0x30
     do_idle+0x1df/0x230
     cpu_startup_entry+0x14/0x20
     rest_init+0xc5/0xcb
     arch_call_rest_init+0x13/0x2b
     start_kernel+0x6db/0x700

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-rc7+ #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    ==================================================================

This report is generated when it was not possible to determine the other
racing thread, but a race was inferred due to the data value of the watched
memory location having changed. Such races can occur either due to missing
instrumentation or, for example, due to DMA accesses. These reports will only
be generated if ``CONFIG_KCSAN_REPORT_RACE_UNKNOWN_ORIGIN=y`` (selected by
default).

Selective analysis
~~~~~~~~~~~~~~~~~~

It may be desirable to disable data race detection for specific accesses,
functions, compilation units, or entire subsystems. For static blacklisting,
the following options are available:

* KCSAN understands the ``data_race(expr)`` annotation, which tells KCSAN that
  any data races due to accesses in ``expr`` should be ignored and that the
  resulting behaviour when encountering a data race is deemed safe.

* Disabling data race detection for entire functions can be accomplished by
  using the function attribute ``__no_kcsan``::

    __no_kcsan
    void foo(void) {
        ...
    }

  To dynamically limit for which functions to generate reports, see the
  `DebugFS interface`_ blacklist/whitelist feature.

* To disable data race detection for a particular compilation unit, add to the
  ``Makefile``::

    KCSAN_SANITIZE_file.o := n

* To disable data race detection for all compilation units listed in a
  ``Makefile``, add to the respective ``Makefile``::

    KCSAN_SANITIZE := n

Furthermore, it is possible to tell KCSAN to show or hide entire classes of
data races, depending on preferences. These can be changed via the following
Kconfig options:

* ``CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY``: If enabled and a conflicting write
  is observed via a watchpoint, but the data value of the memory location was
  observed to remain unchanged, do not report the data race.

* ``CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC``: Assume that plain aligned writes
  up to word size are atomic by default. Assumes that such writes are not
  subject to unsafe compiler optimizations resulting in data races. The option
  causes KCSAN to not report data races due to conflicts where the only plain
  accesses are aligned writes up to word size.

DebugFS interface
~~~~~~~~~~~~~~~~~

The file ``/sys/kernel/debug/kcsan`` provides the following interface:

* Reading ``/sys/kernel/debug/kcsan`` returns various runtime statistics.

* Writing ``on`` or ``off`` to ``/sys/kernel/debug/kcsan`` allows turning KCSAN
  on or off, respectively.

* Writing ``!some_func_name`` to ``/sys/kernel/debug/kcsan`` adds
  ``some_func_name`` to the report filter list, which (by default) blacklists
  reporting data races where either one of the top stackframes is a function
  in the list.

* Writing either ``blacklist`` or ``whitelist`` to ``/sys/kernel/debug/kcsan``
  changes the report filtering behaviour. For example, the blacklist feature
  can be used to silence frequently occurring data races; the whitelist feature
  can help with reproduction and testing of fixes.

Tuning performance
~~~~~~~~~~~~~~~~~~

Core parameters that affect KCSAN's overall performance and bug detection
ability are exposed as kernel command-line arguments whose defaults can also be
changed via the corresponding Kconfig options.

* ``kcsan.skip_watch`` (``CONFIG_KCSAN_SKIP_WATCH``): Number of per-CPU memory
  operations to skip, before another watchpoint is set up. Setting up
  watchpoints more frequently increases the likelihood of observing races.
  This parameter has the most significant impact on overall system performance
  and race detection ability.

* ``kcsan.udelay_task`` (``CONFIG_KCSAN_UDELAY_TASK``): For tasks, the
  microsecond delay to stall execution after a watchpoint has been set up.
  Larger values increase the window in which a race may be observed.

* ``kcsan.udelay_interrupt`` (``CONFIG_KCSAN_UDELAY_INTERRUPT``): For
  interrupts, the microsecond delay to stall execution after a watchpoint has
  been set up. Interrupts have tighter latency requirements, and their delay
  should generally be smaller than the one chosen for tasks.

These parameters may be tweaked at runtime via
``/sys/module/kcsan/parameters/``.

Data Races
----------

In an execution, two memory accesses form a *data race* if they *conflict*,
they happen concurrently in different threads, and at least one of them is a
*plain access*; they *conflict* if both access the same memory location, and at
least one is a write. For a more thorough discussion and definition, see `"Plain
Accesses and Data Races" in the LKMM`_.

.. _"Plain Accesses and Data Races" in the LKMM: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/explanation.txt#n1922

Relationship with the Linux-Kernel Memory Consistency Model (LKMM)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The LKMM defines the propagation and ordering rules of various memory
operations, which gives developers the ability to reason about concurrent code.
Ultimately this allows determining the possible executions of concurrent code,
and whether that code is free from data races.
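
As an illustration, the data race definition given under `Data Races`_ can be
written down as a small predicate. The following is a minimal userspace sketch,
not kernel code: ``struct access``, ``accesses_conflict()`` and
``is_data_race()`` are hypothetical names introduced here, "same memory
location" is interpreted as overlapping byte ranges, and whether two accesses
are concurrent is simply passed in by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor of one memory access; not a kernel type. */
struct access {
	uintptr_t addr;   /* start address of the accessed bytes */
	unsigned size;    /* number of bytes accessed */
	int thread;       /* id of the thread performing the access */
	bool is_write;    /* true for a write, false for a read */
	bool is_marked;   /* true for READ_ONCE()/WRITE_ONCE()/atomic_*() style accesses */
};

/*
 * Two accesses conflict if both access the same memory location and at
 * least one is a write. Assumption of this sketch: "same location" means
 * the accessed byte ranges overlap.
 */
static bool accesses_conflict(const struct access *a, const struct access *b)
{
	bool overlap = a->addr < b->addr + b->size &&
		       b->addr < a->addr + a->size;

	return overlap && (a->is_write || b->is_write);
}

/*
 * Two accesses form a data race if they conflict, happen concurrently in
 * different threads, and at least one of them is a plain (unmarked) access.
 * Concurrency is an input here; this sketch does not try to derive it.
 */
static bool is_data_race(const struct access *a, const struct access *b,
			 bool concurrent)
{
	return accesses_conflict(a, b) &&
	       concurrent &&
	       a->thread != b->thread &&
	       (!a->is_marked || !b->is_marked);
}
```

Under these assumptions, a plain write racing with a plain read of the same
four bytes from another thread satisfies the predicate, whereas the same pair
with both accesses marked, or with two reads, does not.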

KCSAN is aware of *marked atomic operations* (``READ_ONCE``, ``WRITE_ONCE``,
``atomic_*``, etc.), but is oblivious of any ordering guarantees and simply
assumes that memory barriers are placed correctly. In other words, KCSAN
assumes that as long as a plain access is not observed to race with another
conflicting access, memory operations are correctly ordered.

This means that KCSAN will not report *potential* data races due to missing
memory ordering. Developers should therefore carefully consider the memory
ordering requirements that remain unchecked. If, however, missing memory
ordering (that is observable with a particular compiler and architecture)
leads to an observable data race (e.g. entering a critical section
erroneously), KCSAN would report the resulting data race.

Race Detection Beyond Data Races
--------------------------------

For code with complex concurrency design, race-condition bugs may not always
manifest as data races. Race conditions occur if concurrently executing
operations result in unexpected system behaviour. On the other hand, data races
are defined at the C-language level. The following macros can be used to check
properties of concurrent code where bugs would not manifest as data races.

.. kernel-doc:: include/linux/kcsan-checks.h
    :functions: ASSERT_EXCLUSIVE_WRITER ASSERT_EXCLUSIVE_WRITER_SCOPED
                ASSERT_EXCLUSIVE_ACCESS ASSERT_EXCLUSIVE_ACCESS_SCOPED
                ASSERT_EXCLUSIVE_BITS

Implementation Details
----------------------

KCSAN relies on observing that two accesses happen concurrently. Crucially, we
want to (a) increase the chances of observing races (especially for races that
manifest rarely), and (b) be able to actually observe them. We can accomplish
(a) by injecting various delays, and (b) by using address watchpoints (or
breakpoints).

If we deliberately stall a memory access, while we have a watchpoint for its
address set up, and then observe the watchpoint to fire, two accesses to the
same address just raced. Using hardware watchpoints, this is the approach taken
in `DataCollider
<http://usenix.org/legacy/events/osdi10/tech/full_papers/Erickson.pdf>`_.
Unlike DataCollider, KCSAN does not use hardware watchpoints, but instead
relies on compiler instrumentation and "soft watchpoints".

In KCSAN, watchpoints are implemented using an efficient encoding that stores
access type, size, and address in a long; the benefits of using "soft
watchpoints" are portability and greater flexibility. KCSAN then relies on the
compiler instrumenting plain accesses. For each instrumented plain access:

1. Check if a matching watchpoint exists; if yes, and at least one access is a
   write, then we encountered a racing access.

2. Periodically, if no matching watchpoint exists, set up a watchpoint and
   stall for a small randomized delay.

3. Also check the data value before the delay, and re-check the data value
   after the delay; if the values mismatch, we infer a race of unknown origin.

To detect data races between plain and marked accesses, KCSAN also annotates
marked accesses, but only to check if a watchpoint exists; i.e. KCSAN never
sets up a watchpoint on marked accesses. By never setting up watchpoints for
marked operations, if all accesses to a variable that is accessed concurrently
are properly marked, KCSAN will never trigger a watchpoint and therefore never
report the accesses.

Key Properties
~~~~~~~~~~~~~~

1. **Memory Overhead:** The overall memory overhead is only a few MiB
   depending on configuration. The current implementation uses a small array of
   longs to encode watchpoint information, which is negligible.

2. **Performance Overhead:** KCSAN's runtime aims to be minimal, using an
   efficient watchpoint encoding that does not require acquiring any shared
   locks in the fast-path.
   For kernel boot on a system with 8 CPUs:

   - 5.0x slow-down with the default KCSAN config;
   - 2.8x slow-down from runtime fast-path overhead only (set very large
     ``KCSAN_SKIP_WATCH`` and unset ``KCSAN_SKIP_WATCH_RANDOMIZE``).

3. **Annotation Overheads:** Minimal annotations are required outside the KCSAN
   runtime. As a result, maintenance overheads are minimal as the kernel
   evolves.

4. **Detects Racy Writes from Devices:** Due to checking data values upon
   setting up watchpoints, racy writes from devices can also be detected.

5. **Memory Ordering:** KCSAN is *not* explicitly aware of the LKMM's ordering
   rules; this may result in missed data races (false negatives).

6. **Analysis Accuracy:** For observed executions, due to using a sampling
   strategy, the analysis is *unsound* (false negatives possible), but aims to
   be complete (no false positives).

Alternatives Considered
-----------------------

An alternative data race detection approach for the kernel can be found in the
`Kernel Thread Sanitizer (KTSAN) <https://github.com/google/ktsan/wiki>`_.
KTSAN is a happens-before data race detector, which explicitly establishes the
happens-before order between memory operations, which can then be used to
determine data races as defined in `Data Races`_.

To build a correct happens-before relation, KTSAN must be aware of all ordering
rules of the LKMM and synchronization primitives. Unfortunately, any omission
leads to large numbers of false positives, which is especially detrimental in
the context of the kernel which includes numerous custom synchronization
mechanisms. To track the happens-before relation, KTSAN's implementation
requires metadata for each memory location (shadow memory), which for each page
corresponds to 4 pages of shadow memory, and can translate into overhead of
tens of GiB on a large system.
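
To make the shadow memory figure concrete, the arithmetic can be sketched as
follows. This is only a back-of-the-envelope calculation based on the 4:1
ratio stated above; ``ktsan_shadow_bytes()`` is a hypothetical helper, and the
16 GiB of RAM and 4 KiB page size used in the note below are assumed example
values, not measurements.

```c
#include <stdint.h>

/* Ratio stated in the text: 4 pages of shadow memory per page of memory. */
#define SHADOW_PAGES_PER_PAGE 4ULL

/*
 * Hypothetical helper: estimate the shadow memory needed to cover
 * ram_bytes of memory, given the page size in bytes.
 */
static uint64_t ktsan_shadow_bytes(uint64_t ram_bytes, uint64_t page_size)
{
	uint64_t pages = ram_bytes / page_size;

	return pages * SHADOW_PAGES_PER_PAGE * page_size;
}
```

For an assumed system with 16 GiB of RAM and 4 KiB pages,
``ktsan_shadow_bytes(16ULL << 30, 4096)`` evaluates to 64 GiB, which is in line
with the "tens of GiB" estimate.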