.. _hwpoison:

========
hwpoison
========

What is hwpoison?
=================

Upcoming Intel CPUs have support for recovering from some memory errors
(``MCA recovery``). This requires the OS to declare a page "poisoned",
kill the processes associated with it and avoid using it in the future.

This patchkit implements the necessary infrastructure in the VM.

To quote the overview comment::

    High level machine check handler. Handles pages reported by the
    hardware as being corrupted usually due to a 2bit ECC memory or cache
    failure.

    This focusses on pages detected as corrupted in the background.
    When the current CPU tries to consume corruption the currently
    running process can just be killed directly instead. This implies
    that if the error cannot be handled for some reason it's safe to
    just ignore it because no corruption has been consumed yet. Instead
    when that happens another machine check will happen.

    Handles page cache pages in various states. The tricky part
    here is that we can access any page asynchronous to other VM
    users, because memory failures could happen anytime and anywhere,
    possibly violating some of their assumptions. This is why this code
    has to be extremely careful.
    Generally it tries to use normal locking
    rules, as in get the standard locks, even if that means the
    error handling takes potentially a long time.

    Some of the operations here are somewhat inefficient and have non
    linear algorithmic complexity, because the data structures have not
    been optimized for this case. This is in particular the case
    for the mapping from a vma to a process. Since this case is expected
    to be rare we hope we can get away with this.

The code consists of the high-level handler in mm/memory-failure.c,
a new page poison bit and various checks in the VM to handle poisoned
pages.

The main target right now is KVM guests, but it works for all kinds
of applications. KVM support requires a recent qemu-kvm release.

For the KVM use case a new signal type was needed so that
KVM can inject the machine check into the guest with the proper
address. This in theory allows other applications to handle
memory failures too. The expectation is that nearly all applications
won't do that, but some very specialized ones might.

Failure recovery modes
======================

There are two recovery modes, early kill and late kill, plus a third
setting that disables recovery entirely:

vm.memory_failure_recovery sysctl set to zero:
    All memory failures cause a panic. Do not attempt recovery.
    (On x86 this can also be affected by the tolerant level of the
    MCE subsystem.)

early kill
    (can be controlled globally and per process)
    Send SIGBUS to the application as soon as the error is detected.
    This allows applications that can handle memory errors gracefully
    (e.g. by dropping the affected object) to do so.
    This is the mode used by KVM qemu.

late kill
    Send SIGBUS when the application runs into the corrupted page.
    This is best for memory error unaware applications and is the default.
    Note that some pages are always handled as late kill.

User control
============

vm.memory_failure_recovery
    See sysctl.txt

vm.memory_failure_early_kill
    Enable early kill mode globally

PR_MCE_KILL
    Set early/late kill mode/revert to system default

    arg1: PR_MCE_KILL_CLEAR:
        Revert to system default
    arg1: PR_MCE_KILL_SET:
        arg2 defines thread specific mode

        PR_MCE_KILL_EARLY:
            Early kill
        PR_MCE_KILL_LATE:
            Late kill
        PR_MCE_KILL_DEFAULT:
            Use system global default

    Note that if you want to have a dedicated thread which handles
    the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
    call prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0)
    on the designated thread.
    Otherwise, the SIGBUS is sent to the main thread.

PR_MCE_KILL_GET
    Return the current mode

Testing
=======

* madvise(MADV_HWPOISON, ....) (as root) - Poison a page in the
  process for testing

* hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/``

  corrupt-pfn
    Inject hwpoison fault at PFN echoed into this file. This does
    some early filtering to avoid corrupting unintended pages in test
    suites.

  unpoison-pfn
    Software-unpoison page at PFN echoed into this file. This way
    a page can be reused again. This only works for Linux
    injected failures, not for real memory failures.

  Note these injection interfaces are not stable and might change between
  kernel versions.

  corrupt-filter-dev-major, corrupt-filter-dev-minor
    Only handle memory failures to pages associated with the file
    system defined by block device major/minor. -1U is the
    wildcard value. This should be only used for testing with
    artificial injection.

  corrupt-filter-memcg
    Limit injection to pages owned by a memory cgroup (memcg). The
    memcg is specified by its inode number.

    Example::

        mkdir /sys/fs/cgroup/mem/hwpoison

        usemem -m 100 -s 1000 &
        echo $(jobs -p) > /sys/fs/cgroup/mem/hwpoison/tasks

        memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
        echo "$memcg_ino" > /sys/kernel/debug/hwpoison/corrupt-filter-memcg

        page-types -p $(pidof init)   --hwpoison  # shall do nothing
        page-types -p $(pidof usemem) --hwpoison  # poison its pages

  corrupt-filter-flags-mask, corrupt-filter-flags-value
    When specified, only poison pages if ((page_flags & mask) ==
    value). This allows stress testing of many kinds of
    pages. The page_flags are the same as in /proc/kpageflags. The
    flag bits are defined in include/linux/kernel-page-flags.h and
    documented in Documentation/admin-guide/mm/pagemap.rst.

* Architecture specific MCE injector

  x86 has mce-inject, mce-test

  Some portable hwpoison test programs in mce-test, see below.
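
The madvise(MADV_HWPOISON) test hook listed first in this section,
together with the per-thread early kill setting from the User control
section, can be sketched as a few C helpers. This is only an illustrative
sketch, not code from the kernel tree or from mce-test. The function names
(``mce_kill_mode_name``, ``set_early_kill``, ``install_sigbus_handler``,
``poison_test_page``) are made up for this example. Actually poisoning a
page is assumed to need root and a kernel built with CONFIG_MEMORY_FAILURE,
and the exact moment at which SIGBUS arrives may differ between kernel
versions.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <unistd.h>

/* Hypothetical helper: printable name for a PR_MCE_KILL_GET result. */
const char *mce_kill_mode_name(int mode)
{
	switch (mode) {
	case PR_MCE_KILL_EARLY:
		return "early";
	case PR_MCE_KILL_LATE:
		return "late";
	case PR_MCE_KILL_DEFAULT:
		return "default";
	default:
		return "unknown";
	}
}

/* Request early kill for the calling thread; 0 on success, -1 on error. */
int set_early_kill(void)
{
	return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);
}

/* SIGBUS handler: report memory-failure si_codes, then exit. */
static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	static const char ao[] = "SIGBUS: BUS_MCEERR_AO (action optional)\n";
	static const char ar[] = "SIGBUS: BUS_MCEERR_AR (action required)\n";
	ssize_t n = 0;

	(void)sig;
	(void)ctx;
	/* Only async-signal-safe calls (write, _exit) are used here. */
	if (si->si_code == BUS_MCEERR_AO)
		n = write(STDERR_FILENO, ao, sizeof(ao) - 1);
	else if (si->si_code == BUS_MCEERR_AR)
		n = write(STDERR_FILENO, ar, sizeof(ar) - 1);
	(void)n;
	_exit(EXIT_FAILURE);
}

/* Install the handler with SA_SIGINFO so that si_code is available. */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}

/* Map, touch and poison one anonymous page; -1 on error, 0 otherwise. */
int poison_test_page(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	volatile char *p;

	p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	p[0] = 1;	/* touch the page so it is backed by real memory */
	if (madvise((void *)p, psz, MADV_HWPOISON) != 0) {
		perror("madvise(MADV_HWPOISON)");
		munmap((void *)p, psz);
		return -1;
	}
	/* If no SIGBUS has arrived yet, this read is expected to raise one. */
	(void)p[0];
	return 0;
}
```

A test program would call ``install_sigbus_handler()`` and
``set_early_kill()`` first and then ``poison_test_page()`` as root. Passing
the result of ``prctl(PR_MCE_KILL_GET, 0, 0, 0, 0)`` to
``mce_kill_mode_name()`` shows which mode the thread ended up in.
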

References
==========

http://halobates.de/mce-lc09-2.pdf
    Overview presentation from LinuxCon 09

git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
    Test suite (hwpoison specific portable tests in tsrc)

git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
    x86 specific injector


Limitations
===========

- Not all page types are supported and never will be. Most kernel internal
  objects cannot be recovered, only LRU pages for now.
- Right now hugepage support is missing.

---
Andi Kleen, Oct 2009