.. hwpoison:

========
hwpoison
========

What is hwpoison?
=================

Upcoming Intel CPUs have support for recovering from some memory errors
(``MCA recovery``). This requires the OS to declare a page "poisoned",
kill the processes associated with it and avoid using it in the future.

This patchkit implements the necessary infrastructure in the VM.

To quote the overview comment::

	High level machine check handler. Handles pages reported by the
	hardware as being corrupted usually due to a 2bit ECC memory or cache
	failure.

	This focusses on pages detected as corrupted in the background.
	When the current CPU tries to consume corruption the currently
	running process can just be killed directly instead. This implies
	that if the error cannot be handled for some reason it's safe to
	just ignore it because no corruption has been consumed yet. Instead
	when that happens another machine check will happen.

	Handles page cache pages in various states. The tricky part
	here is that we can access any page asynchronous to other VM
	users, because memory failures could happen anytime and anywhere,
	possibly violating some of their assumptions. This is why this code
	has to be extremely careful. Generally it tries to use normal locking
	rules, as in get the standard locks, even if that means the
	error handling takes potentially a long time.

	Some of the operations here are somewhat inefficient and have non
	linear algorithmic complexity, because the data structures have not
	been optimized for this case. This is in particular the case
	for the mapping from a vma to a process. Since this case is expected
	to be rare we hope we can get away with this.

The code consists of the high level handler in mm/memory-failure.c,
a new page poison bit and various checks in the VM to handle poisoned
pages.

The main target right now is KVM guests, but it works for all kinds
of applications. KVM support requires a recent qemu-kvm release.

For the KVM use there was a need for a new signal type so that
KVM can inject the machine check into the guest with the proper
address. This in theory allows other applications to handle
memory failures too. The expectation is that nearly all applications
won't do that, but some very specialized ones might.

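As a rough sketch of what such a specialized application might do, the
C fragment below installs a SIGBUS handler and records the failing
address. It assumes the failure is reported as SIGBUS with si_code
BUS_MCEERR_AO or BUS_MCEERR_AR and the address in si_addr, as described
in sigaction(2). The names mce_code_name, install_sigbus_handler and
last_mce_addr are hypothetical helpers invented for this illustration.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Map a SIGBUS si_code to a short label; NULL means "not a memory failure". */
static const char *mce_code_name(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AO:
		return "action optional";
	case BUS_MCEERR_AR:
		return "action required";
	default:
		return NULL;
	}
}

/* Address of the last "action optional" report, for the main program. */
static void *volatile last_mce_addr;

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;

	/*
	 * Only an "action optional" report can simply be noted and returned
	 * from. For anything else this sketch gives up: returning would
	 * re-execute the faulting access.
	 */
	if (info->si_code != BUS_MCEERR_AO)
		_exit(1);

	last_mce_addr = info->si_addr;
}

/* Returns 0 on success, -1 with errno set on failure. */
static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

What to do with the recorded address (drop a cache entry, rebuild an
object, and so on) is application specific and not shown here.
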
Failure recovery modes
======================

There are two (actually three) modes memory failure recovery can be in:

vm.memory_failure_recovery sysctl set to zero:
	All memory failures cause a panic. Do not attempt recovery.
	(On x86 this can also be affected by the tolerant level of the
	MCE subsystem.)

early kill
	(can be controlled globally and per process)
	Send SIGBUS to the application as soon as the error is detected.
	This allows applications that can process memory errors in a gentle
	way (e.g. drop the affected object) to do so.
	This is the mode used by KVM qemu.

late kill
	Send SIGBUS when the application runs into the corrupted page.
	This is best for memory error unaware applications and is the default.
	Note some pages are always handled as late kill.

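The three modes above can be told apart from user space by reading the
two sysctls. The following is a minimal sketch, assuming the sysctls are
exposed under /proc/sys/vm/ as usual. The helper names are hypothetical,
and the result is only the system-wide default: a per-process setting
can still override early or late kill.

```c
#include <stdio.h>

/* Read a single integer from a /proc/sys file; returns -1 if unavailable. */
static int read_sysctl_int(const char *path)
{
	FILE *f = fopen(path, "r");
	int value = -1;

	if (f == NULL)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}

/* Classify the global mode from the two sysctl values. */
static const char *recovery_mode_name(int recovery, int early_kill)
{
	if (recovery < 0 || early_kill < 0)
		return "unknown";	/* sysctl missing or unreadable */
	if (recovery == 0)
		return "panic";
	return early_kill ? "early kill" : "late kill";
}

static const char *system_recovery_mode(void)
{
	return recovery_mode_name(
		read_sysctl_int("/proc/sys/vm/memory_failure_recovery"),
		read_sysctl_int("/proc/sys/vm/memory_failure_early_kill"));
}
```

On a kernel built without memory failure support the files are absent
and the sketch reports "unknown".
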
User control
============

vm.memory_failure_recovery
	See sysctl.txt

vm.memory_failure_early_kill
	Enable early kill mode globally

PR_MCE_KILL
	Set early/late kill mode/revert to system default

	arg1: PR_MCE_KILL_CLEAR:
		Revert to system default
	arg1: PR_MCE_KILL_SET:
		arg2 defines thread specific mode

		PR_MCE_KILL_EARLY:
			Early kill
		PR_MCE_KILL_LATE:
			Late kill
		PR_MCE_KILL_DEFAULT:
			Use system global default

	Here arg1 and arg2 are the first and second arguments after the
	PR_MCE_KILL option of prctl().

	Note that if you want to have a dedicated thread which handles
	the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
	call prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0)
	on the designated thread. Otherwise, the SIGBUS is sent to the
	main thread.

PR_MCE_KILL_GET
	Return current mode

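The prctl() controls above can be wrapped as in the following sketch.
It assumes the constants from <sys/prctl.h> and the argument layout
documented in prctl(2), where unused trailing arguments must be zero.
The wrapper names are hypothetical.

```c
#include <sys/prctl.h>

/* Set the calling thread's policy: PR_MCE_KILL_EARLY, _LATE or _DEFAULT. */
static int set_mce_kill_mode(unsigned long mode)
{
	return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_SET, mode,
		     0UL, 0UL);
}

/* Revert the calling thread to the system default. */
static int clear_mce_kill_mode(void)
{
	return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_CLEAR, 0UL,
		     0UL, 0UL);
}

/* Returns PR_MCE_KILL_EARLY, _LATE or _DEFAULT, or -1 on error. */
static int get_mce_kill_mode(void)
{
	return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL);
}

/* Human-readable name for a mode returned by get_mce_kill_mode(). */
static const char *mce_kill_mode_name(int mode)
{
	switch (mode) {
	case PR_MCE_KILL_EARLY:
		return "early";
	case PR_MCE_KILL_LATE:
		return "late";
	case PR_MCE_KILL_DEFAULT:
		return "default";
	default:
		return "unknown";
	}
}
```

A thread dedicated to handling SIGBUS(BUS_MCEERR_AO) would call
set_mce_kill_mode(PR_MCE_KILL_EARLY) on itself before waiting for the
signal.
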
Testing
=======

* madvise(addr, length, MADV_HWPOISON) (as root) - Poison a page in the
  process for testing

* hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/``

  corrupt-pfn
	Inject hwpoison fault at PFN echoed into this file. This does
	some early filtering to avoid corrupting unintended pages in test suites.

  unpoison-pfn
	Software-unpoison page at PFN echoed into this file. This way
	a page can be reused again.  This only works for Linux
	injected failures, not for real memory failures.

  Note these injection interfaces are not stable and might change between
  kernel versions.

  corrupt-filter-dev-major, corrupt-filter-dev-minor
	Only handle memory failures to pages associated with the file
	system defined by block device major/minor.  -1U is the
	wildcard value.  This should be only used for testing with
	artificial injection.

  corrupt-filter-memcg
	Limit injection to pages owned by a memory cgroup (memcg),
	specified by the inode number of the memcg.

	Example::

		mkdir /sys/fs/cgroup/mem/hwpoison

		usemem -m 100 -s 1000 &
		echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks

		memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
		echo $memcg_ino > /sys/kernel/debug/hwpoison/corrupt-filter-memcg

		page-types -p `pidof init`   --hwpoison  # shall do nothing
		page-types -p `pidof usemem` --hwpoison  # poison its pages

  corrupt-filter-flags-mask, corrupt-filter-flags-value
	When specified, only poison pages if ((page_flags & mask) ==
	value).  This allows stress testing of many kinds of
	pages. The page_flags are the same as in /proc/kpageflags. The
	flag bits are defined in include/linux/kernel-page-flags.h and
	documented in Documentation/admin-guide/mm/pagemap.rst

* Architecture specific MCE injector

  x86 has mce-inject, mce-test

  Some portable hwpoison test programs in mce-test, see below.

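The madvise() route listed above can be exercised with a few lines of C.
This is only a sketch, under the assumption that the caller has
CAP_SYS_ADMIN and the kernel was built with CONFIG_MEMORY_FAILURE.
Otherwise madvise() is expected to fail with EPERM or EINVAL
respectively. The function name poison_test_page is hypothetical. Run it
only on a test machine, because the poisoned page frame is taken out of
use.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one anonymous page, touch it and ask the kernel to treat it as
 * hardware-poisoned. Returns 0 on success, -1 with errno set otherwise.
 */
static int poison_test_page(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char *page;
	int ret, saved_errno;

	if (page_size <= 0)
		return -1;

	page = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return -1;

	/* Fault the page in so that there is a page frame to poison. */
	memset(page, 0xaa, (size_t)page_size);

	ret = madvise(page, (size_t)page_size, MADV_HWPOISON);
	saved_errno = errno;

	/* Do not read the page again; just drop the mapping. */
	munmap(page, (size_t)page_size);

	errno = saved_errno;
	return ret;
}
```

Combined with a SIGBUS handler and early kill mode, this gives a
self-contained way to check that an application's recovery path is
reached at all.
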
References
==========

http://halobates.de/mce-lc09-2.pdf
	Overview presentation from LinuxCon 09

git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
	Test suite (hwpoison specific portable tests in tsrc)

git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
	x86 specific injector


Limitations
===========

- Not all page types are supported and never will be. Most kernel internal
  objects cannot be recovered, only LRU pages for now.
- Right now hugepage support is missing.

---
Andi Kleen, Oct 2009