.. SPDX-License-Identifier: GPL-2.0

===============================================================
Configurable sysfs parameters for the x86-64 machine check code
===============================================================

Machine checks report internal hardware error conditions detected
by the CPU. Uncorrected errors typically cause a machine check
(often with a panic); corrected ones cause a machine check log entry.

Machine checks are organized in banks (normally associated with
a hardware subsystem) and subevents in a bank. The exact meaning
of the banks and subevents is CPU-specific.

mcelog knows how to decode them.

When the "Machine check errors logged" message appears in the system
log, run mcelog to collect and decode the machine check entries
from /dev/mcelog. Normally mcelog should be run regularly from a cron job.

Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
(N = CPU number).
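
As a minimal sketch of how the per-CPU layout can be inspected, the
following Python helpers build the machinecheckN path and read whatever
entries are present. The function names and the ``root`` parameter are
hypothetical and exist only for illustration; the sysfs path is the one
given above.

```python
from pathlib import Path

# Root directory as given in the text above.
MCE_SYSFS_ROOT = Path("/sys/devices/system/machinecheck")


def machinecheck_dir(cpu, root=MCE_SYSFS_ROOT):
    """Return the machinecheckN directory for CPU number `cpu`."""
    if cpu < 0:
        raise ValueError("CPU number must be non-negative")
    return Path(root) / f"machinecheck{cpu}"


def read_entries(cpu, root=MCE_SYSFS_ROOT):
    """Return {entry name: stripped contents} for every readable file."""
    entries = {}
    for path in sorted(machinecheck_dir(cpu, root).iterdir()):
        if not path.is_file():
            continue
        try:
            entries[path.name] = path.read_text().strip()
        except OSError:
            # Assumption: some attributes may not be readable by this user.
            continue
    return entries
```

Calling ``read_entries(0)`` on a system that exposes this directory would
return the raw text of each entry for CPU 0; the ``root`` argument lets the
helpers be pointed at a copy of the tree.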

The directory contains some configurable entries:

bankNctl
	(N = bank number)

	64-bit hex bitmask enabling/disabling specific subevents for bank N.
	When a bit in the bitmask is zero, the respective
	subevent will not be reported.
	By default all events are enabled.
	Note that the BIOS maintains another mask to disable specific events
	per bank. That mask is not visible here.
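
The bitmask semantics of bankNctl can be sketched in Python as follows.
This is only an illustration: the function names are hypothetical, and the
exact text format of the entry (with or without a ``0x`` prefix) is an
assumption, which ``int(..., 16)`` tolerates either way.

```python
MASK64 = (1 << 64) - 1  # bankNctl is described as a 64-bit mask


def parse_bank_ctl(text):
    """Parse the hexadecimal text of a bankNctl entry into an int."""
    value = int(text.strip(), 16)
    if not 0 <= value <= MASK64:
        raise ValueError("bankNctl mask must fit in 64 bits")
    return value


def enabled_subevents(mask):
    """Return the bit positions that are set, i.e. subevents reported."""
    return [bit for bit in range(64) if (mask >> bit) & 1]


def disable_subevent(mask, bit):
    """Return a new mask with `bit` cleared so that subevent is not reported."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    return mask & ~(1 << bit) & MASK64
```

For example, ``format(disable_subevent(MASK64, 63), "x")`` gives the hex
string for a mask with every subevent except bit 63 enabled.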

The following entries appear for each CPU, but they are truly shared
between all CPUs.

check_interval
	How often to poll for corrected machine check errors, in seconds
	(note that the value is printed in hexadecimal). Default: 5 minutes.
	When the poller finds MCEs, it triggers an exponential speedup
	(poll more often) on the polling interval. When the poller stops
	finding MCEs, it triggers an exponential backoff (poll less often)
	on the polling interval. The check_interval variable is both the
	initial and maximum polling interval. 0 means no polling for
	corrected machine check errors (but some corrected errors might
	still be reported in other ways).
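
The speedup and backoff behaviour described for check_interval can be
modelled with a short Python sketch. The factor of two and the lower bound
``floor`` are assumptions made for illustration; the text only states that
the adjustment is exponential and that check_interval is the initial and
maximum interval. The function names are hypothetical.

```python
def parse_check_interval(text):
    """Parse the entry's text, which is printed in hexadecimal, as seconds."""
    return int(text.strip(), 16)


def next_interval(current, found_mce, check_interval, floor=0.01):
    """Return the next polling interval in seconds.

    current        -- interval used for the poll that just ran
    found_mce      -- True if that poll found machine check events
    check_interval -- configured initial and maximum interval
    floor          -- assumed lower bound, not taken from the text
    """
    if check_interval == 0:
        raise ValueError("polling is disabled when check_interval is 0")
    if found_mce:
        # Speed up: poll more often while errors keep showing up.
        return max(current / 2, floor)
    # Back off: poll less often, but never beyond check_interval.
    return min(current * 2, check_interval)
```

Starting from the default of 300 seconds (``"12c"`` in hexadecimal), two
polls that find MCEs bring the interval to 75 seconds under these
assumptions, and two quiet polls bring it back to 300.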

tolerant
	Tolerance level. When a machine check exception occurs for an
	uncorrected machine check, the kernel can take different actions.
	Because machine check exceptions can happen at any time, it is
	sometimes risky for the kernel to kill a process, since doing so
	defies normal kernel locking rules. The tolerance level configures
	how hard the kernel tries to recover even at some risk of
	deadlock. Higher tolerant values trade potentially better uptime
	against the risk of a crash or even corruption (for tolerant >= 3).

	0: always panic on uncorrected errors, log corrected errors
	1: panic or SIGBUS on uncorrected errors, log corrected errors
	2: SIGBUS or log uncorrected errors, log corrected errors
	3: never panic or SIGBUS, log all errors (for testing only)

	Default: 1

	Note that this only makes a difference if the CPU allows recovery
	from a machine check exception. Current x86 CPUs generally do not.

trigger
	Program to run when a machine check event is detected.
	This is an alternative to running mcelog regularly from cron
	and allows events to be detected faster.

monarch_timeout
	How long to wait for the other CPUs to take the machine check as
	well on an exception. 0 disables waiting for other CPUs.
	Unit: microseconds (us)

TBD: document entries for AMD threshold interrupt configuration.

For more details about the x86 machine check architecture,
see the Intel and AMD architecture manuals from their developer websites.

For more details about the architecture,
see http://one.firstfloor.org/~andi/mce.pdf