.. SPDX-License-Identifier: GPL-2.0

===============================================================
Configurable sysfs parameters for the x86-64 machine check code
===============================================================

Machine checks report internal hardware error conditions detected
by the CPU. Uncorrected errors typically cause a machine check
exception (often followed by a panic); corrected errors cause a
machine check log entry.

Machine checks are organized in banks (normally associated with
a hardware subsystem) and subevents within a bank. The exact meaning
of the banks and subevents is CPU specific.

mcelog knows how to decode them.

When you see the "Machine check errors logged" message in the system
log, mcelog should be run to collect and decode machine check entries
from /dev/mcelog. Normally mcelog should be run regularly from a cron job.

Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
(N = CPU number).

The directory contains some configurable entries:

bankNctl
	(N = bank number)

	64-bit hexadecimal bitmask enabling/disabling specific subevents
	for bank N. When a bit in the bitmask is zero, the respective
	subevent will not be reported.
	By default all events are enabled.
	Note that the BIOS maintains another mask to disable specific events
	per bank. That mask is not visible here.
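As a rough illustration of the bankNctl bitmask semantics, the following
Python sketch parses a mask, tests one subevent bit, and clears it. The
helper names are hypothetical. Writing the value back with a 0x prefix
is an assumption not stated in this document; verify it against your
kernel before relying on it.

```python
MASK_64 = (1 << 64) - 1  # bankNctl is described as a 64-bit mask


def parse_bank_ctl(text: str) -> int:
    """Parse the hexadecimal text of a bankNctl file into an integer."""
    return int(text.strip(), 16) & MASK_64


def subevent_enabled(mask: int, bit: int) -> bool:
    """Return True if the subevent at position `bit` is reported."""
    return bool((mask >> bit) & 1)


def disable_subevent(mask: int, bit: int) -> int:
    """Return a copy of `mask` with the given subevent bit cleared."""
    return mask & ~(1 << bit) & MASK_64


def format_bank_ctl(mask: int) -> str:
    """Format a mask as 0x-prefixed hex (assumed write format)."""
    return f"{mask & MASK_64:#x}"


if __name__ == "__main__":
    mask = parse_bank_ctl("ffffffffffffffff\n")  # default: all enabled
    mask = disable_subevent(mask, 5)             # hypothetical subevent 5
    print(subevent_enabled(mask, 5), format_bank_ctl(mask))
```

Which bit corresponds to which subevent is CPU specific, so bit 5 here
is only a placeholder.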

The following entries appear for each CPU, but they are truly shared
between all CPUs.

check_interval
	How often to poll for corrected machine check errors, in seconds
	(note that the output is hexadecimal). Default: 5 minutes. When the
	poller finds MCEs, it triggers an exponential speedup (poll more
	often) of the polling interval. When the poller stops finding MCEs,
	it triggers an exponential backoff (poll less often) of the polling
	interval. The check_interval variable is both the initial and
	maximum polling interval. 0 means no polling for corrected machine
	check errors (but some corrected errors might still be reported
	in other ways).
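The speedup and backoff behaviour of check_interval can be modelled
with a short Python sketch. The function names are hypothetical, and
both the factor of 2 and the one-second lower bound are assumptions made
for illustration. The text above only states that the change is
exponential and that check_interval is the initial and maximum value.

```python
def parse_check_interval(text: str) -> int:
    """Parse check_interval (hexadecimal seconds) into an integer."""
    return int(text.strip(), 16)


def next_poll_interval(current: int, found_mce: bool,
                       check_interval: int, floor: int = 1) -> int:
    """Model one step of the speedup/backoff.

    Assumptions: the interval halves when MCEs are found, doubles when
    none are found, and never drops below `floor` seconds.
    """
    if check_interval == 0:
        return 0  # polling for corrected errors is disabled
    if found_mce:
        return max(floor, current // 2)       # poll more often
    return min(check_interval, current * 2)   # poll less often, capped


if __name__ == "__main__":
    limit = parse_check_interval("12c\n")  # 0x12c = 300 s = 5 minutes
    interval = limit
    for found in (True, True, False, False, False):
        interval = next_poll_interval(interval, found, limit)
        print(found, interval)
```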

tolerant
	Tolerance level. When a machine check exception occurs for an
	uncorrected machine check, the kernel can take different actions.
	Since machine check exceptions can happen at any time, it is
	sometimes risky for the kernel to kill a process because doing so
	defies normal kernel locking rules. The tolerance level configures
	how hard the kernel tries to recover even at some risk of
	deadlock. Higher tolerant values trade potentially better uptime
	against the risk of a crash or even corruption (for tolerant >= 3).

	0: always panic on uncorrected errors, log corrected errors
	1: panic or SIGBUS on uncorrected errors, log corrected errors
	2: SIGBUS or log uncorrected errors, log corrected errors
	3: never panic or SIGBUS, log all errors (for testing only)

	Default: 1

	Note that this only makes a difference if the CPU allows recovery
	from a machine check exception. Current x86 CPUs generally do not.
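The tolerant levels can be restated as a small lookup table, which may
help when writing tooling that checks a system's configuration. This is
only a transcription of the list above into Python. The names are
hypothetical, and treating values above 3 like level 3 is an assumption
based on the "tolerant >= 3" wording.

```python
# Possible outcomes for an uncorrected error at each tolerant level.
# Corrected errors are logged at every level.
UNCORRECTED_ACTIONS = {
    0: {"panic"},
    1: {"panic", "SIGBUS"},
    2: {"SIGBUS", "log"},
    3: {"log"},
}
DEFAULT_TOLERANT = 1


def uncorrected_actions(level: int = DEFAULT_TOLERANT) -> set:
    """Return the set of possible actions for an uncorrected error."""
    if level < 0:
        raise ValueError("tolerant level must be non-negative")
    # Assumption: levels above 3 behave like level 3.
    return UNCORRECTED_ACTIONS[min(level, 3)]


def may_panic(level: int) -> bool:
    """Return True if the kernel may panic at this tolerant level."""
    return "panic" in uncorrected_actions(level)


if __name__ == "__main__":
    for level in range(4):
        print(level, sorted(uncorrected_actions(level)), may_panic(level))
```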

trigger
	Program to run when a machine check event is detected.
	This is an alternative to running mcelog regularly from cron
	and allows events to be detected faster.

monarch_timeout
	How long to wait for the other CPUs to machine check as well on an
	exception. 0 disables waiting for other CPUs.
	Unit: us (microseconds)
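For the trigger entry, a program can be as small as something that
records that an event happened. The Python sketch below appends a
timestamped line to a log file. The log location and function name are
hypothetical. How the kernel invokes the program (arguments,
environment) is not described in this document, so the sketch uses
neither.

```python
import os
import tempfile
import time
from typing import Optional

# Hypothetical log location; a real deployment would pick its own path.
LOG_PATH = os.path.join(tempfile.gettempdir(), "mce-trigger.log")


def record_event(log_path: str, now: Optional[float] = None) -> str:
    """Append one timestamped line to log_path and return that line.

    `now` is seconds since the epoch; None means the current time.
    """
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    line = f"{stamp} machine check event reported\n"
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(line)
    return line


if __name__ == "__main__":
    record_event(LOG_PATH)
```

A real trigger would more likely run mcelog to collect and decode the
entries; that step is omitted here to keep the sketch self-contained.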

TBD: document entries for AMD threshold interrupt configuration

For more details about the x86 machine check architecture,
see the Intel and AMD architecture manuals on their developer websites.

For more details about the architecture,
see http://one.firstfloor.org/~andi/mce.pdf