.. SPDX-License-Identifier: GPL-2.0

===============================================================
Configurable sysfs parameters for the x86-64 machine check code
===============================================================

Machine checks report internal hardware error conditions detected
by the CPU. Uncorrected errors typically cause a machine check
(often with panic), corrected ones cause a machine check log entry.

Machine checks are organized in banks (normally associated with
a hardware subsystem) and subevents in a bank. The exact meaning
of the banks and subevents is CPU specific.

mcelog knows how to decode them.

When you see the "Machine check errors logged" message in the system
log, mcelog should be run to collect and decode machine check entries
from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.

Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
(N = CPU number).

The directory contains some configurable entries:

bankNctl
        (N bank number)

        64-bit hex bitmask enabling/disabling specific subevents for bank N.
        When a bit in the bitmask is zero, the respective
        subevent will not be reported.
        By default all events are enabled.
        Note that the BIOS maintains another mask to disable specific events
        per bank. This is not visible here.

The following entries appear for each CPU, but they are truly shared
between all CPUs.

check_interval
        How often to poll for corrected machine check errors, in seconds
        (Note output is hexadecimal). Default 5 minutes. When the poller
        finds MCEs it triggers an exponential speedup (poll more often) on
        the polling interval. When the poller stops finding MCEs, it
        triggers an exponential backoff (poll less often) on the polling
        interval. The check_interval variable is both the initial and
        maximum polling interval. 0 means no polling for corrected machine
        check errors (but some corrected errors might still be reported
        in other ways).

tolerant
        Tolerance level. When a machine check exception occurs for an
        uncorrected machine check, the kernel can take different actions.
        Since machine check exceptions can happen at any time, it is sometimes
        risky for the kernel to kill a process because doing so defies
        normal kernel locking rules. The tolerance level configures
        how hard the kernel tries to recover even at some risk of
        deadlock. Higher tolerant values trade potentially better uptime
        against the risk of a crash or even corruption (for tolerant >= 3).

        0: always panic on uncorrected errors, log corrected errors
        1: panic or SIGBUS on uncorrected errors, log corrected errors
        2: SIGBUS or log uncorrected errors, log corrected errors
        3: never panic or SIGBUS, log all errors (for testing only)

        Default: 1

        Note this only makes a difference if the CPU allows recovery
        from a machine check exception. Current x86 CPUs generally do not.

trigger
        Program to run when a machine check event is detected.
        This is an alternative to running mcelog regularly from cron
        and allows events to be detected faster.

monarch_timeout
        How long to wait for the other CPUs to machine check too on an
        exception. 0 to disable waiting for other CPUs.
        Unit: us

TBD document entries for AMD threshold interrupt configuration

For more details about the x86 machine check architecture
see the Intel and AMD architecture manuals from their developer websites.

For more details about the architecture
see http://one.firstfloor.org/~andi/mce.pdf
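
As an illustration of the entries described above, the following is a
minimal Python sketch that reads check_interval and the bankNctl masks for
CPU 0. It is not part of the kernel or of mcelog. The helper names
(parse_hex_attr, enabled_subevents, read_hex_attr) are hypothetical, and the
sketch assumes that both values are printed as plain hexadecimal text, with
or without a leading "0x". With the default of 5 minutes, check_interval
would read back as 12c (300 seconds).

```python
from pathlib import Path

# Directory layout taken from the text above; N is the CPU number.
MCE_ROOT = Path("/sys/devices/system/machinecheck")


def parse_hex_attr(text: str) -> int:
    """Parse a sysfs value printed in hexadecimal, with or without '0x'."""
    return int(text.strip(), 16)


def enabled_subevents(mask: int) -> list[int]:
    """Return the bit positions that are set in a 64-bit bankNctl mask."""
    return [bit for bit in range(64) if (mask >> bit) & 1]


def read_hex_attr(cpu: int, name: str, root: Path = MCE_ROOT) -> int:
    """Read one hexadecimal attribute from machinecheckN (hypothetical helper)."""
    path = root / f"machinecheck{cpu}" / name
    return parse_hex_attr(path.read_text())


if __name__ == "__main__":
    cpu_dir = MCE_ROOT / "machinecheck0"
    if not cpu_dir.is_dir():
        print("no machinecheck sysfs directory on this system")
    else:
        try:
            print("check_interval (s):", read_hex_attr(0, "check_interval"))
            for ctl in sorted(cpu_dir.glob("bank*ctl")):
                mask = parse_hex_attr(ctl.read_text())
                print(ctl.name, "enabled subevents:", len(enabled_subevents(mask)))
        except (OSError, ValueError) as err:
            # Attributes may be unreadable or formatted differently on some kernels.
            print("could not read machine check attributes:", err)
```

Parsing is kept separate from file access so that the hexadecimal handling
can be checked on a machine that does not expose these entries.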