.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism targets real-time alerting, so that the
user knows when something bad has happened to a PCI device. Its goals are:

  * Provide alert debug information.
  * Self healing.
  * If a problem needs vendor support, provide a way to gather all needed
    debugging information.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, FW error, RX/TX error)
or unknown (driver specific).
For each registered health reporter a driver can issue error/health reports
asynchronously. All health report handling is done by ``devlink``.
The device driver can provide specific callbacks for each "health reporter", e.g.:

  * Recovery procedures
  * Diagnostics procedures
  * Object dump procedures
  * OOB initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.

Actions
=======

Once an error is reported, devlink health will perform the following actions:

  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    no other dump is already stored)
  * An auto recovery attempt is made, depending on:

    - Auto-recovery configuration
    - Grace period vs. time passed since last recover

User Interface
==============

The user can access/change each reporter's parameters and driver specific callbacks
via ``devlink``, e.g. per error type (per health reporter):

  * Configure the reporter's generic parameters (like: disable/enable auto recovery)
  * Invoke recovery procedure
  * Run diagnostics
  * Object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per DEV and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers a reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves diagnostics data from a reporter on a device.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.
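
The handling steps listed under Actions, together with the single-dump
behaviour of the dump interfaces, can be illustrated with a small user-space
model. This is only a sketch written in Python, not kernel code: the
``Reporter`` class, its field names, and the ``health_report``, ``dump_get``
and ``dump_clear`` helpers are hypothetical, and the grace period check is
deliberately simplified to "time since the last recovery attempt".

.. code-block:: python

    import time
    from dataclasses import dataclass, field
    from typing import Callable, Optional


    @dataclass
    class Reporter:
        """Hypothetical model of one health reporter (not the kernel struct)."""
        name: str
        recover: Optional[Callable[[], bool]] = None  # driver recovery callback
        dump: Optional[Callable[[], dict]] = None     # driver object dump callback
        auto_recover: bool = True                     # user-configurable
        grace_period: float = 0.0                     # seconds, user-configurable
        healthy: bool = True
        error_count: int = 0
        recover_count: int = 0
        saved_dump: Optional[dict] = None
        last_recovery: Optional[float] = None
        trace: list = field(default_factory=list)     # stands in for trace events


    def health_report(r, msg, now=None):
        """Handle one error report; return True if auto recovery succeeded."""
        now = time.monotonic() if now is None else now
        r.trace.append(f"{r.name}: {msg}")            # log the event
        r.error_count += 1                            # update status and statistics
        r.healthy = False
        if r.saved_dump is None and r.dump is not None:
            r.saved_dump = r.dump()                   # keep only a single dump
        if not r.auto_recover or r.recover is None:
            return False                              # auto recovery not configured
        if r.last_recovery is not None and now - r.last_recovery < r.grace_period:
            return False                              # still inside the grace period
        r.last_recovery = now
        if r.recover():
            r.recover_count += 1
            r.healthy = True
            return True
        return False


    def dump_get(r):
        """Return the stored dump, generating one first if none is stored."""
        if r.saved_dump is None and r.dump is not None:
            r.saved_dump = r.dump()
        return r.saved_dump


    def dump_clear(r):
        """Drop the stored dump so that a later error can save a new one."""
        r.saved_dump = None

In this model, a second error reported inside the grace period still updates
the statistics but leaves the reporter in the error state until the user
triggers recovery or a later report arrives after the grace period.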

The following diagram provides a general overview of ``devlink-health``::

                                                    netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
     mlx5_core                              devlink    |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+
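
The request path in the diagram (a user request arrives over netlink, devlink
looks up the reporter, and the operation is executed by a driver callback) can
be sketched as follows. This is a hypothetical Python model, not the kernel
API: the ``Devlink`` class, ``reporter_create`` and ``request`` are illustrative
names, and raising ``NotImplementedError`` for a missing callback is an
assumption of the model.

.. code-block:: python

    from typing import Callable, Dict


    class Devlink:
        """Hypothetical model of the devlink side of the diagram."""

        def __init__(self) -> None:
            # reporter name -> {operation name -> driver callback}
            self.reporters: Dict[str, Dict[str, Callable[[], object]]] = {}

        def reporter_create(self, name: str,
                            ops: Dict[str, Callable[[], object]]) -> None:
            """Driver side: register a reporter with its callbacks."""
            if name in self.reporters:
                raise ValueError(f"reporter {name!r} already exists")
            self.reporters[name] = dict(ops)

        def request(self, name: str, op: str) -> object:
            """User side: run 'diagnose', 'recover' or 'dump' on a reporter."""
            handler = self.reporters[name].get(op)
            if handler is None:
                # The driver did not provide this callback for the reporter.
                raise NotImplementedError(f"{name}: {op} not supported")
            return handler()


    # Example: a driver registers a "tx" reporter with two callbacks.
    dl = Devlink()
    dl.reporter_create("tx", {
        "diagnose": lambda: {"state": "ok"},
        "recover": lambda: True,
    })
    diagnose_result = dl.request("tx", "diagnose")

Because each reporter carries its own set of callbacks, different parts of a
driver can register reporters that support different operations.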