.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism is intended for real-time alerting, so that
the user knows when something bad has happened to a PCI device. It aims to:

  * Provide alert debug information.
  * Enable self-healing.
  * Provide a way to gather all the debugging information needed when a
    problem requires vendor support.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, FW error, RX/TX
error) or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter", e.g.:

  * Recovery procedures
  * Diagnostics procedures
  * Object dump procedures
  * OOB initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.
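
As a rough illustration of the reporter/callback relationship described above,
the following self-contained userspace C sketch models a device that registers
one reporter per error type, each with its own optional callbacks. All names
here (``reporter_ops_model``, ``reporter_create``, ``tx_ops``, ``fw_ops``) are
hypothetical and chosen for illustration only; the sketch does not use the
in-kernel devlink API.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>
   #include <string.h>

   /* Hypothetical callback table; any callback may be left NULL. */
   struct reporter_ops_model {
       const char *name;             /* error/health type, e.g. "tx" */
       int (*recover)(void *priv);   /* recovery procedure */
       int (*diagnose)(void *priv);  /* diagnostics procedure */
       int (*dump)(void *priv);      /* object dump procedure */
   };

   struct reporter_model {
       const struct reporter_ops_model *ops;
       void *priv;                   /* driver context passed to callbacks */
   };

   #define MAX_REPORTERS 4

   struct device_model {
       struct reporter_model reporters[MAX_REPORTERS];
       size_t count;
   };

   /* Register one reporter per type; NULL on duplicate name or overflow. */
   static struct reporter_model *
   reporter_create(struct device_model *dev,
                   const struct reporter_ops_model *ops, void *priv)
   {
       size_t i;

       assert(dev != NULL && ops != NULL && ops->name != NULL);
       if (dev->count == MAX_REPORTERS)
           return NULL;
       for (i = 0; i < dev->count; i++)
           if (strcmp(dev->reporters[i].ops->name, ops->name) == 0)
               return NULL;
       dev->reporters[dev->count].ops = ops;
       dev->reporters[dev->count].priv = priv;
       return &dev->reporters[dev->count++];
   }

   /* Example driver-side handlers for two different reporters. */
   static int tx_recover(void *priv)
   {
       *(int *)priv += 1;            /* count recoveries in driver context */
       return 0;
   }

   static int fw_diagnose(void *priv)
   {
       (void)priv;                   /* this handler needs no context */
       return 0;
   }

   static const struct reporter_ops_model tx_ops = {
       .name = "tx", .recover = tx_recover,
   };
   static const struct reporter_ops_model fw_ops = {
       .name = "fw", .diagnose = fw_diagnose,
   };

In this model, registering ``tx_ops`` and ``fw_ops`` on the same
``device_model`` yields two independent reporters with different handlers,
while a second attempt to register ``tx_ops`` returns ``NULL``.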

Actions
=======

Once an error is reported, devlink health will perform the following actions:

  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    no other dump is already stored)
  * An auto recovery attempt is made, depending on:

    - Auto-recovery configuration
    - Grace period vs. time passed since the last recovery
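
The sequence above can be modelled in a few lines of userspace C. This is a
simplified sketch of the listed actions under stated assumptions, not the
kernel implementation: the structure, its field names and the millisecond
timestamps are hypothetical, the dump is reduced to a flag, and a recovery
attempt is assumed to always succeed.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Hypothetical userspace model; not the kernel's data structures. */
   enum reporter_state { REPORTER_HEALTHY, REPORTER_ERROR };

   struct reporter_model {
       const char *name;
       enum reporter_state state;
       uint64_t error_count;
       uint64_t recovery_count;
       bool auto_recover;          /* auto-recovery configuration */
       uint64_t grace_period_ms;   /* minimum gap between auto recoveries */
       uint64_t last_recovery_ms;  /* time of the last recovery */
       bool dump_stored;           /* true while a dump is held */
   };

   /*
    * Handle one error report at time now_ms (assumed to be monotonic).
    * Returns true if an automatic recovery was performed.
    */
   static bool model_report(struct reporter_model *r, const char *msg,
                            uint64_t now_ms)
   {
       assert(r != NULL && msg != NULL);

       /* 1. Log the error (stands in for the trace event). */
       fprintf(stderr, "health report: %s: %s\n", r->name, msg);

       /* 2. Update health status and statistics. */
       r->state = REPORTER_ERROR;
       r->error_count++;

       /* 3. Take a dump only if none is stored already. */
       if (!r->dump_stored)
           r->dump_stored = true;

       /* 4. Auto recovery depends on configuration and grace period. */
       if (!r->auto_recover)
           return false;
       if (r->recovery_count > 0 &&
           now_ms - r->last_recovery_ms < r->grace_period_ms)
           return false;

       r->state = REPORTER_HEALTHY;
       r->recovery_count++;
       r->last_recovery_ms = now_ms;
       return true;
   }

With ``auto_recover`` enabled and a grace period of 500 ms, a report at
t=1000 triggers a recovery, a second report at t=1200 falls inside the grace
period and leaves the model in the error state, and a report at t=1600
triggers a recovery again.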

User Interface
==============

The user can access or change each reporter's parameters and invoke its
driver-specific callbacks via ``devlink``, e.g. per error type (per health
reporter):

  * Configure the reporter's generic parameters (e.g. disable/enable auto
    recovery)
  * Invoke the recovery procedure
  * Run diagnostics
  * Get an object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per DEV and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers a reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves diagnostics data from a reporter on a device.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.
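
The interplay between ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET`` and
``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` described in the table behaves
like a single-slot cache. The userspace C sketch below illustrates only that
behaviour; ``dump_slot``, ``dump_get`` and ``dump_clear`` are hypothetical
names, and the dump content is a placeholder string rather than
reporter-defined data.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stdio.h>

   /* Hypothetical single-slot dump storage for one reporter. */
   struct dump_slot {
       bool stored;              /* true while a dump is saved */
       unsigned int generation;  /* how many dumps were generated so far */
       char data[32];            /* placeholder for reporter-defined output */
   };

   /* Return the stored dump, generating a new one only if none is stored. */
   static const char *dump_get(struct dump_slot *s)
   {
       assert(s != NULL);
       if (!s->stored) {
           s->generation++;
           snprintf(s->data, sizeof(s->data), "dump #%u", s->generation);
           s->stored = true;
       }
       return s->data;
   }

   /* Drop the saved dump so that the next dump_get() generates a fresh one. */
   static void dump_clear(struct dump_slot *s)
   {
       assert(s != NULL);
       s->stored = false;
       s->data[0] = '\0';
   }

In this model, repeated calls to ``dump_get()`` return the same stored dump
until ``dump_clear()`` is called, after which the next ``dump_get()``
generates a new one.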

The following diagram provides a general overview of ``devlink-health``::

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
     mlx5_core                             devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+