=============================
The Linux Watchdog driver API
=============================

Last reviewed: 10/05/2007



Copyright 2002 Christer Weingel <wingel@nano-system.com>

Some parts of this document are copied verbatim from the sbc60xxwdt
driver which is (c) Copyright 2000 Jakob Oestergaard <jakob@ostenfeld.dk>

This document describes the state of the Linux 2.4.18 kernel.

Introduction
============

A Watchdog Timer (WDT) is a hardware circuit that can reset the
computer system in case of a software fault.  You probably knew that
already.

Usually a userspace daemon will notify the kernel watchdog driver via the
/dev/watchdog special device file that userspace is still alive, at
regular intervals.  When such a notification occurs, the driver will
usually tell the hardware watchdog that everything is in order, and
that the watchdog should wait for yet another little while to reset
the system.  If userspace fails (RAM error, kernel bug, whatever), the
notifications cease to occur, and the hardware watchdog will reset the
system (causing a reboot) after the timeout occurs.

The Linux watchdog API is a rather ad-hoc construction and different
drivers implement different, and sometimes incompatible, parts of it.
This file is an attempt to document the existing usage and allow
future driver writers to use it as a reference.

The simplest API
================

All drivers support the basic mode of operation, where the watchdog
activates as soon as /dev/watchdog is opened and will reboot unless
the watchdog is pinged within a certain time.  This time is called the
timeout or margin.  The simplest way to ping the watchdog is to write
some data to the device.  For a very simple watchdog daemon, see the
source file samples/watchdog/watchdog-simple.c.
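
As an illustration only, a daemon along these lines could be sketched as
follows.  This is not the code from samples/watchdog/watchdog-simple.c.
The function names wd_ping_write and wd_run_simple are made up for this
example, and the device path and ping interval are left to the caller.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Ping the watchdog by writing one byte to an already-open descriptor.
 * Returns 0 on success, -1 if the write failed.
 * (Hypothetical helper name, not part of any kernel API.)
 */
int wd_ping_write(int fd)
{
	return write(fd, "\0", 1) == 1 ? 0 : -1;
}

/*
 * Open the device, which activates the watchdog, and ping it every
 * interval_s seconds.  This only returns when the open or a ping
 * fails, and then always returns -1.
 */
int wd_run_simple(const char *path, unsigned int interval_s)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	while (wd_ping_write(fd) == 0)
		sleep(interval_s);
	close(fd);
	return -1;
}
```

A caller would typically pass "/dev/watchdog" and an interval comfortably
shorter than the timeout, for example wd_run_simple("/dev/watchdog", 10).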

A more advanced daemon could, for example, check that an HTTP server is
still responding before doing the write call to ping the watchdog.

When the device is closed, the watchdog is disabled, unless the "Magic
Close" feature is supported (see below).  This is not always such a
good idea, since if there is a bug in the watchdog daemon and it
crashes, the system will not reboot.  Because of this, some of the
drivers support the configuration option "Disable watchdog shutdown on
close", CONFIG_WATCHDOG_NOWAYOUT.  If it is set to Y when compiling
the kernel, there is no way of disabling the watchdog once it has been
started.  So, if the watchdog daemon crashes, the system will reboot
after the timeout has passed.  Watchdog devices also usually support
the nowayout module parameter so that this option can be controlled at
runtime.

Magic Close feature
===================

If a driver supports "Magic Close", the driver will not disable the
watchdog unless a specific magic character 'V' has been sent to
/dev/watchdog just before closing the file.  If the userspace daemon
closes the file without sending this special character, the driver
will assume that the daemon (and userspace in general) died, and will
stop pinging the watchdog without disabling it first.  This will then
cause a reboot if the watchdog is not re-opened in sufficient time.
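
A clean shutdown of the daemon could therefore look roughly like the
sketch below.  The helper name wd_magic_close is hypothetical.  Whether
the watchdog really stops still depends on the driver supporting Magic
Close and on nowayout not being set.

```c
#include <unistd.h>

/*
 * Write the magic character 'V' and then close the descriptor, so that
 * a driver supporting Magic Close may disable the watchdog.
 * Returns 0 on success, -1 if the write or the close failed.
 * (Hypothetical helper name, not part of any kernel API.)
 */
int wd_magic_close(int fd)
{
	int ok = write(fd, "V", 1) == 1;

	if (close(fd) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```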

The ioctl API
=============

All conforming drivers also support an ioctl API.

Pinging the watchdog using an ioctl:

All drivers that have an ioctl interface support at least one ioctl,
KEEPALIVE.  This ioctl does exactly the same thing as a write to the
watchdog device, so the main loop of the simple daemon described above
could be replaced with::

	while (1) {
		ioctl(fd, WDIOC_KEEPALIVE, 0);
		sleep(10);
	}

The argument to the ioctl is ignored.

Setting and getting the timeout
===============================

For some drivers it is possible to modify the watchdog timeout on the
fly with the SETTIMEOUT ioctl.  Those drivers have the WDIOF_SETTIMEOUT
flag set in their option field.  The argument is an integer
representing the timeout in seconds.  The driver returns the real
timeout used in the same variable, and this timeout might differ from
the requested one due to limitations of the hardware::

    int timeout = 45;
    ioctl(fd, WDIOC_SETTIMEOUT, &timeout);
    printf("The timeout was set to %d seconds\n", timeout);

This example might actually print "The timeout was set to 60 seconds"
if the device has a granularity of minutes for its timeout.
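
Since not every driver supports SETTIMEOUT, a caller may want to check
the ioctl's return value before trusting the result.  A minimal sketch,
with the hypothetical wrapper name wd_set_timeout:

```c
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/*
 * Ask the driver for a timeout of `requested` seconds.
 * Returns the timeout the driver reports it is actually using, or -1
 * if the ioctl failed (for example because the driver does not
 * support setting the timeout).
 * (Hypothetical helper name, not part of any kernel API.)
 */
int wd_set_timeout(int fd, int requested)
{
	int timeout = requested;

	if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout) < 0)
		return -1;
	return timeout;
}
```

Comparing the return value with the requested value shows whether the
hardware adjusted the timeout.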

Starting with the Linux 2.4.18 kernel, it is possible to query the
current timeout using the GETTIMEOUT ioctl::

    ioctl(fd, WDIOC_GETTIMEOUT, &timeout);
    printf("The timeout is %d seconds\n", timeout);

Pretimeouts
===========

Some watchdog timers can be set to have a trigger go off before the
actual time they will reset the system.  This can be done with an NMI,
interrupt, or other mechanism.  This allows Linux to record useful
information (like panic information and kernel coredumps) before it
resets::

    int pretimeout = 10;
    ioctl(fd, WDIOC_SETPRETIMEOUT, &pretimeout);

Note that the pretimeout is the number of seconds before the time
when the timeout will go off.  It is not the number of seconds until
the pretimeout.  So, for instance, if you set the timeout to 60 seconds
and the pretimeout to 10 seconds, the pretimeout will go off in 50
seconds.  Setting a pretimeout to zero disables it.
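
The arithmetic can be written down as a small helper.  The name
wd_pretimeout_delay is made up for this example, and treating a
pretimeout that is not smaller than the timeout as unusable is an
assumption of this sketch, not something this document specifies.

```c
/*
 * Seconds from the last ping until the pretimeout fires, given the
 * timeout and the pretimeout (both in seconds).
 * Returns -1 if the pretimeout is disabled (zero) or, by assumption of
 * this sketch, if it is not smaller than the timeout.
 * (Hypothetical helper name, not part of any kernel API.)
 */
int wd_pretimeout_delay(int timeout, int pretimeout)
{
	if (pretimeout <= 0 || pretimeout >= timeout)
		return -1;
	return timeout - pretimeout;
}
```

With a timeout of 60 and a pretimeout of 10 this returns 50, matching
the example above.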

There is also a get function for getting the pretimeout::

    ioctl(fd, WDIOC_GETPRETIMEOUT, &pretimeout);
    printf("The pretimeout is %d seconds\n", pretimeout);

Not all watchdog drivers will support a pretimeout.

Get the number of seconds before reboot
=======================================

Some watchdog drivers have the ability to report the remaining time
before the system will reboot.  WDIOC_GETTIMELEFT is the ioctl
that returns the number of seconds before reboot::

    int timeleft;
    ioctl(fd, WDIOC_GETTIMELEFT, &timeleft);
    printf("The time left before reboot is %d seconds\n", timeleft);

Environmental monitoring
========================

All watchdog drivers are required to return more information about the
system.  Some do temperature, fan and power level monitoring, and some can
tell you the reason for the last reboot of the system.  The GETSUPPORT
ioctl is available to ask what the device can do::

	struct watchdog_info ident;
	ioctl(fd, WDIOC_GETSUPPORT, &ident);

The fields returned in the ident struct are:

	================	=============================================
	identity		a string identifying the watchdog driver
	firmware_version	the firmware version of the card if available
	options			flags describing what the device supports
	================	=============================================

The options field can have the following bits set, and describes what
kind of information the GETSTATUS and GETBOOTSTATUS ioctls can
return.

	================	=========================
	WDIOF_OVERHEAT		Reset due to CPU overheat
	================	=========================

The machine was last rebooted by the watchdog because the thermal limit was
exceeded.

	==============		==========
	WDIOF_FANFAULT		Fan failed
	==============		==========

A system fan monitored by the watchdog card has failed.

	=============		================
	WDIOF_EXTERN1		External relay 1
	=============		================

External monitoring relay/source 1 was triggered.  Controllers intended for
real world applications include external monitoring pins that will trigger
a reset.

	=============		================
	WDIOF_EXTERN2		External relay 2
	=============		================

External monitoring relay/source 2 was triggered.

	================	=====================
	WDIOF_POWERUNDER	Power bad/power fault
	================	=====================

The machine is showing an undervoltage status.

	===============		=============================
	WDIOF_CARDRESET		Card previously reset the CPU
	===============		=============================

The last reboot was caused by the watchdog card.

	================	=====================
	WDIOF_POWEROVER		Power over voltage
	================	=====================

The machine is showing an overvoltage status.  Note that if one level is
under and one over, both bits will be set.  This may seem odd but makes
sense.

	===================	=====================
	WDIOF_KEEPALIVEPING	Keep alive ping reply
	===================	=====================

The watchdog saw a keepalive ping since it was last queried.

	================	=======================
	WDIOF_SETTIMEOUT	Can set/get the timeout
	================	=======================

The watchdog timeout can be set and queried.

	================	================================
	WDIOF_PRETIMEOUT	Pretimeout (in seconds), get/set
	================	================================

The watchdog can do pretimeouts.

For those drivers that return any bits set in the option field, the
GETSTATUS and GETBOOTSTATUS ioctls can be used to ask for the current
status, and the status at the last reboot, respectively::

    int flags;
    ioctl(fd, WDIOC_GETSTATUS, &flags);

    /* or */

    ioctl(fd, WDIOC_GETBOOTSTATUS, &flags);

Note that not all devices support these two calls, and some only
support the GETBOOTSTATUS call.
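
To make a returned flags word (or the options field from GETSUPPORT)
readable, the bits can be translated into names.  The following is a
sketch only: the struct, table and function names are hypothetical, and
the table lists just the WDIOF_* bits described in this document.

```c
#include <string.h>
#include <linux/watchdog.h>

struct wd_flag_name {
	int bit;
	const char *name;
};

/* Only the bits described in this document, in the same order. */
static const struct wd_flag_name wd_flag_names[] = {
	{ WDIOF_OVERHEAT,      "OVERHEAT" },
	{ WDIOF_FANFAULT,      "FANFAULT" },
	{ WDIOF_EXTERN1,       "EXTERN1" },
	{ WDIOF_EXTERN2,       "EXTERN2" },
	{ WDIOF_POWERUNDER,    "POWERUNDER" },
	{ WDIOF_CARDRESET,     "CARDRESET" },
	{ WDIOF_POWEROVER,     "POWEROVER" },
	{ WDIOF_KEEPALIVEPING, "KEEPALIVEPING" },
	{ WDIOF_SETTIMEOUT,    "SETTIMEOUT" },
	{ WDIOF_PRETIMEOUT,    "PRETIMEOUT" },
};

/*
 * Write the names of the known bits set in `flags` into buf, separated
 * by single spaces.  Names that no longer fit are dropped.
 * Returns the number of names written.
 */
int wd_describe_flags(int flags, char *buf, size_t len)
{
	size_t i, used = 0;
	int count = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';
	for (i = 0; i < sizeof(wd_flag_names) / sizeof(wd_flag_names[0]); i++) {
		const char *name = wd_flag_names[i].name;
		size_t need = strlen(name) + (count ? 1 : 0);

		if (!(flags & wd_flag_names[i].bit))
			continue;
		if (used + need >= len)
			break;	/* out of room: keep what already fits */
		if (count)
			buf[used++] = ' ';
		memcpy(buf + used, name, strlen(name) + 1);
		used += strlen(name);
		count++;
	}
	return count;
}
```

For example, a boot status with WDIOF_OVERHEAT and WDIOF_CARDRESET set
would be rendered as "OVERHEAT CARDRESET".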

Some drivers can measure the temperature using the GETTEMP ioctl.  The
returned value is the temperature in degrees Fahrenheit::

    int temperature;
    ioctl(fd, WDIOC_GETTEMP, &temperature);

Finally, the SETOPTIONS ioctl can be used to control some aspects of
the card's operation::

    int options = 0;
    ioctl(fd, WDIOC_SETOPTIONS, &options);

The following options are available:

	=================	================================
	WDIOS_DISABLECARD	Turn off the watchdog timer
	WDIOS_ENABLECARD	Turn on the watchdog timer
	WDIOS_TEMPPANIC		Kernel panic on temperature trip
	=================	================================
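
As a rough sketch of how these options might be used, the helper below
turns the timer on or off through SETOPTIONS.  The name wd_set_enabled
is hypothetical, and whether a given driver honours the request is not
guaranteed by this document.

```c
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/*
 * Turn the watchdog timer on (enable != 0) or off (enable == 0) using
 * the SETOPTIONS ioctl.
 * Returns 0 on success, -1 if the ioctl failed.
 * (Hypothetical helper name, not part of any kernel API.)
 */
int wd_set_enabled(int fd, int enable)
{
	int options = enable ? WDIOS_ENABLECARD : WDIOS_DISABLECARD;

	return ioctl(fd, WDIOC_SETOPTIONS, &options) < 0 ? -1 : 0;
}
```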

[FIXME -- better explanations]