=============================
The Linux Watchdog driver API
=============================

Last reviewed: 10/05/2007



Copyright 2002 Christer Weingel <wingel@nano-system.com>

Some parts of this document are copied verbatim from the sbc60xxwdt
driver which is (c) Copyright 2000 Jakob Oestergaard <jakob@ostenfeld.dk>

This document describes the state of the Linux 2.4.18 kernel.

Introduction
============

A Watchdog Timer (WDT) is a hardware circuit that can reset the
computer system in case of a software fault. You probably knew that
already.

Usually a userspace daemon will notify the kernel watchdog driver via the
/dev/watchdog special device file that userspace is still alive, at
regular intervals. When such a notification occurs, the driver will
usually tell the hardware watchdog that everything is in order, and
that the watchdog should wait for yet another little while to reset
the system. If userspace fails (RAM error, kernel bug, whatever), the
notifications cease to occur, and the hardware watchdog will reset the
system (causing a reboot) after the timeout occurs.

The Linux watchdog API is a rather ad-hoc construction and different
drivers implement different, and sometimes incompatible, parts of it.
This file is an attempt to document the existing usage and allow
future driver writers to use it as a reference.

The simplest API
================

All drivers support the basic mode of operation, where the watchdog
activates as soon as /dev/watchdog is opened and will reboot unless
the watchdog is pinged within a certain time. This time is called the
timeout or margin. The simplest way to ping the watchdog is to write
some data to the device. So a very simple watchdog daemon would look
like this source file: see samples/watchdog/watchdog-simple.c

A more advanced daemon could, for example, check that an HTTP server is
still responding before doing the write call to ping the watchdog.

When the device is closed, the watchdog is disabled, unless the "Magic
Close" feature is supported (see below). This is not always such a
good idea, since if there is a bug in the watchdog daemon and it
crashes, the system will not reboot. Because of this, some of the
drivers support the configuration option "Disable watchdog shutdown on
close", CONFIG_WATCHDOG_NOWAYOUT. If it is set to Y when compiling
the kernel, there is no way of disabling the watchdog once it has been
started. So, if the watchdog daemon crashes, the system will reboot
after the timeout has passed. Watchdog devices also usually support
the nowayout module parameter so that this option can be controlled at
runtime.
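
The write-based ping described above can be sketched roughly as follows.
This is a minimal illustration and not a copy of
samples/watchdog/watchdog-simple.c. The function names ``wd_ping`` and
``wd_run`` and the ``max_pings`` bound are assumptions made for this
example only; they are not part of any kernel or library API.

```c
#include <fcntl.h>
#include <unistd.h>

/* Ping the watchdog by writing a single byte to the open device. */
int wd_ping(int fd)
{
    ssize_t n = write(fd, "\0", 1);

    return n == 1 ? 0 : -1;
}

/*
 * Open the device at `path` (opening is what starts the watchdog) and
 * ping it every `interval` seconds. A negative `max_pings` means "loop
 * forever"; the bound is a hypothetical convenience so the loop can be
 * tried against an ordinary file instead of real hardware.
 */
int wd_run(const char *path, unsigned int interval, int max_pings)
{
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;

    for (int i = 0; max_pings < 0 || i < max_pings; i++) {
        if (wd_ping(fd) < 0) {
            close(fd);
            return -1;
        }
        sleep(interval);
    }

    /*
     * Whether this close() stops the timer depends on the driver and
     * on CONFIG_WATCHDOG_NOWAYOUT, as described in the text.
     */
    close(fd);
    return 0;
}
```

A daemon would typically call ``wd_run("/dev/watchdog", 10, -1)``, with an
interval chosen to be comfortably shorter than the driver's timeout.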

Magic Close feature
===================

If a driver supports "Magic Close", the driver will not disable the
watchdog unless a specific magic character 'V' has been sent to
/dev/watchdog just before closing the file. If the userspace daemon
closes the file without sending this special character, the driver
will assume that the daemon (and userspace in general) died, and will
stop pinging the watchdog without disabling it first. This will then
cause a reboot if the watchdog is not re-opened in sufficient time.

The ioctl API
=============

All conforming drivers also support an ioctl API.

Pinging the watchdog using an ioctl:

All drivers that have an ioctl interface support at least one ioctl,
KEEPALIVE. This ioctl does exactly the same thing as a write to the
watchdog device, so the main loop in the above program could be
replaced with::

    while (1) {
        ioctl(fd, WDIOC_KEEPALIVE, 0);
        sleep(10);
    }

The argument to the ioctl is ignored.

Setting and getting the timeout
===============================

For some drivers it is possible to modify the watchdog timeout on the
fly with the SETTIMEOUT ioctl; those drivers have the WDIOF_SETTIMEOUT
flag set in their option field.
The argument is an integer representing the timeout in seconds. The
driver returns the real timeout used in the same variable, and this
timeout might differ from the requested one due to limitations of the
hardware::

    int timeout = 45;
    ioctl(fd, WDIOC_SETTIMEOUT, &timeout);
    printf("The timeout was set to %d seconds\n", timeout);

This example might actually print "The timeout was set to 60 seconds"
if the device has a granularity of minutes for its timeout.

Starting with the Linux 2.4.18 kernel, it is possible to query the
current timeout using the GETTIMEOUT ioctl::

    ioctl(fd, WDIOC_GETTIMEOUT, &timeout);
    printf("The timeout is %d seconds\n", timeout);

Pretimeouts
===========

Some watchdog timers can be set to have a trigger go off before the
actual time they will reset the system. This can be done with an NMI,
interrupt, or other mechanism. This allows Linux to record useful
information (like panic information and kernel coredumps) before it
resets::

    pretimeout = 10;
    ioctl(fd, WDIOC_SETPRETIMEOUT, &pretimeout);

Note that the pretimeout is the number of seconds before the time
when the timeout will go off. It is not the number of seconds until
the pretimeout.
So, for instance, if you set the timeout to 60 seconds
and the pretimeout to 10 seconds, the pretimeout will go off in 50
seconds. Setting a pretimeout to zero disables it.

There is also a get function for getting the pretimeout::

    ioctl(fd, WDIOC_GETPRETIMEOUT, &timeout);
    printf("The pretimeout is %d seconds\n", timeout);

Not all watchdog drivers will support a pretimeout.

Get the number of seconds before reboot
=======================================

Some watchdog drivers have the ability to report the remaining time
before the system will reboot. WDIOC_GETTIMELEFT is the ioctl
that returns the number of seconds before reboot::

    ioctl(fd, WDIOC_GETTIMELEFT, &timeleft);
    printf("The time left is %d seconds\n", timeleft);

Environmental monitoring
========================

All watchdog drivers are required to return more information about the
system; some do temperature, fan and power level monitoring, some can
tell you the reason for the last reboot of the system.
The GETSUPPORT ioctl is available to ask what the device can do::

    struct watchdog_info ident;
    ioctl(fd, WDIOC_GETSUPPORT, &ident);

The fields returned in the ident struct are:

    ================  =============================================
    identity          a string identifying the watchdog driver
    firmware_version  the firmware version of the card if available
    options           flags describing what the device supports
    ================  =============================================

The options field can have the following bits set, and describes what
kind of information the GETSTATUS and GETBOOTSTATUS ioctls can
return.

    ================  =========================
    WDIOF_OVERHEAT    Reset due to CPU overheat
    ================  =========================

The machine was last rebooted by the watchdog because the thermal limit was
exceeded.

    ==============  ==========
    WDIOF_FANFAULT  Fan failed
    ==============  ==========

A system fan monitored by the watchdog card has failed.

    =============  ================
    WDIOF_EXTERN1  External relay 1
    =============  ================

External monitoring relay/source 1 was triggered.
Controllers intended for real world applications include external
monitoring pins that will trigger a reset.

    =============  ================
    WDIOF_EXTERN2  External relay 2
    =============  ================

External monitoring relay/source 2 was triggered.

    ================  =====================
    WDIOF_POWERUNDER  Power bad/power fault
    ================  =====================

The machine is showing an undervoltage status.

    ===============  =============================
    WDIOF_CARDRESET  Card previously reset the CPU
    ===============  =============================

The last reboot was caused by the watchdog card.

    ===============  ==================
    WDIOF_POWEROVER  Power over voltage
    ===============  ==================

The machine is showing an overvoltage status. Note that if one level is
under and one over, both bits will be set - this may seem odd but makes
sense.

    ===================  =====================
    WDIOF_KEEPALIVEPING  Keep alive ping reply
    ===================  =====================

The watchdog saw a keepalive ping since it was last queried.


    ================  =======================
    WDIOF_SETTIMEOUT  Can set/get the timeout
    ================  =======================

The watchdog timeout can be set and queried.

    ================  ================================
    WDIOF_PRETIMEOUT  Pretimeout (in seconds), get/set
    ================  ================================

The watchdog can do pretimeouts.


For those drivers that return any bits set in the option field, the
GETSTATUS and GETBOOTSTATUS ioctls can be used to ask for the current
status, and the status at the last reboot, respectively::

    int flags;
    ioctl(fd, WDIOC_GETSTATUS, &flags);

    or

    ioctl(fd, WDIOC_GETBOOTSTATUS, &flags);

Note that not all devices support these two calls, and some only
support the GETBOOTSTATUS call.

Some drivers can measure the temperature using the GETTEMP ioctl.
The returned value is the temperature in degrees Fahrenheit::

    int temperature;
    ioctl(fd, WDIOC_GETTEMP, &temperature);

Finally the SETOPTIONS ioctl can be used to control some aspects of
the card's operation::

    int options = 0;
    ioctl(fd, WDIOC_SETOPTIONS, &options);

The following options are available:

    =================  ================================
    WDIOS_DISABLECARD  Turn off the watchdog timer
    WDIOS_ENABLECARD   Turn on the watchdog timer
    WDIOS_TEMPPANIC    Kernel panic on temperature trip
    =================  ================================

[FIXME -- better explanations]
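
The GETSUPPORT, GETSTATUS and GETBOOTSTATUS calls described above could be
combined into a small reporting helper along the following lines. This is
only a sketch: the helper names ``wd_flags_to_string`` and ``wd_report``
and the short flag labels are invented for illustration, and the table
covers only the WDIOF_* bits listed in this document (the header
<linux/watchdog.h> may define more).

```c
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* WDIOF_* bits from <linux/watchdog.h>; the labels are illustrative. */
static const struct {
    int bit;
    const char *name;
} wd_flag_names[] = {
    { WDIOF_OVERHEAT,      "OVERHEAT" },
    { WDIOF_FANFAULT,      "FANFAULT" },
    { WDIOF_EXTERN1,       "EXTERN1" },
    { WDIOF_EXTERN2,       "EXTERN2" },
    { WDIOF_POWERUNDER,    "POWERUNDER" },
    { WDIOF_CARDRESET,     "CARDRESET" },
    { WDIOF_POWEROVER,     "POWEROVER" },
    { WDIOF_KEEPALIVEPING, "KEEPALIVEPING" },
    { WDIOF_SETTIMEOUT,    "SETTIMEOUT" },
    { WDIOF_PRETIMEOUT,    "PRETIMEOUT" },
};

/*
 * Write a space-separated list of the known flags set in `flags` into
 * `buf` and return how many names were written.
 */
int wd_flags_to_string(int flags, char *buf, size_t len)
{
    size_t used = 0;
    int count = 0;

    if (len == 0)
        return 0;
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(wd_flag_names) / sizeof(wd_flag_names[0]); i++) {
        if (!(flags & wd_flag_names[i].bit))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         count ? " " : "", wd_flag_names[i].name);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0';   /* drop the partial name and stop */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}

/*
 * Print what the device supports and, where the driver implements the
 * calls, its current and boot status. Returns -1 if GETSUPPORT fails.
 */
int wd_report(int fd)
{
    struct watchdog_info ident;
    char names[128];
    int flags;

    if (ioctl(fd, WDIOC_GETSUPPORT, &ident) < 0)
        return -1;
    wd_flags_to_string((int)ident.options, names, sizeof(names));
    printf("%.32s supports: %s\n", (const char *)ident.identity, names);

    /* Not every driver implements these two, so failures are skipped. */
    if (ioctl(fd, WDIOC_GETSTATUS, &flags) == 0) {
        wd_flags_to_string(flags, names, sizeof(names));
        printf("current status: %s\n", names);
    }
    if (ioctl(fd, WDIOC_GETBOOTSTATUS, &flags) == 0) {
        wd_flags_to_string(flags, names, sizeof(names));
        printf("boot status: %s\n", names);
    }
    return 0;
}
```

Checking each ioctl return value matters here because, as noted above, some
devices support only GETBOOTSTATUS and others support neither status call.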