=====================
I/O statistics fields
=====================

Since 2.4.20 (and some versions before, with patches), and 2.5.45,
more extensive disk statistics have been introduced to help measure disk
activity. Tools such as ``sar`` and ``iostat`` typically interpret these
and do the work for you, but in case you are interested in creating your
own tools, the fields are explained here.

In 2.4, the information is found as additional fields in
``/proc/partitions``. In 2.6 and later, the same information is found in
two places: one is in the file ``/proc/diskstats``, and the other is
within the sysfs file system, which must be mounted in order to obtain
the information. Throughout this document we'll assume that sysfs
is mounted on ``/sys``, although of course it may be mounted anywhere.
Both ``/proc/diskstats`` and sysfs use the same source for the information
and so should not differ.

Here are examples of these different formats::

   2.4:
      3     0   39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
      3     1    9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030

   2.6+ sysfs:
      446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
      35486 38030 38030 38030

   2.6+ diskstats:
      3    0   hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
      3    1   hda1 35486 38030 38030 38030

   4.18+ diskstats:
      3    0   hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 0 0 0 0

On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you
have a choice of ``cat /sys/block/hda/stat`` or
``grep 'hda ' /proc/diskstats``.

The advantage of one over the other is that the sysfs choice works well
if you are watching a known, small set of disks. ``/proc/diskstats`` may
be a better choice if you are watching a large number of disks, because
you'll avoid the overhead of 50, 100, or 500 or more opens/closes with
each snapshot of your disk statistics.

In 2.4, the statistics fields are those after the device name. In
the above example, the first field of statistics would be 446216.
By contrast, in 2.6+ if you look at ``/sys/block/hda/stat``, you'll
find just the statistics fields, beginning with 446216. If you look at
``/proc/diskstats``, the fields will be preceded by the major and
minor device numbers, and device name. Each of these formats provides
the same fields, each meaning exactly the same things. The number of
fields has grown over time: 11 originally, 15 since 4.18 (which added
the discard statistics), and 17 once the flush statistics (fields 16
and 17 below) were added.
All fields except field 9 are cumulative since boot. Field 9 should
go to zero as I/Os complete; all others only increase (unless they
overflow and wrap). Wrapping might eventually occur on a very busy
or long-lived system, so applications should be prepared to deal with
it. Regarding wrapping, the types of the fields are either unsigned
int (32-bit) or unsigned long (32-bit or 64-bit, depending on your
machine) as noted per-field below. Unless your observations are very
spread in time, these fields should not wrap twice before you notice it.

Each set of stats only applies to the indicated device; if you want
system-wide stats you'll have to find all the devices and sum them all
up.
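
If you are writing your own tool, a minimal Python sketch (not part of
the kernel; the field names below are this document's invention) that
reads ``/proc/diskstats`` and labels however many of the 11, 15, or 17
fields the running kernel provides might look like this::

   # Hypothetical helper: parse /proc/diskstats into named fields.
   # The names are illustrative only; the kernel just prints numbers.
   FIELDS = [
       "reads_completed", "reads_merged", "sectors_read", "read_ms",
       "writes_completed", "writes_merged", "sectors_written", "write_ms",
       "ios_in_progress", "io_ms", "weighted_io_ms",  # 11 fields: all 2.6+
       "discards_completed", "discards_merged",
       "sectors_discarded", "discard_ms",             # 15 fields: 4.18+
       "flushes_completed", "flush_ms",               # 17 fields: with flush stats
   ]

   def read_diskstats(path="/proc/diskstats"):
       stats = {}
       with open(path) as f:
           for line in f:
               parts = line.split()              # major minor name fields...
               values = [int(v) for v in parts[3:]]
               stats[parts[2]] = dict(zip(FIELDS[:len(values)], values))
       return stats

Note that on kernels older than 2.6.25, partition lines carry the
different four-field layout described under "Disks vs Partitions" below,
which this naive slice would mislabel.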
Field  1 -- # of reads completed (unsigned long)
    This is the total number of reads completed successfully.

Field  2 -- # of reads merged, field 6 -- # of writes merged (unsigned long)
    Reads and writes which are adjacent to each other may be merged for
    efficiency. Thus two 4K reads may become one 8K read before it is
    ultimately handed to the disk, and so it will be counted (and queued)
    as only one I/O. This field lets you know how often this was done.

Field  3 -- # of sectors read (unsigned long)
    This is the total number of sectors read successfully.

Field  4 -- # of milliseconds spent reading (unsigned int)
    This is the total number of milliseconds spent by all reads (as
    measured from __make_request() to end_that_request_last()).

Field  5 -- # of writes completed (unsigned long)
    This is the total number of writes completed successfully.

Field  6 -- # of writes merged (unsigned long)
    See the description of field 2.

Field  7 -- # of sectors written (unsigned long)
    This is the total number of sectors written successfully.

Field  8 -- # of milliseconds spent writing (unsigned int)
    This is the total number of milliseconds spent by all writes (as
    measured from __make_request() to end_that_request_last()).

Field  9 -- # of I/Os currently in progress (unsigned int)
    The only field that should go to zero. Incremented as requests are
    given to the appropriate struct request_queue and decremented as
    they finish.

Field 10 -- # of milliseconds spent doing I/Os (unsigned int)
    This field increases so long as field 9 is nonzero.

    Since 5.0 this field counts jiffies during which at least one request
    was started or completed. If a request runs for more than two
    jiffies, some of its I/O time might not be accounted for when
    requests run concurrently.

Field 11 -- weighted # of milliseconds spent doing I/Os (unsigned int)
    This field is incremented at each I/O start, I/O completion, I/O
    merge, or read of these stats by the number of I/Os in progress
    (field 9) times the number of milliseconds spent doing I/O since the
    last update of this field. This can provide an easy measure of both
    I/O completion time and the backlog that may be accumulating.

Field 12 -- # of discards completed (unsigned long)
    This is the total number of discards completed successfully.

Field 13 -- # of discards merged (unsigned long)
    See the description of field 2.

Field 14 -- # of sectors discarded (unsigned long)
    This is the total number of sectors discarded successfully.

Field 15 -- # of milliseconds spent discarding (unsigned int)
    This is the total number of milliseconds spent by all discards (as
    measured from __make_request() to end_that_request_last()).

Field 16 -- # of flush requests completed
    This is the total number of flush requests completed successfully.

    The block layer combines flush requests and executes at most one at
    a time. This counts flush requests executed by the disk. Not tracked
    for partitions.

Field 17 -- # of milliseconds spent flushing
    This is the total number of milliseconds spent by all flush requests.
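
Fields 1, 4, 5, 8, 10, and 11 are the raw material from which tools like
``iostat`` derive their per-second figures. As a hedged sketch of that
arithmetic, reusing the hypothetical ``read_diskstats()`` helper from the
earlier sketch (all names are illustrative)::

   import time

   def iostat_like(name, interval=1.0):
       """Snapshot twice and derive per-second metrics for one device."""
       s1 = read_diskstats()[name]
       time.sleep(interval)
       s2 = read_diskstats()[name]

       dt_ms  = interval * 1000.0
       reads  = s2["reads_completed"]  - s1["reads_completed"]
       writes = s2["writes_completed"] - s1["writes_completed"]

       # Field 10 -> %util; field 11 -> average queue size.
       util  = (s2["io_ms"] - s1["io_ms"]) / dt_ms * 100.0
       avgqu = (s2["weighted_io_ms"] - s1["weighted_io_ms"]) / dt_ms

       # Fields 4 and 8 over completed I/Os -> average completion time.
       ios = reads + writes
       await_ms = ((s2["read_ms"] - s1["read_ms"]) +
                   (s2["write_ms"] - s1["write_ms"])) / ios if ios else 0.0

       return {"r/s": reads / interval, "w/s": writes / interval,
               "%util": util, "avgqu-sz": avgqu, "await_ms": await_ms}

Because of the 5.0 change to field 10 noted above, the derived ``%util``
may understate utilization on recent kernels when long-running requests
execute concurrently.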

To avoid introducing performance bottlenecks, no locks are held while
modifying these counters. This implies that minor inaccuracies may be
introduced when changes collide, so (for instance) adding up all the
read I/Os issued per partition should equal those made to the disks ...
but due to the lack of locking it may only be very close.

In 2.6+, there are counters for each CPU, which make the lack of locking
almost a non-issue. When the statistics are read, the per-CPU counters
are summed (possibly overflowing the unsigned long variable they are
summed to) and the result given to the user. There is no convenient
user interface for accessing the per-CPU counters themselves.

Since 4.19 request times are measured with nanosecond precision and
truncated to milliseconds before being shown in this interface.
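
The practical consequence of wrapping, and of the possible overflow when
per-CPU counters are summed, is that tools should compute deltas between
snapshots modulo the counter width rather than by plain subtraction. A
minimal sketch, with the bit width as an explicit assumption::

   def counter_delta(new, old, width=32):
       """Delta between two samples of a cumulative counter, tolerating
       one wrap of an unsigned counter of the given bit width."""
       return (new - old) % (1 << width)

This recovers the true difference provided the counter wrapped at most
once between the two samples, which is exactly the condition given
above: the fields should not wrap twice before you notice it.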

Disks vs Partitions
-------------------

There were significant changes between 2.4 and 2.6+ in the I/O subsystem.
As a result, some statistic information disappeared. The translation from
a disk address relative to a partition to the disk address relative to
the host disk happens much earlier. All merges and timings now happen
at the disk level rather than at both the disk and partition level as
in 2.4. Consequently, you'll see a different statistics output on 2.6+
for partitions from that for disks. There are only *four* fields
available for partitions on 2.6+ machines prior to 2.6.25. This is
reflected in the examples above.

Field  1 -- # of reads issued
    This is the total number of reads issued to this partition.

Field  2 -- # of sectors read
    This is the total number of sectors requested to be read from this
    partition.

Field  3 -- # of writes issued
    This is the total number of writes issued to this partition.

Field  4 -- # of sectors written
    This is the total number of sectors requested to be written to
    this partition.

Note that since the address is translated to a disk-relative one, and no
record of the partition-relative address is kept, the subsequent success
or failure of the read cannot be attributed to the partition. In other
words, the number of reads for partitions is counted slightly before the
time of queuing for partitions, and at completion for whole disks. This
is a subtle distinction that is probably uninteresting for most cases.

More significant is the error induced by counting the numbers of
reads/writes before merges for partitions and after for disks. Since a
typical workload usually contains a lot of successive and adjacent
requests, the number of reads/writes issued can be several times higher
than the number of reads/writes completed.

Since 2.6.25, the full statistic set is again available for partitions,
and disk and partition statistics are consistent again. Since we still
don't keep a record of the partition-relative address, an operation is
attributed to the partition which contains the first sector of the
request after the eventual merges. As requests can be merged across
partitions, this could lead to some (probably insignificant) inaccuracy.
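
On a 2.6.25+ kernel you can check how close disk and partition statistics
actually are by comparing a whole disk against the sum over its
partitions. A short sketch, reusing the hypothetical ``read_diskstats()``
helper from earlier (the device names in the example are illustrative)::

   def partition_consistency(disk, partitions):
       """Compare a disk's completed reads with the sum over its
       partitions. Lock-free accounting and cross-partition merges mean
       the two may differ slightly, but they should be very close."""
       stats = read_diskstats()
       disk_reads = stats[disk]["reads_completed"]
       part_reads = sum(stats[p]["reads_completed"] for p in partitions)
       return disk_reads, part_reads

   # e.g. partition_consistency("sda", ["sda1", "sda2"])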

Additional notes
----------------

In 2.6+, sysfs is not mounted by default. If your distribution of
Linux hasn't added it already, here's the line you'll want to add to
your ``/etc/fstab``::

   none /sys sysfs defaults 0 0


In 2.6+, all disk statistics were removed from ``/proc/stat``. In 2.4,
they appear in both ``/proc/partitions`` and ``/proc/stat``, although the
ones in ``/proc/stat`` take a very different format from those in
``/proc/partitions`` (see proc(5), if your system has it.)

-- ricklind@us.ibm.com