=====================
I/O statistics fields
=====================

Since 2.4.20 (and some versions before, with patches), and 2.5.45,
more extensive disk statistics have been introduced to help measure disk
activity. Tools such as ``sar`` and ``iostat`` typically interpret these
and do the work for you, but in case you are interested in creating your
own tools, the fields are explained here.

In 2.4, the information is found as additional fields in
``/proc/partitions``. In 2.6 and later, the same information is found in
two places: one is in the file ``/proc/diskstats``, and the other is
within the sysfs file system, which must be mounted in order to obtain
the information. Throughout this document we'll assume that sysfs
is mounted on ``/sys``, although of course it may be mounted anywhere.
Both ``/proc/diskstats`` and sysfs use the same source for the information
and so should not differ.

Here are examples of these different formats::

   2.4:
      3     0   39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
      3     1    9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030

   2.6+ sysfs:
      446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
      35486 38030 38030 38030

   2.6+ diskstats:
      3    0   hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
      3    1   hda1 35486 38030 38030 38030

   4.18+ diskstats:
      3    0   hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 0 0 0 0

On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you
have a choice of ``cat /sys/block/hda/stat`` or
``grep 'hda ' /proc/diskstats``.

The advantage of one over the other is that the sysfs choice works well
if you are watching a known, small set of disks. ``/proc/diskstats`` may
be a better choice if you are watching a large number of disks, because
you'll avoid the overhead of 50, 100, or 500 or more opens/closes with
each snapshot of your disk statistics.

In 2.4, the statistics fields are those after the device name. In
the above example, the first field of statistics would be 446216.
By contrast, in 2.6+ if you look at ``/sys/block/hda/stat``, you'll
find just the statistics fields, beginning with 446216. If you look at
``/proc/diskstats``, the fields will be preceded by the major and
minor device numbers, and device name. Each of these formats provides
the same fields, each meaning exactly the same things. The number of
fields has grown over time: 11 originally, 15 since 4.18 (which added
the discard statistics), and 17 once the flush statistics (fields 16
and 17 below) were added.
All fields except field 9 are cumulative since boot. Field 9 should
go to zero as I/Os complete; all others only increase (unless they
overflow and wrap). Wrapping might eventually occur on a very busy
or long-lived system, so applications should be prepared to deal with
it. Regarding wrapping, the types of the fields are either unsigned
int (32-bit) or unsigned long (32-bit or 64-bit, depending on your
machine) as noted per-field below. Unless your observations are very
spread in time, these fields should not wrap twice before you notice it.

Each set of stats only applies to the indicated device; if you want
system-wide stats you'll have to find all the devices and sum them all
up.
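
If you are writing your own tool, a minimal Python sketch (not part of
the kernel; the field names below are this document's invention) that
reads ``/proc/diskstats`` and labels however many of the 11, 15, or 17
fields the running kernel provides might look like this::

   # Hypothetical helper: parse /proc/diskstats into named fields.
   # The names are illustrative only; the kernel just prints numbers.
   FIELDS = [
       "reads_completed", "reads_merged", "sectors_read", "read_ms",
       "writes_completed", "writes_merged", "sectors_written", "write_ms",
       "ios_in_progress", "io_ms", "weighted_io_ms",  # 11 fields: all 2.6+
       "discards_completed", "discards_merged",
       "sectors_discarded", "discard_ms",             # 15 fields: 4.18+
       "flushes_completed", "flush_ms",               # 17 fields: with flush stats
   ]

   def read_diskstats(path="/proc/diskstats"):
       stats = {}
       with open(path) as f:
           for line in f:
               parts = line.split()              # major minor name fields...
               values = [int(v) for v in parts[3:]]
               stats[parts[2]] = dict(zip(FIELDS[:len(values)], values))
       return stats

Note that on kernels older than 2.6.25, partition lines carry the
different four-field layout described under "Disks vs Partitions" below,
which this naive slice would mislabel.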
Field  1 -- # of reads completed (unsigned long)
    This is the total number of reads completed successfully.

Field  2 -- # of reads merged, field 6 -- # of writes merged (unsigned long)
    Reads and writes which are adjacent to each other may be merged for
    efficiency. Thus two 4K reads may become one 8K read before it is
    ultimately handed to the disk, and so it will be counted (and queued)
    as only one I/O. This field lets you know how often this was done.

Field  3 -- # of sectors read (unsigned long)
    This is the total number of sectors read successfully.

Field  4 -- # of milliseconds spent reading (unsigned int)
    This is the total number of milliseconds spent by all reads (as
    measured from __make_request() to end_that_request_last()).

Field  5 -- # of writes completed (unsigned long)
    This is the total number of writes completed successfully.

Field  6 -- # of writes merged (unsigned long)
    See the description of field 2.

Field  7 -- # of sectors written (unsigned long)
    This is the total number of sectors written successfully.

Field  8 -- # of milliseconds spent writing (unsigned int)
    This is the total number of milliseconds spent by all writes (as
    measured from __make_request() to end_that_request_last()).

Field  9 -- # of I/Os currently in progress (unsigned int)
    The only field that should go to zero. Incremented as requests are
    given to the appropriate struct request_queue and decremented as
    they finish.

Field 10 -- # of milliseconds spent doing I/Os (unsigned int)
    This field increases so long as field 9 is nonzero.

    Since 5.0 this field counts jiffies during which at least one request
    was started or completed. If a request runs for more than two
    jiffies, some of its I/O time might not be accounted for when
    requests run concurrently.

Field 11 -- weighted # of milliseconds spent doing I/Os (unsigned int)
    This field is incremented at each I/O start, I/O completion, I/O
    merge, or read of these stats by the number of I/Os in progress
    (field 9) times the number of milliseconds spent doing I/O since the
    last update of this field. This can provide an easy measure of both
    I/O completion time and the backlog that may be accumulating.

Field 12 -- # of discards completed (unsigned long)
    This is the total number of discards completed successfully.

Field 13 -- # of discards merged (unsigned long)
    See the description of field 2.

Field 14 -- # of sectors discarded (unsigned long)
    This is the total number of sectors discarded successfully.

Field 15 -- # of milliseconds spent discarding (unsigned int)
    This is the total number of milliseconds spent by all discards (as
    measured from __make_request() to end_that_request_last()).

Field 16 -- # of flush requests completed
    This is the total number of flush requests completed successfully.

    The block layer combines flush requests and executes at most one at
    a time. This counts flush requests executed by the disk. Not tracked
    for partitions.

Field 17 -- # of milliseconds spent flushing
    This is the total number of milliseconds spent by all flush requests.
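
Fields 1, 4, 5, 8, 10, and 11 are the raw material from which tools like
``iostat`` derive their per-second figures. As a hedged sketch of that
arithmetic, reusing the hypothetical ``read_diskstats()`` helper from the
earlier sketch (all names are illustrative)::

   import time

   def iostat_like(name, interval=1.0):
       """Snapshot twice and derive per-second metrics for one device."""
       s1 = read_diskstats()[name]
       time.sleep(interval)
       s2 = read_diskstats()[name]

       dt_ms  = interval * 1000.0
       reads  = s2["reads_completed"]  - s1["reads_completed"]
       writes = s2["writes_completed"] - s1["writes_completed"]

       # Field 10 -> %util; field 11 -> average queue size.
       util  = (s2["io_ms"] - s1["io_ms"]) / dt_ms * 100.0
       avgqu = (s2["weighted_io_ms"] - s1["weighted_io_ms"]) / dt_ms

       # Fields 4 and 8 over completed I/Os -> average completion time.
       ios = reads + writes
       await_ms = ((s2["read_ms"] - s1["read_ms"]) +
                   (s2["write_ms"] - s1["write_ms"])) / ios if ios else 0.0

       return {"r/s": reads / interval, "w/s": writes / interval,
               "%util": util, "avgqu-sz": avgqu, "await_ms": await_ms}

Because of the 5.0 change to field 10 noted above, the derived ``%util``
may understate utilization on recent kernels when long-running requests
execute concurrently.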

To avoid introducing performance bottlenecks, no locks are held while
modifying these counters. This implies that minor inaccuracies may be
introduced when changes collide, so (for instance) adding up all the
read I/Os issued per partition should equal those made to the disks ...
but due to the lack of locking it may only be very close.

In 2.6+, there are counters for each CPU, which make the lack of locking
almost a non-issue. When the statistics are read, the per-CPU counters
are summed (possibly overflowing the unsigned long variable they are
summed to) and the result given to the user. There is no convenient
user interface for accessing the per-CPU counters themselves.

Since 4.19 request times are measured with nanosecond precision and
truncated to milliseconds before being shown in this interface.
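
The practical consequence of wrapping, and of the possible overflow when
per-CPU counters are summed, is that tools should compute deltas between
snapshots modulo the counter width rather than by plain subtraction. A
minimal sketch, with the bit width as an explicit assumption::

   def counter_delta(new, old, width=32):
       """Delta between two samples of a cumulative counter, tolerating
       one wrap of an unsigned counter of the given bit width."""
       return (new - old) % (1 << width)

This recovers the true difference provided the counter wrapped at most
once between the two samples, which is exactly the condition given
above: the fields should not wrap twice before you notice it.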

Disks vs Partitions
-------------------

There were significant changes between 2.4 and 2.6+ in the I/O subsystem.
As a result, some statistic information disappeared. The translation from
a disk address relative to a partition to the disk address relative to
the host disk happens much earlier. All merges and timings now happen
at the disk level rather than at both the disk and partition level as
in 2.4. Consequently, you'll see a different statistics output on 2.6+
for partitions from that for disks. There are only *four* fields
available for partitions on 2.6+ machines prior to 2.6.25. This is
reflected in the examples above.

Field  1 -- # of reads issued
    This is the total number of reads issued to this partition.

Field  2 -- # of sectors read
    This is the total number of sectors requested to be read from this
    partition.

Field  3 -- # of writes issued
    This is the total number of writes issued to this partition.

Field  4 -- # of sectors written
    This is the total number of sectors requested to be written to
    this partition.

Note that since the address is translated to a disk-relative one, and no
record of the partition-relative address is kept, the subsequent success
or failure of the read cannot be attributed to the partition. In other
words, the number of reads for partitions is counted slightly before the
time of queuing for partitions, and at completion for whole disks. This
is a subtle distinction that is probably uninteresting for most cases.

More significant is the error induced by counting the numbers of
reads/writes before merges for partitions and after for disks. Since a
typical workload usually contains a lot of successive and adjacent
requests, the number of reads/writes issued can be several times higher
than the number of reads/writes completed.

Since 2.6.25, the full statistic set is again available for partitions,
and disk and partition statistics are consistent again. Since we still
don't keep a record of the partition-relative address, an operation is
attributed to the partition which contains the first sector of the
request after the eventual merges. As requests can be merged across
partitions, this could lead to some (probably insignificant) inaccuracy.
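
On a 2.6.25+ kernel you can check how close disk and partition statistics
actually are by comparing a whole disk against the sum over its
partitions. A short sketch, reusing the hypothetical ``read_diskstats()``
helper from earlier (the device names in the example are illustrative)::

   def partition_consistency(disk, partitions):
       """Compare a disk's completed reads with the sum over its
       partitions. Lock-free accounting and cross-partition merges mean
       the two may differ slightly, but they should be very close."""
       stats = read_diskstats()
       disk_reads = stats[disk]["reads_completed"]
       part_reads = sum(stats[p]["reads_completed"] for p in partitions)
       return disk_reads, part_reads

   # e.g. partition_consistency("sda", ["sda1", "sda2"])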

Additional notes
----------------

In 2.6+, sysfs is not mounted by default. If your distribution of
Linux hasn't added it already, here's the line you'll want to add to
your ``/etc/fstab``::

   none /sys sysfs defaults 0 0


In 2.6+, all disk statistics were removed from ``/proc/stat``. In 2.4,
they appear in both ``/proc/partitions`` and ``/proc/stat``, although the
ones in ``/proc/stat`` take a very different format from those in
``/proc/partitions`` (see proc(5), if your system has it.)

-- ricklind@us.ibm.com