=====================
I/O statistics fields
=====================

Since 2.4.20 (and some versions before, with patches), and 2.5.45,
more extensive disk statistics have been introduced to help measure disk
activity. Tools such as ``sar`` and ``iostat`` typically interpret these and do
the work for you, but in case you are interested in creating your own
tools, the fields are explained here.

In 2.4, the information is found as additional fields in
``/proc/partitions``.  In 2.6 and later, the same information is found in two
places: one is in the file ``/proc/diskstats``, and the other is within
the sysfs file system, which must be mounted in order to obtain
the information. Throughout this document we'll assume that sysfs
is mounted on ``/sys``, although of course it may be mounted anywhere.
Both ``/proc/diskstats`` and sysfs use the same source for the information
and so should not differ.

Here are examples of these different formats::

   2.4:
      3     0   39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
      3     1    9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030

   2.6+ sysfs:
      446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
      35486    38030    38030    38030

   2.6+ diskstats:
      3    0   hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
      3    1   hda1 35486 38030 38030 38030

   4.18+ diskstats:
      3    0   hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 0 0 0 0

On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you have
a choice of ``cat /sys/block/hda/stat`` or ``grep 'hda ' /proc/diskstats``.

The advantage of one over the other is that the sysfs choice works well
if you are watching a known, small set of disks.  ``/proc/diskstats`` may
be a better choice if you are watching a large number of disks, because
you'll avoid the overhead of 50, 100, or 500 or more opens/closes with
each snapshot of your disk statistics.

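For a small, known set of disks, the per-device sysfs file can be read
directly. A minimal Python sketch of that approach (the helper name
``read_block_stat`` and the ``sysfs`` parameter are illustrative, not part
of any kernel interface):

```python
from pathlib import Path

def read_block_stat(dev, sysfs="/sys"):
    """Return the stat fields of one block device as a list of ints.

    Reads <sysfs>/block/<dev>/stat: one open/read/close per snapshot,
    which is cheap when only a few devices are watched.
    """
    text = Path(sysfs, "block", dev, "stat").read_text()
    return [int(field) for field in text.split()]
```

On a real system this would be called in a loop, e.g.
``read_block_stat("hda")``, once per sampling interval.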
In 2.4, the statistics fields are those after the device name. In
the above example, the first field of statistics would be 446216.
By contrast, in 2.6+ if you look at ``/sys/block/hda/stat``, you'll
find just the statistics fields, beginning with 446216.  If you look at
``/proc/diskstats``, the fields will be preceded by the major and
minor device numbers, and device name.  Each of these formats provides
the same fields with exactly the same meanings; how many are present
depends on the kernel version (the original set of 11 was extended to
15 by the discard fields in 4.18, and later to 17 by the flush fields).
All fields except field 9 are cumulative since boot.  Field 9 should
go to zero as I/Os complete; all others only increase (unless they
overflow and wrap). Wrapping might eventually occur on a very busy
or long-lived system, so applications should be prepared to deal with
it. Regarding wrapping, the types of the fields are either unsigned
int (32-bit) or unsigned long (32-bit or 64-bit, depending on your
machine), as noted per-field below. Unless your observations are very
spread out in time, these fields should not wrap twice before you notice it.

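Because the counters can wrap, rate calculations should subtract modulo the
counter width rather than assume monotonicity. A small sketch, assuming at
most one wrap between samples (the helper name is made up for illustration):

```python
def counter_delta(prev, cur, width=32):
    """Delta between two samples of a cumulative counter that may
    have wrapped once at 2**width (use width=64 for 64-bit fields)."""
    return (cur - prev) % (1 << width)

assert counter_delta(100, 150) == 50          # no wrap
assert counter_delta(0xFFFFFFF0, 0x10) == 32  # wrapped once
```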
Each set of stats only applies to the indicated device; if you want
system-wide stats you'll have to find all the devices and sum them all up.

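Summing up can be done by adding the field vectors of each whole disk
(adding partitions on top of their parent disks would double-count). A
sketch over hypothetical ``/proc/diskstats`` content; the ``system_totals``
helper is illustrative:

```python
SAMPLE = """\
   3    0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
  22    0 hdc 100 0 800 50 200 0 1600 90 0 120 140
"""

def system_totals(diskstats_text, whole_disks):
    """Element-wise sum of the stat fields of the named whole disks."""
    totals = []
    for line in diskstats_text.splitlines():
        parts = line.split()
        if len(parts) > 3 and parts[2] in whole_disks:
            fields = [int(f) for f in parts[3:]]
            totals = [a + b for a, b in zip(totals, fields)] if totals else fields
    return totals

totals = system_totals(SAMPLE, {"hda", "hdc"})
# totals[0] is then the system-wide number of reads completed
```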
Field  1 -- # of reads completed (unsigned long)
    This is the total number of reads completed successfully.

Field  2 -- # of reads merged, field 6 -- # of writes merged (unsigned long)
    Reads and writes which are adjacent to each other may be merged for
    efficiency.  Thus two 4K reads may become one 8K read before it is
    ultimately handed to the disk, and so it will be counted (and queued)
    as only one I/O.  This field lets you know how often this was done.

Field  3 -- # of sectors read (unsigned long)
    This is the total number of sectors read successfully.

Field  4 -- # of milliseconds spent reading (unsigned int)
    This is the total number of milliseconds spent by all reads (as
    measured from __make_request() to end_that_request_last()).

Field  5 -- # of writes completed (unsigned long)
    This is the total number of writes completed successfully.

Field  6 -- # of writes merged (unsigned long)
    See the description of field 2.

Field  7 -- # of sectors written (unsigned long)
    This is the total number of sectors written successfully.

Field  8 -- # of milliseconds spent writing (unsigned int)
    This is the total number of milliseconds spent by all writes (as
    measured from __make_request() to end_that_request_last()).

Field  9 -- # of I/Os currently in progress (unsigned int)
    The only field that should go to zero. Incremented as requests are
    given to the appropriate struct request_queue and decremented as they
    finish.

Field 10 -- # of milliseconds spent doing I/Os (unsigned int)
    This field increases so long as field 9 is nonzero.

    Since 5.0 this field counts jiffies during which at least one request
    was started or completed. If a request runs for more than 2 jiffies,
    some of its I/O time might not be accounted for when requests are
    concurrent.

Field 11 -- weighted # of milliseconds spent doing I/Os (unsigned int)
    This field is incremented at each I/O start, I/O completion, I/O
    merge, or read of these stats by the number of I/Os in progress
    (field 9) times the number of milliseconds spent doing I/O since the
    last update of this field.  This can provide an easy measure of both
    I/O completion time and the backlog that may be accumulating.

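Tools like ``iostat`` derive their per-interval metrics from two snapshots
of these fields. For example, the mean completion time per I/O and the mean
queue size can be estimated as below (the dictionary keys are descriptive
labels chosen here, not kernel identifiers, and real code must also handle
counter wrap):

```python
def avg_io_ms(prev, cur):
    """Mean milliseconds per completed read/write between snapshots:
    (delta field 4 + delta field 8) / (delta field 1 + delta field 5)."""
    ios = (cur["reads"] - prev["reads"]) + (cur["writes"] - prev["writes"])
    ms = (cur["read_ms"] - prev["read_ms"]) + (cur["write_ms"] - prev["write_ms"])
    return ms / ios if ios else 0.0

def avg_queue_size(prev_weighted_ms, cur_weighted_ms, elapsed_ms):
    """Mean number of I/Os in flight: delta of field 11 over wall time."""
    return (cur_weighted_ms - prev_weighted_ms) / elapsed_ms

prev = {"reads": 1000, "writes": 500, "read_ms": 4000, "write_ms": 2000}
cur = {"reads": 1100, "writes": 600, "read_ms": 4800, "write_ms": 2400}
# 200 new I/Os took 1200 ms in total -> 6.0 ms per I/O
```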
Field 12 -- # of discards completed (unsigned long)
    This is the total number of discards completed successfully.

Field 13 -- # of discards merged (unsigned long)
    See the description of field 2.

Field 14 -- # of sectors discarded (unsigned long)
    This is the total number of sectors discarded successfully.

Field 15 -- # of milliseconds spent discarding (unsigned int)
    This is the total number of milliseconds spent by all discards (as
    measured from __make_request() to end_that_request_last()).

Field 16 -- # of flush requests completed
    This is the total number of flush requests completed successfully.

    The block layer combines flush requests and executes at most one at a
    time.  This counts flush requests executed by the disk.  Not tracked
    for partitions.

Field 17 -- # of milliseconds spent flushing
    This is the total number of milliseconds spent by all flush requests.

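Putting the field list together, one ``/proc/diskstats`` line can be split
into named values. A sketch using the sample line from above (the field
names are descriptive labels chosen here, not kernel identifiers; kernels
older than 4.18 emit fewer fields, and ``zip`` simply stops early for them):

```python
FIELD_NAMES = [
    "reads", "reads_merged", "sectors_read", "read_ms",
    "writes", "writes_merged", "sectors_written", "write_ms",
    "in_flight", "io_ms", "weighted_io_ms",
    "discards", "discards_merged", "sectors_discarded", "discard_ms",
    "flushes", "flush_ms",
]

def parse_diskstats_line(line):
    """Return (major, minor, name, {field: value}) for one line."""
    parts = line.split()
    major, minor, name = int(parts[0]), int(parts[1]), parts[2]
    return major, minor, name, dict(zip(FIELD_NAMES, map(int, parts[3:])))

_, _, name, st = parse_diskstats_line(
    "3 0 hda 446216 784926 9550688 4382310 "
    "424847 312726 5922052 19310380 0 3376340 23705160")
# st["reads"] == 446216, st["weighted_io_ms"] == 23705160
```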
To avoid introducing performance bottlenecks, no locks are held while
modifying these counters.  This implies that minor inaccuracies may be
introduced when changes collide, so (for instance) adding up all the
read I/Os issued per partition should equal those made to the disks ...
but due to the lack of locking it may only be very close.

In 2.6+, there are counters for each CPU, which make the lack of locking
almost a non-issue.  When the statistics are read, the per-CPU counters
are summed (possibly overflowing the unsigned long variable they are
summed to) and the result given to the user.  There is no convenient
user interface for accessing the per-CPU counters themselves.

Since 4.19 request times are measured with nanosecond precision and
truncated to milliseconds before being shown in this interface.

Disks vs Partitions
-------------------

There were significant changes between 2.4 and 2.6+ in the I/O subsystem.
As a result, some statistical information disappeared. The translation from
a disk address relative to a partition to the disk address relative to
the host disk happens much earlier.  All merges and timings now happen
at the disk level rather than at both the disk and partition level as
in 2.4.  Consequently, you'll see different statistics output on 2.6+ for
partitions than for disks.  There are only *four* fields available
for partitions on 2.6+ machines.  This is reflected in the examples above.

Field  1 -- # of reads issued
    This is the total number of reads issued to this partition.

Field  2 -- # of sectors read
    This is the total number of sectors requested to be read from this
    partition.

Field  3 -- # of writes issued
    This is the total number of writes issued to this partition.

Field  4 -- # of sectors written
    This is the total number of sectors requested to be written to
    this partition.

Note that since the address is translated to a disk-relative one, and no
record of the partition-relative address is kept, the subsequent success
or failure of the read cannot be attributed to the partition.  In other
words, the number of reads for partitions is counted slightly before the
time of queuing for partitions, and at completion for whole disks.  This
is a subtle distinction that is probably uninteresting for most cases.

More significant is the error induced by counting the numbers of
reads/writes before merges for partitions and after for disks. Since a
typical workload usually contains a lot of successive and adjacent requests,
the number of reads/writes issued can be several times higher than the
number of reads/writes completed.

In 2.6.25, the full statistics set became available for partitions again,
making disk and partition statistics consistent once more. Since we still
don't keep a record of the partition-relative address, an operation is
attributed to the partition which contains the first sector of the request
after any merges. As requests can be merged across partitions, this could
lead to some (probably insignificant) inaccuracy.

Additional notes
----------------

In 2.6+, sysfs is not mounted by default.  If your distribution of
Linux hasn't added it already, here's the line you'll want to add to
your ``/etc/fstab``::

	none /sys sysfs defaults 0 0


In 2.6+, all disk statistics were removed from ``/proc/stat``.  In 2.4, they
appear in both ``/proc/partitions`` and ``/proc/stat``, although the ones in
``/proc/stat`` take a very different format from those in ``/proc/partitions``
(see proc(5), if your system has it.)

-- ricklind@us.ibm.com