=============================
Per-task statistics interface
=============================


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed for the following benefits:

- efficiently provide statistics during the lifetime of a task and on its exit
- provide a unified interface for multiple accounting subsystems
- allow extension by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct. Per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct, i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is the thread
group leader: a process is deemed alive as long as it has any task belonging
to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).

To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener. Using cpumasks limits the data received by any one
listener and assists in flow control over the netlink interface, as
explained in more detail below.

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics. Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.

Interface
---------

The user-kernel interface is encapsulated in include/linux/taskstats.h.

To avoid this documentation becoming obsolete as the interface evolves, only
an outline of the current version is given. taskstats.h always overrides the
description here.

struct taskstats is the common accounting structure for both per-pid and
per-tgid data. It is versioned and can be extended by each accounting subsystem
that is added to the kernel. The fields and their semantics are defined in the
taskstats.h file.

The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface.
The messages are in the format::

    +----------+- - -+-------------+-------------------+
    | nlmsghdr | Pad |  genlmsghdr | taskstats payload |
    +----------+- - -+-------------+-------------------+


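The framing in the diagram above can be sketched in a few lines of Python.
This is an illustration under stated assumptions, not a replacement for
getdelays.c: the helper names ``frame`` and ``unframe`` are made up here, the
generic netlink family id is taken as a parameter because it is assigned
dynamically and has to be looked up at run time, and the header layouts follow
the usual struct nlmsghdr and struct genlmsghdr definitions.

```python
import struct

# struct nlmsghdr: nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLMSG_HDR = struct.Struct("=IHHII")
# struct genlmsghdr: cmd, version, reserved
GENL_HDR = struct.Struct("=BBH")

NLM_F_REQUEST = 0x01
TASKSTATS_CMD_GET = 1  # value believed to match taskstats.h


def align4(n):
    """Round n up to the 4-byte netlink alignment."""
    return (n + 3) & ~3


def frame(family_id, cmd, payload, seq=1, port_id=0, version=1):
    """Wrap an already encoded attribute payload in both headers."""
    genl = GENL_HDR.pack(cmd, version, 0)
    total = NLMSG_HDR.size + len(genl) + len(payload)
    header = NLMSG_HDR.pack(total, family_id, NLM_F_REQUEST, seq, port_id)
    return header + genl + payload


def unframe(message):
    """Return (family_id, cmd, payload) from one received message."""
    total, family_id, _flags, _seq, _port = NLMSG_HDR.unpack_from(message, 0)
    offset = align4(NLMSG_HDR.size)  # the "Pad" box; zero bytes in practice
    cmd, _version, _reserved = GENL_HDR.unpack_from(message, offset)
    start = offset + align4(GENL_HDR.size)
    return family_id, cmd, message[start:total]
```

Both headers already have sizes that are multiples of four, so the pad shown
in the diagram is normally empty; the sketch still computes it so that the
offsets stay correct if that ever changes.
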
The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. Commands to get data on
   a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
   containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
   the task/process for which userspace wants statistics.

   Commands to register/deregister interest in exit data from a set of cpus
   consist of one attribute, of type
   TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK and contain a cpumask in the
   attribute payload. The cpumask is specified as an ASCII string of
   comma-separated cpu ranges, e.g. to listen to exit data from cpus
   1,2,3,5,7,8 the cpumask would be "1-3,5,7-8". If userspace forgets to
   deregister interest in cpus before closing the listening socket, the kernel
   cleans up its interest set over time. However, for the sake of efficiency,
   an explicit deregistration is advisable.

2. Response for a command: sent from the kernel in response to a userspace
   command. The payload is a series of three attributes of type:

   a) TASKSTATS_TYPE_AGGR_PID/TGID: attribute that contains no payload but
      indicates that a pid/tgid will be followed by some stats.

   b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose
      stats are being returned.

   c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
      same structure is used for both per-pid and per-tgid stats.

3. New message sent by the kernel whenever a task exits. The payload consists
   of a series of attributes of the following types:

   a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats

   b) TASKSTATS_TYPE_PID: contains exiting task's pid

   c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats

   d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be
      tgid+stats

   e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs

   f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's
      process


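The three payload kinds above are all sequences of netlink attributes, so one
encoder and one walker cover them. The following is a minimal sketch, not the
kernel's or getdelays.c's code. The helper names are invented for this example,
and the numeric constants are believed to match taskstats.h, which remains
authoritative. Appending a NUL terminator to the cpumask string is an
assumption, as is the expectation that the pid and stats attributes sit inside
the aggregate attribute's data area in practice, which is why the test walks
the aggregate attribute's contents a second time.

```python
import struct

NLA_HDR = struct.Struct("=HH")  # struct nlattr: nla_len, nla_type
NLA_TYPE_MASK = 0x3FFF          # strips the nested/byte-order flag bits

# Values believed to match taskstats.h; check the header before relying on them.
TASKSTATS_CMD_ATTR_PID = 1
TASKSTATS_CMD_ATTR_TGID = 2
TASKSTATS_CMD_ATTR_REGISTER_CPUMASK = 3
TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK = 4
TASKSTATS_TYPE_PID = 1
TASKSTATS_TYPE_STATS = 3
TASKSTATS_TYPE_AGGR_PID = 4


def pack_attr(attr_type, payload):
    """Encode one attribute: header, payload, padding to 4 bytes."""
    length = NLA_HDR.size + len(payload)  # nla_len excludes the padding
    return NLA_HDR.pack(length, attr_type) + payload + b"\0" * (-length % 4)


def pid_query(pid):
    """Command payload asking for the stats of one task."""
    return pack_attr(TASKSTATS_CMD_ATTR_PID, struct.pack("=I", pid))


def cpumask_command(mask, register=True):
    """Command payload (de)registering interest in exit data."""
    attr = (TASKSTATS_CMD_ATTR_REGISTER_CPUMASK if register
            else TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK)
    return pack_attr(attr, mask.encode("ascii") + b"\0")


def walk_attrs(buf):
    """Yield (type, payload) for each attribute; stop on a malformed one."""
    offset = 0
    while offset + NLA_HDR.size <= len(buf):
        length, attr_type = NLA_HDR.unpack_from(buf, offset)
        if length < NLA_HDR.size or offset + length > len(buf):
            break
        yield attr_type & NLA_TYPE_MASK, buf[offset + NLA_HDR.size:offset + length]
        offset += (length + 3) & ~3
```

A receiver would typically dispatch on the attribute type and silently skip
types it does not recognise, which keeps older tools working when new
attributes are added.
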
per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process, in addition to per-task stats, within the
kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the process level data accumulated also
gets sent to userspace (along with the per-task data).

When a user queries per-tgid data, the stats of all live threads in the group
are summed and added to the accumulated total for previously exited threads
of the same thread group.

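The bookkeeping described in this section can be modelled with a toy class.
This is only an illustration of the accumulate-on-exit idea using a single
made-up counter per thread. The class and method names are hypothetical and
do not correspond to kernel data structures.

```python
class ProcessStats:
    """Toy model of per-tgid accounting for one counter."""

    def __init__(self):
        self.exited_total = 0  # accumulated from threads that already exited
        self.live = {}         # tid -> current value for live threads

    def update(self, tid, value):
        """Record the current counter value of a live thread."""
        self.live[tid] = value

    def exit(self, tid):
        """Fold an exiting thread into the process-wide total.

        Returns the per-tgid total when the last thread exits, else None.
        """
        self.exited_total += self.live.pop(tid)
        return self.exited_total if not self.live else None

    def query(self):
        """Per-tgid answer: live threads plus everything that has exited."""
        return self.exited_total + sum(self.live.values())
```

The point of the design is that nothing process-wide is updated while threads
run; the total only changes at exit, and a query pays for the summation.
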
Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in the future:

1. Adding more fields to the end of the existing struct taskstats. Backward
   compatibility is ensured by the version number within the
   structure. Userspace will use only the fields of the struct that correspond
   to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
   interface to return them. Since userspace processes each netlink attribute
   independently, it can always ignore attributes whose type it does not
   understand (because it is using an older version of the interface).


Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.

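Option 1 relies on userspace reading only the prefix of the structure that its
own version knows about. A small sketch of that rule follows. The two layouts
and the field names ``cpu_ns`` and ``io_ns`` are invented for the example and
are NOT the real struct taskstats; the only detail carried over is that the
structure begins with a 16-bit version number.

```python
import struct

# Hypothetical toy layouts keyed by version; version 2 appends one field.
LAYOUTS = {
    1: (struct.Struct("=HxxQ"), ("version", "cpu_ns")),
    2: (struct.Struct("=HxxQQ"), ("version", "cpu_ns", "io_ns")),
}


def decode(buf, known_version):
    """Decode only the fields shared by the sender and this reader."""
    (version,) = struct.unpack_from("=H", buf, 0)
    # A newer sender still starts with the older layout, so an old reader
    # unpacks the prefix it understands; a new reader falls back to the
    # sender's older layout when the sender is behind.
    layout, names = LAYOUTS[min(version, known_version)]
    return dict(zip(names, layout.unpack_from(buf, 0)))
```

Because fields are only ever appended, both directions of version skew are
handled by the same ``min()`` expression.
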
Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data, leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
  listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
  each listener. In the extreme case, there could be one listener for each cpu.
  Users may also consider setting the cpu affinity of the listener to the subset
  of cpus to which it listens, especially if they are listening to just one cpu.

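Splitting cpus across listeners means producing one cpumask string per
listener in the comma-separated range format that the register command
expects. A possible sketch is shown below; the function names are
hypothetical, and handing each listener a contiguous block of cpus is just one
reasonable policy.

```python
def format_cpumask(cpus):
    """Collapse cpu numbers into comma-separated ranges."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu          # extend the current range
        else:
            ranges.append([cpu, cpu])    # start a new range
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def split_cpus(ncpus, nlisteners):
    """Return one cpumask string per listener, covering cpus 0..ncpus-1."""
    base, extra = divmod(ncpus, nlisteners)
    masks = []
    start = 0
    for i in range(nlisteners):
        count = base + (1 if i < extra else 0)
        if count:
            masks.append(format_cpumask(range(start, start + count)))
        start += count
    return masks
```

For example, ``format_cpumask([1, 2, 3, 5, 7, 8])`` yields ``"1-3,5,7-8"``.
On Linux a Python listener could additionally pin itself to its share with
``os.sched_setaffinity``.
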
Despite these measures, if userspace receives ENOBUFS error messages
indicating overflow of receive buffers, it should take measures to handle the
loss of data.
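
A listener's receive path might combine both ideas: enlarge the socket buffer
up front and treat ENOBUFS as a signal that records were dropped. The sketch
below assumes an already opened and registered socket is passed in; the
function names and the policy of simply counting overflows are illustrative
choices, not something the interface prescribes.

```python
import errno
import socket


def grow_receive_buffer(sock, size):
    """Request a larger receive buffer and report what the kernel granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def drain(sock, handle, limit, bufsize=65536):
    """Make up to `limit` receive attempts and return the overflow count.

    Each successfully received message is passed to handle(); an ENOBUFS
    error means the kernel dropped data for this socket, so it is counted
    rather than treated as fatal.
    """
    overflows = 0
    for _ in range(limit):
        try:
            data = sock.recv(bufsize)
        except OSError as exc:
            if exc.errno != errno.ENOBUFS:
                raise
            overflows += 1
            continue
        handle(data)
    return overflows
```

The granted buffer size can differ from the requested one because the kernel
applies its own limits, so the return value of ``grow_receive_buffer`` is worth
logging. A non-zero overflow count tells the caller that its totals are
incomplete and that it may want to re-query live tasks.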