==============================
General notification mechanism
==============================

The general notification mechanism is built on top of the standard pipe driver
whereby it effectively splices notification messages from the kernel into pipes
opened by userspace.  This can be used in conjunction with:

  * Key/keyring notifications


The notification buffers can be enabled by:

	"General setup"/"General notification queue"
	(CONFIG_WATCH_QUEUE)

This document has the following sections:

.. contents:: :local:


Overview
========

This facility appears as a pipe that is opened in a special mode.  The pipe's
internal ring buffer is used to hold messages that are generated by the kernel.
These messages are then read out by read().  splice() and similar calls are
disabled on such pipes because, under some circumstances, they need to revert
their additions to the ring, and those additions might have ended up
interleaved with notification messages.

The owner of the pipe has to tell the kernel which sources it would like to
watch through that pipe.  Only sources that have been connected to a pipe will
insert messages into it.  Note that a source may be bound to multiple pipes and
insert messages into all of them simultaneously.

Filters may also be placed on a pipe so that certain source types and
subevents can be ignored if they're not of interest.

A message will be discarded if there isn't a slot available in the ring or if
no preallocated message buffer is available.  In both of these cases, read()
will insert a WATCH_META_LOSS_NOTIFICATION message into the output buffer after
the last message currently in the buffer has been read.

Note that when producing a notification, the kernel does not wait for the
consumers to collect it, but rather just continues on.  This means that
notifications can be generated whilst spinlocks are held and also protects the
kernel from being held up indefinitely by a userspace malfunction.


Message Structure
=================

Notification messages begin with a short header::

	struct watch_notification {
		__u32	type:24;
		__u32	subtype:8;
		__u32	info;
	};

"type" indicates the source of the notification record and "subtype" indicates
the type of record from that source (see the Watch Sources section below).  The
type may also be "WATCH_TYPE_META".  This is a special record type generated
internally by the watch queue itself.  There are two subtypes:

  * WATCH_META_REMOVAL_NOTIFICATION
  * WATCH_META_LOSS_NOTIFICATION

The first indicates that an object on which a watch was installed was removed
or destroyed and the second indicates that some messages have been lost.

"info" carries several fields, including:

  * The length of the message in bytes, including the header (mask with
    WATCH_INFO_LENGTH and shift by WATCH_INFO_LENGTH__SHIFT).  This indicates
    the size of the record, which may be between 8 and 127 bytes.

  * The watch ID (mask with WATCH_INFO_ID and shift by WATCH_INFO_ID__SHIFT).
    This indicates the caller's ID of the watch, which may be between 0
    and 255.  Multiple watches may share a queue, and this provides a means to
    distinguish them.

  * A type-specific field (WATCH_INFO_TYPE_INFO).  This is set by the
    notification producer to indicate some meaning specific to the type and
    subtype.

Everything in info apart from the length can be used for filtering.

The header can be followed by supplementary information.  The format of this is
defined by the type and subtype.

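As an illustration of the masks and shifts described above, the three fields
of ``info`` could be unpacked in userspace roughly as follows.  This is only a
sketch: the ``EX_``-prefixed constants and the ``ex_`` names are hypothetical
and local to the example, and the constant values are assumed to mirror the
``WATCH_INFO_*`` definitions in ``<linux/watch_queue.h>``, which real code
should include instead::

	#include <assert.h>
	#include <stdint.h>

	/* Assumed to mirror the WATCH_INFO_* values in <linux/watch_queue.h>. */
	#define EX_INFO_LENGTH		0x0000007fu
	#define EX_INFO_LENGTH__SHIFT	0
	#define EX_INFO_ID		0x0000ff00u
	#define EX_INFO_ID__SHIFT	8
	#define EX_INFO_TYPE_INFO	0xffff0000u

	static_assert((EX_INFO_LENGTH & EX_INFO_ID) == 0,
		      "length and ID fields must not overlap");

	/* Hypothetical holder for the decoded fields. */
	struct ex_info {
		unsigned int	length;		/* Record size in bytes, header included */
		unsigned int	watch_id;	/* Caller-chosen watch ID, 0-255 */
		uint32_t	type_info;	/* Type-specific bits, left unshifted */
	};

	/* Split an info word into its length, watch ID and type-specific parts. */
	static struct ex_info ex_decode_info(uint32_t info)
	{
		struct ex_info out;

		out.length = (info & EX_INFO_LENGTH) >> EX_INFO_LENGTH__SHIFT;
		out.watch_id = (info & EX_INFO_ID) >> EX_INFO_ID__SHIFT;
		out.type_info = info & EX_INFO_TYPE_INFO;
		return out;
	}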

Watch List (Notification Source) API
====================================

A "watch list" is a list of watchers that are subscribed to a source of
notifications.  A list may be attached to an object (say a key or a superblock)
or may be global (say for device events).  From a userspace perspective, a
non-global watch list is typically referred to by reference to the object it
belongs to (such as using KEYCTL_WATCH_KEY and giving it a key serial number to
watch that specific key).

To manage a watch list, the following functions are provided:

  * ::

	void init_watch_list(struct watch_list *wlist,
			     void (*release_watch)(struct watch *watch));

    Initialise a watch list.  If ``release_watch`` is not NULL, then this
    indicates a function that should be called when the watch_list object is
    destroyed to discard any references the watch list holds on the watched
    object.

  * ``void remove_watch_list(struct watch_list *wlist);``

    This removes all of the watches subscribed to a watch_list and frees them
    and then destroys the watch_list object itself.


Watch Queue (Notification Output) API
=====================================

A "watch queue" is the buffer allocated by an application that notification
records will be written into.  The workings of this are hidden entirely inside
of the pipe device driver, but it is necessary to gain a reference to it to set
a watch.  These can be managed with:

  * ``struct watch_queue *get_watch_queue(int fd);``

    Since watch queues are indicated to the kernel by the fd of the pipe that
    implements the buffer, userspace must hand that fd through a system call.
    This can be used to look up an opaque pointer to the watch queue from the
    system call.

  * ``void put_watch_queue(struct watch_queue *wqueue);``

    This discards the reference obtained from ``get_watch_queue()``.


Watch Subscription API
======================

A "watch" is a subscription on a watch list, indicating the watch queue, and
thus the buffer, into which notification records should be written.  The watch
queue object may also carry filtering rules for that object, as set by
userspace.  Some parts of the watch struct can be set by the driver::

	struct watch {
		union {
			u32		info_id;	/* ID to be OR'd in to info field */
			...
		};
		void			*private;	/* Private data for the watched object */
		u64			id;		/* Internal identifier */
		...
	};

The ``info_id`` value should be an 8-bit number obtained from userspace and
shifted by WATCH_INFO_ID__SHIFT.  This is OR'd into the WATCH_INFO_ID field of
struct watch_notification::info when and if the notification is written into
the associated watch queue buffer.

The ``private`` field is the driver's data associated with the watch_list and
is cleaned up by the ``watch_list::release_watch()`` method.

The ``id`` field is the source's ID.  Notifications that are posted with a
different ID are ignored.

The following functions are provided to manage watches:

  * ``void init_watch(struct watch *watch, struct watch_queue *wqueue);``

    Initialise a watch object, setting its pointer to the watch queue, using
    appropriate barriering to avoid lockdep complaints.

  * ``int add_watch_to_object(struct watch *watch, struct watch_list *wlist);``

    Subscribe a watch to a watch list (notification source).  The
    driver-settable fields in the watch struct must have been set before this
    is called.

  * ::

	int remove_watch_from_object(struct watch_list *wlist,
				     struct watch_queue *wqueue,
				     u64 id, false);

    Remove a watch from a watch list, where the watch must match the specified
    watch queue (``wqueue``) and object identifier (``id``).  A notification
    (``WATCH_META_REMOVAL_NOTIFICATION``) is sent to the watch queue to
    indicate that the watch got removed.

  * ``int remove_watch_from_object(struct watch_list *wlist, NULL, 0, true);``

    Remove all the watches from a watch list.  It is expected that this will be
    called preparatory to destruction and that the watch list will be
    inaccessible to new watches by this point.  A notification
    (``WATCH_META_REMOVAL_NOTIFICATION``) is sent to the watch queue of each
    subscribed watch to indicate that the watch got removed.

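The ``info_id`` convention described earlier in this section amounts to simple
bit arithmetic, which can be modelled outside the kernel.  The following is a
userspace sketch, not kernel code: the ``ex_`` helpers are hypothetical, and
the ``EX_`` constants are assumed to mirror WATCH_INFO_ID and
WATCH_INFO_ID__SHIFT::

	#include <assert.h>
	#include <stdbool.h>
	#include <stdint.h>

	/* Assumed to mirror WATCH_INFO_ID and WATCH_INFO_ID__SHIFT. */
	#define EX_INFO_ID		0x0000ff00u
	#define EX_INFO_ID__SHIFT	8

	static_assert((0xffu << EX_INFO_ID__SHIFT) == EX_INFO_ID,
		      "an 8-bit ID must exactly fill the ID field");

	/* Turn a userspace-supplied ID into an info_id; false if not 8-bit. */
	static bool ex_make_info_id(long user_id, uint32_t *info_id)
	{
		if (user_id < 0 || user_id > 0xff)
			return false;
		*info_id = (uint32_t)user_id << EX_INFO_ID__SHIFT;
		return true;
	}

	/* Model of OR-ing the watch's ID into a record's info word. */
	static uint32_t ex_apply_info_id(uint32_t info, uint32_t info_id)
	{
		return info | info_id;
	}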

Notification Posting API
========================

To post a notification to a watch list so that the subscribed watches can see
it, the following function should be used::

	void post_watch_notification(struct watch_list *wlist,
				     struct watch_notification *n,
				     const struct cred *cred,
				     u64 id);

The notification should be preformatted and a pointer to the header (``n``)
should be passed in.  The notification may be larger than this and the size in
bytes is noted in ``n->info & WATCH_INFO_LENGTH``.

The ``cred`` struct indicates the credentials of the source (subject) and is
passed to the LSMs, such as SELinux, to allow or suppress the recording of the
note in each individual queue according to the credentials of that queue
(object).

The ``id`` is the ID of the source object (such as the serial number on a key).
Only watches that have the same ID set in them will see this notification.

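To make "preformatted" more concrete, a record carrying supplementary data
might be laid out and filled in as sketched below.  This is a userspace model
under stated assumptions rather than kernel code: ``struct ex_event``, its
``object_id`` and ``aux`` payload fields and ``ex_format_event()`` are
hypothetical, ``struct ex_notification`` is a local copy of the header shown
under "Message Structure", and the ``EX_`` constants are assumed to mirror
WATCH_INFO_LENGTH and WATCH_INFO_LENGTH__SHIFT::

	#include <assert.h>
	#include <stdint.h>
	#include <string.h>

	/* Assumed to mirror WATCH_INFO_LENGTH and WATCH_INFO_LENGTH__SHIFT. */
	#define EX_INFO_LENGTH		0x0000007fu
	#define EX_INFO_LENGTH__SHIFT	0

	/* Local copy of the header layout shown under "Message Structure". */
	struct ex_notification {
		uint32_t	type:24;
		uint32_t	subtype:8;
		uint32_t	info;
	};

	/* Hypothetical record: the header followed by two words of payload. */
	struct ex_event {
		struct ex_notification	hdr;
		uint32_t		object_id;
		uint32_t		aux;
	};

	static_assert(sizeof(struct ex_event) <= EX_INFO_LENGTH,
		      "record must fit in the length field");

	/* Fill in a record, noting its total size in the info word. */
	static void ex_format_event(struct ex_event *ev, unsigned int type,
				    unsigned int subtype, uint32_t object_id,
				    uint32_t aux)
	{
		memset(ev, 0, sizeof(*ev));
		ev->hdr.type = type;
		ev->hdr.subtype = subtype;
		/* Size in bytes, header included; the ID bits are left clear. */
		ev->hdr.info = (uint32_t)sizeof(*ev) << EX_INFO_LENGTH__SHIFT;
		ev->object_id = object_id;
		ev->aux = aux;
	}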

Watch Sources
=============

Any particular buffer can be fed from multiple sources.  Sources include:

  * WATCH_TYPE_KEY_NOTIFY

    Notifications of this type indicate changes to keys and keyrings, including
    the changes of keyring contents or the attributes of keys.

    See Documentation/security/keys/core.rst for more information.


Event Filtering
===============

Once a watch queue has been created, a set of filters can be applied to limit
the events that are received using::

	struct watch_notification_filter filter = {
		...
	};
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);

The filter description is a variable of type::

	struct watch_notification_filter {
		__u32	nr_filters;
		__u32	__reserved;
		struct watch_notification_type_filter filters[];
	};

Where "nr_filters" is the number of filters in filters[] and "__reserved"
should be 0.  The "filters" array has elements of the following type::

	struct watch_notification_type_filter {
		__u32	type;
		__u32	info_filter;
		__u32	info_mask;
		__u32	subtype_filter[8];
	};

Where:

  * ``type`` is the event type to filter for and should be something like
    "WATCH_TYPE_KEY_NOTIFY"

  * ``info_filter`` and ``info_mask`` act as a filter on the info field of the
    notification record.  The notification is only written into the buffer if::

	(watch.info & info_mask) == info_filter

    This could be used, for example, to ignore events that are not exactly on
    the watched point in a mount tree.

  * ``subtype_filter`` is a bitmask indicating the subtypes that are of
    interest.  Bit 0 of subtype_filter[0] corresponds to subtype 0, bit 1 to
    subtype 1, and so on.

If the argument to the ioctl() is NULL, then the filters will be removed and
all events from the watched sources will come through.

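The rules for a single filter entry can be modelled in userspace as sketched
below.  This is a simplified illustration of the conditions described above,
not the kernel's implementation: ``struct ex_type_filter`` is a local copy of
the entry layout, the ``ex_`` helpers are hypothetical, and only one entry is
considered at a time::

	#include <assert.h>
	#include <stdbool.h>
	#include <stdint.h>

	/* Local copy of the filter entry layout shown above. */
	struct ex_type_filter {
		uint32_t	type;
		uint32_t	info_filter;
		uint32_t	info_mask;
		uint32_t	subtype_filter[8];
	};

	/* Mark a subtype (0-255) as being of interest. */
	static void ex_filter_want_subtype(struct ex_type_filter *f,
					   unsigned int subtype)
	{
		assert(subtype < 256);
		f->subtype_filter[subtype / 32] |= UINT32_C(1) << (subtype % 32);
	}

	/* Would a record with this type, subtype and info pass this entry? */
	static bool ex_filter_entry_matches(const struct ex_type_filter *f,
					    uint32_t type, unsigned int subtype,
					    uint32_t info)
	{
		assert(subtype < 256);
		if (type != f->type)
			return false;
		if (!(f->subtype_filter[subtype / 32] &
		      (UINT32_C(1) << (subtype % 32))))
			return false;
		return (info & f->info_mask) == f->info_filter;
	}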

Userspace Code Example
======================

A buffer is created with something like the following::

	pipe2(fds, O_NOTIFICATION_PIPE);
	ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, 256);

It can then be set to receive keyring change notifications::

	keyctl(KEYCTL_WATCH_KEY, KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);

The notifications can then be consumed by something like the following::

	static void consumer(int rfd)
	{
		unsigned char buffer[128];
		ssize_t buf_len;

		while (buf_len = read(rfd, buffer, sizeof(buffer)),
		       buf_len > 0
		       ) {
			unsigned char *p = buffer;
			unsigned char *end = buffer + buf_len;

			while (p < end) {
				union {
					struct watch_notification n;
					unsigned char buf1[128];
				} n;
				size_t largest, len;

				largest = end - p;
				if (largest > 128)
					largest = 128;
				/* A record cannot be shorter than its header */
				if (largest < sizeof(n.n))
					return;
				memcpy(&n, p, largest);

				len = (n.n.info & WATCH_INFO_LENGTH) >>
					WATCH_INFO_LENGTH__SHIFT;
				if (len == 0 || len > largest)
					return;

				switch (n.n.type) {
				case WATCH_TYPE_META:
					got_meta(&n.n);
					break;
				case WATCH_TYPE_KEY_NOTIFY:
					saw_key_change(&n.n);
					break;
				}

				p += len;
			}
		}
	}
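
To exercise the record-walking logic without a notification pipe, the inner
loop can be factored into a function that operates on an in-memory buffer and
reports whether a loss notification was seen.  The following is a sketch
under stated assumptions: the ``ex_`` names are hypothetical,
``struct ex_notification`` is a local copy of the header layout, and the
``EX_`` constants are assumed to mirror WATCH_INFO_LENGTH, WATCH_TYPE_META and
WATCH_META_LOSS_NOTIFICATION from ``<linux/watch_queue.h>``::

	#include <assert.h>
	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	/* Assumed to mirror the values in <linux/watch_queue.h>. */
	#define EX_INFO_LENGTH	0x0000007fu
	#define EX_TYPE_META	0
	#define EX_META_LOSS	1

	/* Local copy of the header layout shown under "Message Structure". */
	struct ex_notification {
		uint32_t	type:24;
		uint32_t	subtype:8;
		uint32_t	info;
	};

	static_assert(sizeof(struct ex_notification) == 8,
		      "the header is expected to be 8 bytes");

	/*
	 * Walk the records in buf[0..len).  Returns the number of complete
	 * records, or -1 if a record is malformed or truncated.  *lost is set
	 * if a loss notification was among the records walked.
	 */
	static int ex_walk(const unsigned char *buf, size_t len, bool *lost)
	{
		size_t off = 0;
		int count = 0;

		*lost = false;
		while (off < len) {
			struct ex_notification n;
			size_t rec_len;

			if (len - off < sizeof(n))
				return -1;
			memcpy(&n, buf + off, sizeof(n));

			rec_len = n.info & EX_INFO_LENGTH;
			if (rec_len < sizeof(n) || rec_len > len - off)
				return -1;

			if (n.type == EX_TYPE_META && n.subtype == EX_META_LOSS)
				*lost = true;

			off += rec_len;
			count++;
		}
		return count;
	}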