===============
BPF ring buffer
===============

This document describes BPF ring buffer design, API, and implementation details.

.. contents::
    :local:
    :depth: 2

Motivation
----------

Two distinct motivations, neither satisfied by the existing perf buffer,
prompted the creation of a new ring buffer implementation:

- more efficient memory utilization by sharing a ring buffer across CPUs;
- preserving the ordering of events that happen sequentially in time, even
  across multiple CPUs (e.g., fork/exec/exit events for a task).

These two problems are independent, but the perf buffer satisfies neither.
Both are a result of the choice to have a per-CPU perf ring buffer, and both
can be solved by a multi-producer, single-consumer (MPSC) ring buffer
implementation. The ordering problem could technically be solved for the perf
buffer with some in-kernel counting, but given that the first problem requires
an MPSC buffer anyway, the same solution solves the second problem
automatically.

Semantics and APIs
------------------

A single ring buffer is presented to BPF programs as an instance of a BPF map
of type ``BPF_MAP_TYPE_RINGBUF``. Two other alternatives were considered, but
ultimately rejected.

One way would be to, similar to ``BPF_MAP_TYPE_PERF_EVENT_ARRAY``, make
``BPF_MAP_TYPE_RINGBUF`` represent an array of ring buffers, but without
enforcing the "same CPU only" rule. This would be a more familiar interface,
compatible with existing perf buffer use in BPF, but it would fail if an
application needed more advanced logic to look up a ring buffer by an
arbitrary key. With the current approach, ``BPF_MAP_TYPE_HASH_OF_MAPS``
addresses this need. Additionally, given the performance of BPF ringbuf, many
use cases would just opt into a simple single ring buffer shared among all
CPUs, for which the array-of-ring-buffers approach would be overkill.

Another approach could introduce a new concept, alongside the BPF map, to
represent a generic "container" object, which wouldn't necessarily have the
key/value interface with lookup/update/delete operations. This approach would
add a lot of extra infrastructure that would have to be built for
observability and verifier support. It would also add another concept for BPF
developers to familiarize themselves with, new syntax in libbpf, etc., yet
would provide no real additional benefits over the approach of using a map.
``BPF_MAP_TYPE_RINGBUF`` doesn't support lookup/update/delete operations, but
neither do a few other map types (e.g., queue and stack; array doesn't
support delete, etc.).

The chosen approach has the advantage of re-using existing BPF map
infrastructure (introspection APIs in kernel, libbpf support, etc.), being
a familiar concept (no need to teach users a new type of object in a BPF
program), and utilizing existing tooling (bpftool). For the common scenario
of using a single ring buffer for all CPUs, it's as simple and straightforward
as it would be with a dedicated "container" object. On the other hand, by
being a map, it can be combined with ``ARRAY_OF_MAPS`` and ``HASH_OF_MAPS``
map-in-maps to implement a wide variety of topologies, from one ring buffer
for each CPU (e.g., as a replacement for perf buffer use cases) to complicated
application-level hashing/sharding of ring buffers (e.g., a small pool of ring
buffers with the hash of a task's tgid as the lookup key, to preserve order
while reducing contention), as sketched below.

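As an illustration of such a sharded topology, here is a rough sketch using
libbpf's BTF-style map-in-map declarations, modeled on the
``test_ringbuf_multi`` selftest. The shard count, map names, and sharding by
tgid are assumptions chosen for this example:

.. code-block:: c

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Inner ring buffer "template" map; each shard is a full ringbuf. */
    struct ringbuf_map {
            __uint(type, BPF_MAP_TYPE_RINGBUF);
            __uint(max_entries, 256 * 1024);
    } rb_shard0 SEC(".maps"),
      rb_shard1 SEC(".maps");

    /* Outer array forming a small pool of ring buffers. */
    struct {
            __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
            __uint(max_entries, 2);
            __type(key, int);
            __array(values, struct ringbuf_map);
    } rb_pool SEC(".maps") = {
            .values = { [0] = &rb_shard0, [1] = &rb_shard1 },
    };

    SEC("tracepoint/sched/sched_process_exit")
    int pick_shard(void *ctx)
    {
            /* Shard by tgid to preserve per-task event ordering. */
            int key = (bpf_get_current_pid_tgid() >> 32) % 2;
            void *rb = bpf_map_lookup_elem(&rb_pool, &key);

            if (!rb)
                    return 0;
            /* rb can now be passed to the ringbuf helpers described below. */
            return 0;
    }
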
Key and value sizes are enforced to be zero. ``max_entries`` is used to
specify the size of the ring buffer (in bytes) and has to be a power of 2.

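For example, a minimal libbpf-style (BTF-defined) declaration of such a map
from within a BPF program could look like the following sketch; the map name
``rb`` and the 256 KB size are arbitrary choices for this example:

.. code-block:: c

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_RINGBUF);
            /* Size in bytes; must be a power of 2 (and, in practice,
             * a multiple of the page size).
             */
            __uint(max_entries, 256 * 1024);
    } rb SEC(".maps");
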
There are a bunch of similarities between perf buffer
(``BPF_MAP_TYPE_PERF_EVENT_ARRAY``) and the new BPF ring buffer semantics:

- variable-length records;
- if there is no more space left in the ring buffer, reservation fails, with
  no blocking;
- memory-mappable data area for user-space applications, for ease of
  consumption and high performance;
- epoll notifications for new incoming data;
- but still the ability to busy-poll for new data to achieve the lowest
  latency, if necessary.

BPF ringbuf provides two sets of APIs to BPF programs:

- ``bpf_ringbuf_output()`` allows *copying* data from one place to a ring
  buffer, similarly to ``bpf_perf_event_output()``;
- ``bpf_ringbuf_reserve()``/``bpf_ringbuf_submit()``/``bpf_ringbuf_discard()``
  APIs split the whole process into two steps. First, a fixed amount of space
  is reserved. If successful, a pointer to data inside the ring buffer data
  area is returned, which BPF programs can use similarly to data inside
  array/hash maps. Once ready, this piece of memory is either submitted or
  discarded. Discard is similar to submit, but makes the consumer ignore the
  record.

``bpf_ringbuf_output()`` has the disadvantage of incurring an extra memory
copy, because the record has to be prepared somewhere else first. On the other
hand, it allows submitting records of a length that isn't known to the
verifier beforehand. It also closely matches ``bpf_perf_event_output()``, so
it will simplify migration significantly.

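A sketch of the ``bpf_ringbuf_output()`` style, where the sample is prepared
on the BPF stack first; ``struct event`` is illustrative and the ``rb`` map is
assumed to be declared as in the earlier sketch:

.. code-block:: c

    struct event {
            int pid;
            char comm[16];
    };

    SEC("tracepoint/sched/sched_process_exit")
    int handle_exit(void *ctx)
    {
            struct event e = {};

            e.pid = bpf_get_current_pid_tgid() >> 32;
            bpf_get_current_comm(e.comm, sizeof(e.comm));

            /* Extra copy: e is built on the stack, then copied into rb. */
            bpf_ringbuf_output(&rb, &e, sizeof(e), 0);
            return 0;
    }
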
``bpf_ringbuf_reserve()`` avoids the extra copy of memory by providing
a pointer directly into the ring buffer memory. In a lot of cases records are
larger than BPF stack space allows, so many programs have to use an extra
per-CPU array as a temporary heap for preparing a sample.
``bpf_ringbuf_reserve()`` avoids this need completely. In exchange, it only
allows reserving a known, constant size of memory, so that the verifier can
verify that the BPF program can't access memory outside its reserved record
space. ``bpf_ringbuf_output()``, while slightly slower due to the extra
memory copy, covers some use cases that are not suitable for
``bpf_ringbuf_reserve()``.

The difference between submit and discard is very small. Discard just marks
a record as discarded, and such records are supposed to be ignored by the
consumer code. Discard is useful for some advanced use-cases, such as ensuring
all-or-nothing multi-record submission, or emulating temporary
``malloc()``/``free()`` within a single BPF program invocation.

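Putting the pieces together, here is a minimal sketch of the reserve/submit
flow with a conditional discard; ``struct event`` and the ``rb`` map are
assumed from the earlier sketches, and the pid-based filtering condition is
purely illustrative:

.. code-block:: c

    SEC("tracepoint/sched/sched_process_exec")
    int handle_exec(void *ctx)
    {
            struct event *e;

            /* Reserve a fixed-size record; returns NULL if the buffer is
             * full (or, in NMI context, if the lock couldn't be taken).
             */
            e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
            if (!e)
                    return 0;

            e->pid = bpf_get_current_pid_tgid() >> 32;
            bpf_get_current_comm(e->comm, sizeof(e->comm));

            if (e->pid == 1) {
                    /* Consumer will skip this record entirely. */
                    bpf_ringbuf_discard(e, 0);
                    return 0;
            }

            /* Make the record visible to the consumer. */
            bpf_ringbuf_submit(e, 0);
            return 0;
    }
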
Each reserved record is tracked by the verifier through the existing
reference-tracking logic, similar to socket ref-tracking. It is thus
impossible to reserve a record but forget to submit (or discard) it.

The ``bpf_ringbuf_query()`` helper allows querying various properties of the
ring buffer. Currently 4 are supported:

- ``BPF_RB_AVAIL_DATA`` returns the amount of unconsumed data in the ring
  buffer;
- ``BPF_RB_RING_SIZE`` returns the size of the ring buffer;
- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` return the current logical position
  of the consumer/producer, respectively.

The returned values are momentary snapshots of the ring buffer state and could
be off by the time the helper returns, so this should be used only for
debugging/reporting purposes or for implementing various heuristics that take
into account the highly-changeable nature of some of those characteristics.

One such heuristic might involve more fine-grained control over poll/epoll
notifications about new data availability in the ring buffer. Together with
the ``BPF_RB_NO_WAKEUP``/``BPF_RB_FORCE_WAKEUP`` flags for the
output/submit/discard helpers, it allows a BPF program a high degree of
control and, e.g., more efficient batched notifications. The default
self-balancing strategy, though, should be adequate for most applications and
will already work reliably and efficiently.

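As a sketch of one such heuristic, the final submit in the example above could
suppress wakeups until enough data is pending; the 8 KB threshold is an
arbitrary assumption for illustration:

.. code-block:: c

    __u64 flags = BPF_RB_NO_WAKEUP;

    /* Only wake up the consumer once at least 8 KB is unconsumed. */
    if (bpf_ringbuf_query(&rb, BPF_RB_AVAIL_DATA) >= 8192)
            flags = BPF_RB_FORCE_WAKEUP;

    bpf_ringbuf_submit(e, flags);
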
Design and Implementation
-------------------------

This reserve/commit schema allows a natural way for multiple producers, either
on different CPUs or even on the same CPU/in the same BPF program, to reserve
independent records and work with them without blocking other producers. This
means that if a BPF program was interrupted by another BPF program sharing the
same ring buffer, they will both get a record reserved (provided there is
enough space left) and can work with it and submit it independently. This
applies to NMI context as well, except that due to the use of a spinlock
during reservation, in NMI context ``bpf_ringbuf_reserve()`` might fail to get
the lock, in which case the reservation will fail even if the ring buffer is
not full.

The ring buffer itself is internally implemented as a power-of-2 sized
circular buffer, with two logical and ever-increasing counters (which might
wrap around on 32-bit architectures; that's not a problem):

- the consumer counter shows up to which logical position the consumer has
  consumed the data;
- the producer counter denotes the amount of data reserved by all producers.

Each time a record is reserved, the producer that "owns" the record will
successfully advance the producer counter. At that point, the data is not yet
ready to be consumed, though. Each record has an 8-byte header, which contains
the length of the reserved record, as well as two extra bits: the busy bit, to
denote that the record is still being worked on, and the discard bit, which
might be set at commit time if the record is discarded. In the latter case,
the consumer is supposed to skip the record and move on to the next one. The
record header also encodes the record's relative offset from the beginning of
the ring buffer data area (in pages). This allows
``bpf_ringbuf_submit()``/``bpf_ringbuf_discard()`` to accept only the pointer
to the record itself, without requiring also a pointer to the ring buffer: the
ring buffer memory location will be restored from the record metadata header.
This significantly simplifies the verifier, as well as improving API
usability.

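For reference, the record header looks roughly like the following sketch,
mirroring ``struct bpf_ringbuf_hdr`` in ``kernel/bpf/ringbuf.c`` and the bit
definitions in ``include/linux/bpf.h``; consult the current sources for the
authoritative layout:

.. code-block:: c

    #define BPF_RINGBUF_BUSY_BIT    (1U << 31) /* still being written */
    #define BPF_RINGBUF_DISCARD_BIT (1U << 30) /* should be skipped */

    struct bpf_ringbuf_hdr {
            __u32 len;    /* record data length plus the two bits above */
            __u32 pg_off; /* offset (in pages) back to the ring buffer */
    };
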
Producer counter increments are serialized under a spinlock, so there is
a strict ordering between reservations. Commits, on the other hand, are
completely lockless and independent. All records become available to the
consumer in the order of reservations, but only after all previous records
were already committed. It is thus possible for slow producers to temporarily
hold off submitted records that were reserved later.

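A user-space consumer honors this by stopping at the first record whose busy
bit is still set. Below is a simplified sketch of that loop; the real
implementation (libbpf's ``ring_buffer``) additionally uses acquire/release
memory ordering and epoll, and ``process_sample()`` is a hypothetical
callback:

.. code-block:: c

    static void consume(void *data, unsigned long *consumer_pos,
                        const unsigned long *producer_pos,
                        unsigned long ring_size)
    {
            while (*consumer_pos < *producer_pos) {
                    __u32 *hdr = data + (*consumer_pos & (ring_size - 1));
                    __u32 len = *hdr & ~(BPF_RINGBUF_BUSY_BIT |
                                         BPF_RINGBUF_DISCARD_BIT);

                    if (*hdr & BPF_RINGBUF_BUSY_BIT)
                            break; /* reserved but not yet submitted */

                    if (!(*hdr & BPF_RINGBUF_DISCARD_BIT))
                            process_sample((char *)hdr + 8, len);

                    /* Skip 8-byte header + data, rounded up to 8 bytes. */
                    *consumer_pos += (len + 8 + 7) / 8 * 8;
            }
    }
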
One interesting implementation bit that significantly simplifies (and thus
speeds up) the implementation of both producers and consumers is that the data
area is mapped twice, contiguously back-to-back, in virtual memory. This makes
it possible to avoid any special handling for samples that wrap around at the
end of the circular buffer data area, because the page after the last data
page is the first data page again, so the sample still appears completely
contiguous in virtual memory. See the comment and a simple ASCII diagram
showing this visually in ``bpf_ringbuf_area_alloc()``.

Another feature that distinguishes BPF ringbuf from the perf ring buffer is
self-pacing notifications of new data availability. ``bpf_ringbuf_submit()``
will send a notification of a new record being available after the submission
only if the consumer has already caught up right up to the record being
committed. If not, the consumer still has to catch up and thus will see the
new data anyway, without needing an extra poll notification. Benchmarks (see
tools/testing/selftests/bpf/benchs/bench_ringbufs.c) show that this allows
achieving very high throughput without having to resort to tricks like
"notify only every Nth sample", which are necessary with the perf buffer. For
extreme cases, when a BPF program wants more manual control over
notifications, the submit/discard/output helpers accept the
``BPF_RB_NO_WAKEUP`` and ``BPF_RB_FORCE_WAKEUP`` flags, which give full
control over notifications of data availability, but require extra caution
and diligence in using this API.
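
On the user-space side, the usual way to consume these notifications is
libbpf's ``ring_buffer`` API, roughly as in this sketch; ``handle_event`` is
a hypothetical callback and ``map_fd`` is the FD of the
``BPF_MAP_TYPE_RINGBUF`` map:

.. code-block:: c

    #include <bpf/libbpf.h>

    /* Called once per submitted (non-discarded) record. */
    static int handle_event(void *ctx, void *data, size_t size)
    {
            /* Process one record; return 0 to continue consuming. */
            return 0;
    }

    int consume_ringbuf(int map_fd)
    {
            struct ring_buffer *rb;
            int err;

            rb = ring_buffer__new(map_fd, handle_event, NULL, NULL);
            if (!rb)
                    return -1;

            /* Blocks in epoll_wait(); woken up by ringbuf notifications. */
            do {
                    err = ring_buffer__poll(rb, 100 /* timeout, ms */);
            } while (err >= 0);

            ring_buffer__free(rb);
            return err;
    }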