==============
BPF Design Q&A
==============

BPF extensibility and applicability to networking, tracing, security
in the Linux kernel and several user space implementations of the BPF
virtual machine led to a number of misunderstandings on what BPF actually is.
This short QA is an attempt to address that and outline a direction
of where BPF is heading long term.

.. contents::
    :local:
    :depth: 3

Questions and Answers
=====================

Q: Is BPF a generic instruction set similar to x64 and arm64?
-------------------------------------------------------------
A: NO.

Q: Is BPF a generic virtual machine?
------------------------------------
A: NO.

BPF is a generic instruction set *with* C calling convention.
-------------------------------------------------------------

Q: Why was the C calling convention chosen?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A: Because BPF programs are designed to run in the Linux kernel,
which is written in C. Hence BPF defines an instruction set compatible
with the two most used architectures, x64 and arm64 (and takes into
consideration important quirks of other architectures), and
defines a calling convention that is compatible with the C calling
convention of the Linux kernel on those architectures.

Q: Can multiple return values be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. BPF allows only register R0 to be used as return value.

Q: Can more than 5 function arguments be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. The BPF calling convention only allows registers R1-R5 to be used
as arguments. BPF is not a standalone instruction set
(unlike the x64 ISA, which allows msft, cdecl and other conventions).
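
A common way to live with the five-argument limit in C is to pack the
inputs into a structure and pass a single pointer. The sketch below is
plain C rather than a loadable BPF program; ``struct flow_args`` and
``score_flow()`` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bundle of inputs that would not fit into five arguments. */
struct flow_args {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t sport;
	uint16_t dport;
	uint8_t  proto;
	uint8_t  tos;
	uint32_t mark;
};

/* One pointer argument (R1 under the BPF calling convention) replaces
 * seven scalar arguments; the single return value maps to R0. */
uint64_t score_flow(const struct flow_args *a)
{
	assert(a != NULL);
	return (uint64_t)a->saddr + a->daddr + a->sport + a->dport +
	       a->proto + a->tos + a->mark;
}
```

The caller fills the structure and passes its address, so the call uses
one argument register regardless of how many fields the structure has.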

Q: Can BPF programs access the instruction pointer or return address?
---------------------------------------------------------------------
A: NO.

Q: Can BPF programs access the stack pointer?
---------------------------------------------
A: NO.

Only the frame pointer (register R10) is accessible.
From the compiler's point of view it's necessary to have a stack pointer.
For example, LLVM defines register R11 as the stack pointer in its
BPF backend, but it makes sure that generated code never uses it.

Q: Does C-calling convention diminish possible use cases?
---------------------------------------------------------
A: YES.

BPF design forces addition of major functionality in the form
of kernel helper functions and kernel objects like BPF maps with
seamless interoperability between them. It lets the kernel call into
BPF programs and programs call kernel helpers with zero overhead,
as if all of them were native C code. That is particularly the case
for JITed BPF programs that are indistinguishable from
native kernel C code.

Q: Does it mean that 'innovative' extensions to BPF code are disallowed?
------------------------------------------------------------------------
A: Soft yes.

At least for now, until BPF core has support for
bpf-to-bpf calls, indirect calls, loops, global variables,
jump tables, read-only sections, and all other normal constructs
that C code can produce.

Q: Can loops be supported in a safe way?
----------------------------------------
A: It's not clear yet.

BPF developers are trying to find a way to
support bounded loops.

Q: What are the verifier limits?
--------------------------------
A: The only limit known to the user space is BPF_MAXINSNS (4096).
It's the maximum number of instructions that an unprivileged bpf
program can have. The verifier has various internal limits,
such as the maximum number of instructions that can be explored during
program analysis. Currently, that limit is set to 1 million,
which essentially means that the largest program can consist
of 1 million NOP instructions. There is a limit to the maximum number
of subsequent branches, a limit to the number of nested bpf-to-bpf
calls, a limit to the number of the verifier states per instruction,
and a limit to the number of maps used by the program.
All these limits can be hit with a sufficiently complex program.

There are also non-numerical limits that can cause the program
to be rejected. The verifier used to recognize only pointer + constant
expressions. Now it can recognize pointer + bounded_register.
bpf_map_lookup_elem(key) had a requirement that 'key' must be
a pointer to the stack. Now, 'key' can be a pointer to a map value.
The verifier is steadily getting 'smarter'. The limits are
being removed. The only way to know that the program is going to
be accepted by the verifier is to try to load it.
The bpf development process guarantees that future kernel
versions will accept all bpf programs that were accepted by
earlier versions.
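
The difference between pointer + constant and pointer + bounded_register
can be illustrated in plain C. This is only a sketch of the access
pattern, not verifier code; ``TABLE_SIZE``, ``table``, ``read_fixed()``
and ``read_bounded()`` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 16 /* hypothetical; a power of two so masking works */

static const uint8_t table[TABLE_SIZE] = {
	0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
};

/* pointer + constant: the offset is known at compile time. */
uint8_t read_fixed(void)
{
	return *(table + 3);
}

/* pointer + bounded register: idx is arbitrary at run time, but the
 * mask limits it to 0..TABLE_SIZE-1 before the pointer arithmetic. */
uint8_t read_bounded(uint64_t idx)
{
	idx &= TABLE_SIZE - 1;
	assert(idx < TABLE_SIZE); /* the range a checker would have to infer */
	return *(table + idx);
}
```

Without the mask, the range of ``idx`` is unknown and the access could
not be proven safe.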

Instruction level questions
---------------------------

Q: LD_ABS and LD_IND instructions vs C code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Q: How come LD_ABS and LD_IND instructions are present in BPF whereas
C code cannot express them and has to use builtin intrinsics?

A: This is an artifact of compatibility with classic BPF. Modern
networking code in BPF performs better without them.
See 'direct packet access'.

Q: BPF instructions mapping not one-to-one to native CPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: It seems not all BPF instructions are one-to-one to native CPU.
For example, why are BPF_JNE and other compare-and-jump instructions
not CPU-like?

A: This was necessary to avoid introducing flags into the ISA, which are
impossible to make generic and efficient across CPU architectures.

Q: Why doesn't the BPF_DIV instruction map to x64 div?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because if we picked a one-to-one relationship to x64 it would have made
it more complicated to support on arm64 and other archs. Also it
needs a div-by-zero runtime check.
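
As an illustration of the runtime check mentioned above, here is a
minimal sketch in plain C. Unlike a bare x64 ``div``, which traps on a
zero divisor, the divide is guarded. ``checked_udiv64()`` is a
hypothetical name, and what the runtime does after a failed check is
deliberately left out, since this answer does not specify it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true and stores the quotient when src is nonzero. Returns
 * false and leaves *dst untouched otherwise; what happens next is a
 * policy choice outside this sketch. */
bool checked_udiv64(uint64_t *dst, uint64_t src)
{
	if (src == 0)
		return false;
	*dst /= src;
	return true;
}
```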

Q: Why is there no BPF_SDIV for signed divide operation?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because it would be rarely used. LLVM errors out in such a case and
prints a suggestion to use unsigned divide instead.

Q: Why does BPF have an implicit prologue and epilogue?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because architectures like sparc have register windows and in general
there are enough subtle differences between architectures, so a naive
'store return address into stack' approach won't work. Another reason is
that BPF has to be safe from division by zero (and the legacy exception
path of the LD_ABS insn). Those instructions need to invoke the epilogue
and return implicitly.

Q: Why were BPF_JLT and BPF_JLE instructions not introduced in the beginning?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because classic BPF didn't have them and BPF authors felt that a compiler
workaround would be acceptable. It turned out that programs lose performance
due to the lack of these compare instructions, and they were added.
These two instructions are a perfect example of what kind of new BPF
instructions are acceptable and can be added in the future.
These two already had equivalent instructions in native CPUs.
New instructions that don't have a one-to-one mapping to HW instructions
will not be accepted.
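
The compiler workaround mentioned above can be sketched in plain C,
assuming it works by swapping operands: ``x < imm`` is rewritten as
``imm > x``, which needs the immediate in a register first. The function
names are hypothetical, and the functions only model the comparison
result, not real instruction encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for BPF_JGT: the branch is taken when dst > src (unsigned). */
static bool jgt(uint64_t dst, uint64_t src)
{
	return dst > src;
}

/* "x < imm" without BPF_JLT: swap the operands and reuse JGT. The
 * immediate is first copied into a scratch register, which is the
 * extra step a native less-than jump avoids. */
bool jlt_via_jgt(uint64_t x, uint64_t imm)
{
	uint64_t scratch = imm; /* extra register move */
	return jgt(scratch, x);
}

/* "x <= imm" handled the same way with a greater-or-equal compare. */
bool jle_via_jge(uint64_t x, uint64_t imm)
{
	uint64_t scratch = imm; /* extra register move */
	return scratch >= x;
}
```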

Q: BPF 32-bit subregister requirements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: BPF 32-bit subregisters have a requirement to zero the upper 32 bits of
BPF registers, which makes BPF an inefficient virtual machine for 32-bit
CPU architectures and 32-bit HW accelerators. Can true 32-bit registers
be added to BPF in the future?

A: NO.

But some optimizations on zero-ing the upper 32 bits for BPF registers are
available, and can be leveraged to improve the performance of JITed BPF
programs for 32-bit architectures.

Starting with version 7, LLVM is able to generate instructions that operate
on 32-bit subregisters, provided the option -mattr=+alu32 is passed for
compiling a program. Furthermore, the verifier can now mark the
instructions for which zero-ing the upper bits of the destination register
is required, and insert an explicit zero-extension (zext) instruction
(a mov32 variant). This means that for architectures without zext hardware
support, the JIT back-ends do not need to clear the upper bits for
subregisters written by alu32 instructions or narrow loads. Instead, the
back-ends simply need to support code generation for that mov32 variant,
and to override bpf_jit_needs_zext() to make it return "true" (in order to
enable zext insertion in the verifier).
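
The zero-extension rule can be modelled in plain C as follows. This is a
sketch of the semantics only, not JIT or verifier code; ``alu32_add()``
and ``zext32()`` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Models a 32-bit ALU add on a 64-bit register: only the low 32 bits
 * take part, and the upper 32 bits of the result are cleared. */
uint64_t alu32_add(uint64_t dst, uint64_t src)
{
	uint32_t lo = (uint32_t)dst + (uint32_t)src; /* wraps modulo 2^32 */
	return (uint64_t)lo;                         /* upper half is zero */
}

/* Models an explicit zero-extension of a register (the mov32 variant). */
uint64_t zext32(uint64_t reg)
{
	return (uint64_t)(uint32_t)reg;
}
```

On a 32-bit target that keeps each 64-bit BPF register in a pair of
native registers, clearing the upper half costs an extra instruction,
which is why skipping it where possible matters.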

Note that it is possible for a JIT back-end to have partial hardware
support for zext. In that case, if verifier zext insertion is enabled,
it could lead to the insertion of unnecessary zext instructions. Such
instructions could be removed by creating a simple peephole inside the JIT
back-end: if one instruction has hardware support for zext and if the next
instruction is an explicit zext, then the latter can be skipped when doing
the code generation.
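
The peephole described above might look roughly like this.
``struct insn_info`` and ``count_emitted()`` are hypothetical and heavily
simplified: a real JIT back-end works on ``struct bpf_insn`` and emits
machine code instead of counting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of an instruction as seen by a JIT. */
struct insn_info {
	bool hw_zext;       /* native code already clears the upper 32 bits */
	bool explicit_zext; /* verifier-inserted zero-extension */
};

/* Returns how many instructions code would be generated for, skipping an
 * explicit zext that directly follows an instruction with hardware zext. */
size_t count_emitted(const struct insn_info *insns, size_t n)
{
	size_t emitted = 0;

	for (size_t i = 0; i < n; i++) {
		bool redundant = insns[i].explicit_zext && i > 0 &&
				 insns[i - 1].hw_zext;
		if (!redundant)
			emitted++;
	}
	return emitted;
}
```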

Q: Does BPF have a stable ABI?
------------------------------
A: YES. BPF instructions, arguments to BPF programs, the set of helper
functions and their arguments, and recognized return codes are all part
of the ABI. However, there is one specific exception for tracing programs
which use helpers like bpf_probe_read() to walk kernel internal
data structures and compile with kernel internal headers. Both of these
kernel internals are subject to change and can break with newer kernels
such that the program needs to be adapted accordingly.

Q: How much stack space does a BPF program use?
-----------------------------------------------
A: Currently all program types are limited to 512 bytes of stack
space, but the verifier computes the actual amount of stack used
and both the interpreter and most JITed code consume only the
necessary amount.

Q: Can BPF be offloaded to HW?
------------------------------
A: YES. BPF HW offload is supported by the NFP driver.

Q: Does the classic BPF interpreter still exist?
------------------------------------------------
A: NO. Classic BPF programs are converted into extended BPF instructions.

Q: Can BPF call arbitrary kernel functions?
-------------------------------------------
A: NO. BPF programs can only call a set of helper functions which
is defined for every program type.

Q: Can BPF overwrite arbitrary kernel memory?
---------------------------------------------
A: NO.

Tracing bpf programs can *read* arbitrary memory with bpf_probe_read()
and bpf_probe_read_str() helpers. Networking programs cannot read
arbitrary memory, since they don't have access to these helpers.
Programs can never read or write arbitrary memory directly.

Q: Can BPF overwrite arbitrary user memory?
-------------------------------------------
A: Sort-of.

Tracing BPF programs can overwrite the user memory
of the current task with bpf_probe_write_user(). Every time such a
program is loaded the kernel prints a warning message, so
this helper is only useful for experiments and prototypes.
Tracing BPF programs are root only.

Q: New functionality via kernel modules?
----------------------------------------
Q: Can BPF functionality such as new program or map types, new
helpers, etc. be added from kernel module code?

A: NO.