==============
BPF Design Q&A
==============

BPF's extensibility and its applicability to networking, tracing, and security
in the Linux kernel, together with several user space implementations of the
BPF virtual machine, have led to a number of misunderstandings about what BPF
actually is. This short Q&A is an attempt to address them and to outline the
direction in which BPF is heading long term.

.. contents::
    :local:
    :depth: 3

Questions and Answers
=====================

Q: Is BPF a generic instruction set similar to x64 and arm64?
-------------------------------------------------------------
A: NO.

Q: Is BPF a generic virtual machine?
-------------------------------------
A: NO.

BPF is generic instruction set *with* C calling convention.
-----------------------------------------------------------

Q: Why C calling convention was chosen?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A: Because BPF programs are designed to run in the Linux kernel,
which is written in C. Hence BPF defines an instruction set compatible
with the two most used architectures, x64 and arm64 (and takes into
consideration important quirks of other architectures), and
defines a calling convention that is compatible with the C calling
convention of the Linux kernel on those architectures.

Q: Can multiple return values be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. BPF allows only register R0 to be used as the return value.

Q: Can more than 5 function arguments be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. The BPF calling convention allows only registers R1-R5 to be used
as arguments. BPF is not a standalone instruction set
(unlike the x64 ISA, which allows msft, cdecl and other conventions).

Q: Can BPF programs access instruction pointer or return address?
-----------------------------------------------------------------
A: NO.

Q: Can BPF programs access stack pointer?
------------------------------------------
A: NO.

Only the frame pointer (register R10) is accessible.
From the compiler's point of view it is necessary to have a stack pointer.
For example, LLVM defines register R11 as the stack pointer in its
BPF backend, but it makes sure that generated code never uses it.

Q: Does C-calling convention diminish possible use cases?
-----------------------------------------------------------
A: YES.

The BPF design forces the addition of major functionality in the form
of kernel helper functions and kernel objects like BPF maps, with
seamless interoperability between them.
It lets the kernel call into
BPF programs, and programs call kernel helpers, with zero overhead,
as if all of them were native C code. That is particularly the case
for JITed BPF programs, which are indistinguishable from
native kernel C code.

Q: Does it mean that 'innovative' extensions to BPF code are disallowed?
------------------------------------------------------------------------
A: Soft yes.

At least for now, until the BPF core has support for
bpf-to-bpf calls, indirect calls, loops, global variables,
jump tables, read-only sections, and all other normal constructs
that C code can produce.

Q: Can loops be supported in a safe way?
----------------------------------------
A: It's not clear yet.

BPF developers are trying to find a way to
support bounded loops.

Q: What are the verifier limits?
--------------------------------
A: The only limit known to user space is BPF_MAXINSNS (4096).
It is the maximum number of instructions that an unprivileged bpf
program can have. The verifier has various internal limits,
such as the maximum number of instructions that can be explored during
program analysis. Currently, that limit is set to 1 million,
which essentially means that the largest program can consist
of 1 million NOP instructions.
There is a limit to the maximum number
of subsequent branches, a limit to the number of nested bpf-to-bpf
calls, a limit to the number of verifier states per instruction,
and a limit to the number of maps used by the program.
All these limits can be hit with a sufficiently complex program.
There are also non-numerical limits that can cause the program
to be rejected. The verifier used to recognize only pointer + constant
expressions. Now it can recognize pointer + bounded_register.
bpf_map_lookup_elem(key) had a requirement that 'key' must be
a pointer to the stack. Now, 'key' can be a pointer to a map value.
The verifier is steadily getting 'smarter'. The limits are
being removed. The only way to know that a program is going to
be accepted by the verifier is to try to load it.
The bpf development process guarantees that future kernel
versions will accept all bpf programs that were accepted by
earlier versions.


Instruction level questions
---------------------------

Q: LD_ABS and LD_IND instructions vs C code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Q: How come the LD_ABS and LD_IND instructions are present in BPF whereas
C code cannot express them and has to use builtin intrinsics?

A: This is an artifact of compatibility with classic BPF. Modern
networking code in BPF performs better without them.
See 'direct packet access'.
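
As a rough illustration of what direct packet access looks like from C, the
sketch below reads the EtherType of a frame through plain pointers after an
explicit bounds check, instead of going through a load intrinsic. It is a
user-space model only: ``struct pkt_view`` and ``read_ethertype()`` are
hypothetical names invented for this example, and in a real BPF program the
two pointers would come from the program context rather than from the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a packet; not a kernel structure. */
struct pkt_view {
    const uint8_t *data;      /* first byte of the packet */
    const uint8_t *data_end;  /* one past the last byte of the packet */
};

/* Return the 16-bit EtherType in host order, or -1 if the packet
 * is too short to contain an Ethernet header. */
static int read_ethertype(const struct pkt_view *p)
{
    const size_t eth_hlen = 14; /* 6 dst MAC + 6 src MAC + 2 type bytes */

    /* The bounds check comes before any load from the packet.
     * BPF C programs commonly spell this as "data + len > data_end";
     * the length comparison is used here to stay within portable C. */
    if ((size_t)(p->data_end - p->data) < eth_hlen)
        return -1;

    /* Plain loads, with no LD_ABS/LD_IND style intrinsic involved. */
    return (p->data[12] << 8) | p->data[13];
}
```

The point of the pattern is that every load is dominated by a check against
the end of the packet, which is the kind of pointer + bounded offset
reasoning the verifier answer above describes.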

Q: BPF instructions mapping not one-to-one to native CPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: It seems that not all BPF instructions map one-to-one to native CPU
instructions. For example, why are BPF_JNE and other compare-and-jump
instructions not CPU-like?

A: This was necessary to avoid introducing flags into the ISA, which are
impossible to make generic and efficient across CPU architectures.

Q: Why BPF_DIV instruction doesn't map to x64 div?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because picking a one-to-one relationship to x64 would have made
it more complicated to support on arm64 and other archs. It also
needs a div-by-zero runtime check.

Q: Why there is no BPF_SDIV for signed divide operation?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because it would be rarely used. LLVM errors out in such a case and
prints a suggestion to use unsigned divide instead.

Q: Why BPF has implicit prologue and epilogue?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because architectures like sparc have register windows and, in general,
there are enough subtle differences between architectures that naively
storing the return address on the stack won't work. Another reason is that
BPF has to be safe from division by zero (and the legacy exception path
of the LD_ABS insn). Those instructions need to invoke the epilogue and
return implicitly.
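
The division answers above can be illustrated in plain C. The sketch below is
not kernel or LLVM code. It shows one way a signed 64-bit divide could be
expressed using only an unsigned divide plus an explicit zero-divisor check.
The name ``sdiv_via_udiv()`` and the choice to return 0 for a zero divisor are
assumptions made for this example, not behaviour stated by this document.

```c
#include <stdint.h>

/* Signed division built from a single unsigned divide.
 * Assumption for this sketch: a zero divisor yields 0. */
static int64_t sdiv_via_udiv(int64_t a, int64_t b)
{
    /* Explicit runtime check, since the divide itself must never trap. */
    if (b == 0)
        return 0;

    /* Take magnitudes in unsigned arithmetic so that INT64_MIN
     * does not overflow when negated. */
    uint64_t ua = a < 0 ? 0 - (uint64_t)a : (uint64_t)a;
    uint64_t ub = b < 0 ? 0 - (uint64_t)b : (uint64_t)b;

    uint64_t q = ua / ub; /* the only divide, and it is unsigned */

    /* The quotient is negative when the operand signs differ. */
    if ((a < 0) != (b < 0))
        q = 0 - q;

    /* Relies on the usual two's complement conversion back to signed. */
    return (int64_t)q;
}
```

When both operands are known to be non-negative, the sign handling disappears
and a plain unsigned divide is enough, which is the simpler rewrite that the
LLVM suggestion points towards.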

Q: Why BPF_JLT and BPF_JLE instructions were not introduced in the beginning?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because classic BPF didn't have them and the BPF authors felt that a
compiler workaround would be acceptable. It turned out that programs lose
performance due to the lack of these compare instructions, so they were added.
These two instructions are a perfect example of what kind of new BPF
instructions are acceptable and can be added in the future.
Both already had equivalent instructions in native CPUs.
New instructions that don't have a one-to-one mapping to HW instructions
will not be accepted.

Q: BPF 32-bit subregister requirements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: BPF 32-bit subregisters have a requirement to zero the upper 32 bits of BPF
registers, which makes BPF an inefficient virtual machine for 32-bit
CPU architectures and 32-bit HW accelerators. Can true 32-bit registers
be added to BPF in the future?

A: NO.

But some optimizations on zero-ing the upper 32 bits for BPF registers are
available, and can be leveraged to improve the performance of JITed BPF
programs for 32-bit architectures.

Starting with version 7, LLVM is able to generate instructions that operate
on 32-bit subregisters, provided the option -mattr=+alu32 is passed for
compiling a program.
Furthermore, the verifier can now mark the
instructions for which zero-ing the upper bits of the destination register
is required, and insert an explicit zero-extension (zext) instruction
(a mov32 variant). This means that for architectures without zext hardware
support, the JIT back-ends do not need to clear the upper bits for
subregisters written by alu32 instructions or narrow loads. Instead, the
back-ends simply need to support code generation for that mov32 variant,
and to override bpf_jit_needs_zext() to make it return "true" (in order to
enable zext insertion in the verifier).

Note that it is possible for a JIT back-end to have partial hardware
support for zext. In that case, if verifier zext insertion is enabled,
it could lead to the insertion of unnecessary zext instructions. Such
instructions could be removed by creating a simple peephole inside the JIT
back-end: if one instruction has hardware support for zext and if the next
instruction is an explicit zext, then the latter can be skipped when doing
the code generation.

Q: Does BPF have a stable ABI?
------------------------------
A: YES. BPF instructions, arguments to BPF programs, the set of helper
functions and their arguments, and recognized return codes are all part
of the ABI. However, there is one specific exception: tracing programs
which use helpers like bpf_probe_read() to walk kernel internal
data structures and are compiled with kernel internal headers.
Both of these
kernel internals are subject to change and can break with newer kernels
such that the program needs to be adapted accordingly.

Q: How much stack space a BPF program uses?
-------------------------------------------
A: Currently all program types are limited to 512 bytes of stack
space, but the verifier computes the actual amount of stack used,
and both the interpreter and most JITed code consume only the necessary amount.

Q: Can BPF be offloaded to HW?
------------------------------
A: YES. BPF HW offload is supported by the NFP driver.

Q: Does classic BPF interpreter still exist?
--------------------------------------------
A: NO. Classic BPF programs are converted into extended BPF instructions.

Q: Can BPF call arbitrary kernel functions?
-------------------------------------------
A: NO. BPF programs can only call a set of helper functions, which
is defined for every program type.

Q: Can BPF overwrite arbitrary kernel memory?
---------------------------------------------
A: NO.

Tracing bpf programs can *read* arbitrary memory with the bpf_probe_read()
and bpf_probe_read_str() helpers. Networking programs cannot read
arbitrary memory, since they don't have access to these helpers.
Programs can never read or write arbitrary memory directly.

Q: Can BPF overwrite arbitrary user memory?
-------------------------------------------
A: Sort-of.

Tracing BPF programs can overwrite the user memory
of the current task with bpf_probe_write_user(). Every time such a
program is loaded, the kernel will print a warning message, so
this helper is only useful for experiments and prototypes.
Tracing BPF programs are root only.

Q: New functionality via kernel modules?
----------------------------------------
Q: Can BPF functionality such as new program or map types, new
helpers, etc. be added from kernel module code?

A: NO.