==============
BPF Design Q&A
==============

BPF extensibility and applicability to networking, tracing, security
in the Linux kernel and several user space implementations of the BPF
virtual machine led to a number of misunderstandings about what BPF
actually is. This short Q&A is an attempt to address that and outline
a direction of where BPF is heading long term.

.. contents::
    :local:
    :depth: 3

Questions and Answers
=====================

Q: Is BPF a generic instruction set similar to x64 and arm64?
-------------------------------------------------------------
A: NO.

Q: Is BPF a generic virtual machine?
-------------------------------------
A: NO.

BPF is generic instruction set *with* C calling convention.
-----------------------------------------------------------

Q: Why C calling convention was chosen?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A: Because BPF programs are designed to run in the Linux kernel,
which is written in C. Hence BPF defines an instruction set compatible
with the two most used architectures, x64 and arm64 (and takes into
consideration important quirks of other architectures), and
defines a calling convention that is compatible with the C calling
convention of the Linux kernel on those architectures.

Q: Can multiple return values be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. BPF allows only register R0 to be used as return value.

Q: Can more than 5 function arguments be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. BPF calling convention only allows registers R1-R5 to be used
as arguments. BPF is not a standalone instruction set
(unlike the x64 ISA, which allows msft, cdecl and other conventions).
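
One possible way to live within the five-argument limit is to pack extra
values into a structure and pass a pointer to it in a single register.
The sketch below models that in plain user-space C. ``struct sum_args`` and
``sum_helper()`` are hypothetical names used only for illustration, not
kernel APIs; the five 64-bit parameters stand in for R1-R5 and the 64-bit
return value stands in for R0.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical argument block: more values than fit in R1-R5. */
struct sum_args {
	uint64_t v[7];
};

/*
 * Hypothetical helper shaped like the BPF calling convention:
 * five 64-bit arguments (R1-R5) and one 64-bit result (R0).
 * r1 carries a pointer to the argument block and r2 its size;
 * r3-r5 are unused in this example.
 */
static uint64_t sum_helper(uint64_t r1, uint64_t r2, uint64_t r3,
			   uint64_t r4, uint64_t r5)
{
	const struct sum_args *args = (const struct sum_args *)(uintptr_t)r1;
	uint64_t sum = 0;
	size_t i;

	(void)r3;
	(void)r4;
	(void)r5;

	/* Reject a block of unexpected size instead of reading past it. */
	if (r2 != sizeof(*args))
		return 0;

	for (i = 0; i < sizeof(args->v) / sizeof(args->v[0]); i++)
		sum += args->v[i];
	return sum;
}
```

A caller would fill a ``struct sum_args`` on its own stack and pass its
address and size as the first two arguments, leaving the rest zero.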

Q: Can BPF programs access instruction pointer or return address?
-----------------------------------------------------------------
A: NO.

Q: Can BPF programs access stack pointer?
------------------------------------------
A: NO.

Only frame pointer (register R10) is accessible.
From the compiler's point of view it's necessary to have a stack pointer.
For example, LLVM defines register R11 as stack pointer in its
BPF backend, but it makes sure that generated code never uses it.

Q: Does C-calling convention diminish possible use cases?
-----------------------------------------------------------
A: YES.

BPF design forces addition of major functionality in the form
of kernel helper functions and kernel objects like BPF maps with
seamless interoperability between them. It lets the kernel call into
BPF programs and programs call kernel helpers with zero overhead,
as if all of them were native C code. That is particularly the case
for JITed BPF programs that are indistinguishable from
native kernel C code.

Q: Does it mean that 'innovative' extensions to BPF code are disallowed?
------------------------------------------------------------------------
A: Soft yes.

At least for now, until BPF core has support for
bpf-to-bpf calls, indirect calls, loops, global variables,
jump tables, read-only sections, and all other normal constructs
that C code can produce.

Q: Can loops be supported in a safe way?
----------------------------------------
A: It's not clear yet.

BPF developers are trying to find a way to
support bounded loops.

Q: What are the verifier limits?
--------------------------------
A: The only limit known to the user space is BPF_MAXINSNS (4096).
It's the maximum number of instructions that an unprivileged bpf
program can have. The verifier has various internal limits,
such as the maximum number of instructions that can be explored during
program analysis. Currently, that limit is set to 1 million,
which essentially means that the largest program can consist
of 1 million NOP instructions. There is a limit to the maximum number
of subsequent branches, a limit to the number of nested bpf-to-bpf
calls, a limit to the number of the verifier states per instruction,
and a limit to the number of maps used by the program.
All these limits can be hit with a sufficiently complex program.

There are also non-numerical limits that can cause the program
to be rejected. The verifier used to recognize only pointer + constant
expressions. Now it can recognize pointer + bounded_register.
bpf_map_lookup_elem(key) had a requirement that 'key' must be
a pointer to the stack. Now, 'key' can be a pointer to a map value.
The verifier is steadily getting 'smarter'. The limits are
being removed. The only way to know that the program is going to
be accepted by the verifier is to try to load it.
The bpf development process guarantees that the future kernel
versions will accept all bpf programs that were accepted by
the earlier versions.
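
The pointer + bounded_register case can be pictured with ordinary C.
The sketch below is a user-space model, not BPF code, and ``read_byte()``
with its parameters is a hypothetical example. The point is only that the
index is compared against a known limit before it is added to the pointer,
which is the kind of check that gives a verifier a bound to track.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical example: return buf[idx], or -1 when idx is out of range.
 * The comparison bounds idx before it is added to the pointer.
 */
static int read_byte(const uint8_t *buf, size_t len, uint64_t idx)
{
	if (idx >= len)
		return -1;	/* index is not within bounds: refuse the access */
	return buf[idx];	/* pointer + bounded value */
}
```

Without the comparison, the index would be unbounded and a checker could not
prove that the access stays inside the buffer.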


Instruction level questions
---------------------------

Q: LD_ABS and LD_IND instructions vs C code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Q: How come LD_ABS and LD_IND instructions are present in BPF whereas
C code cannot express them and has to use builtin intrinsics?

A: This is an artifact of compatibility with classic BPF. Modern
networking code in BPF performs better without them.
See 'direct packet access'.

Q: BPF instructions mapping not one-to-one to native CPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: It seems not all BPF instructions are one-to-one to native CPU.
For example, why are BPF_JNE and other compare-and-jump instructions
not CPU-like?

A: This was necessary to avoid introducing flags into the ISA, which are
impossible to make generic and efficient across CPU architectures.

Q: Why BPF_DIV instruction doesn't map to x64 div?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because a one-to-one relationship to x64 would have made
it more complicated to support on arm64 and other archs. Also it
needs a div-by-zero runtime check.
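
The runtime check can be modeled in user-space C. The sketch below assumes
the convention that an unsigned division by zero produces zero rather than
raising a CPU exception, as a bare x64 div would. ``bpf_div64_model()`` is a
hypothetical name, and a real interpreter or JIT may handle the zero case
differently (for example by ending the program).

```c
#include <stdint.h>

/*
 * Hypothetical model of an unsigned 64-bit BPF_DIV.
 * Assumption: a zero divisor yields 0 instead of raising a CPU exception.
 */
static uint64_t bpf_div64_model(uint64_t dst, uint64_t src)
{
	if (src == 0)
		return 0;	/* the runtime check that a bare x64 div lacks */
	return dst / src;
}
```

Whatever the chosen outcome for a zero divisor, the check has to exist
somewhere, which is why the instruction cannot be a plain alias for the
native divide.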

Q: Why there is no BPF_SDIV for signed divide operation?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because it would be rarely used. LLVM errors out in such a case and
prints a suggestion to use unsigned divide instead.

Q: Why BPF has implicit prologue and epilogue?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because architectures like sparc have register windows and in general
there are enough subtle differences between architectures, so naively
storing the return address on the stack won't work. Another reason is that
BPF has to be safe from division by zero (and the legacy exception path
of the LD_ABS insn). Those instructions need to invoke the epilogue and
return implicitly.

Q: Why BPF_JLT and BPF_JLE instructions were not introduced in the beginning?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because classic BPF didn't have them and BPF authors felt that a compiler
workaround would be acceptable. It turned out that programs lose performance
due to the lack of these compare instructions, and they were added.
These two instructions are a perfect example of what kind of new BPF
instructions are acceptable and can be added in the future.
These two already had equivalent instructions in native CPUs.
New instructions that don't have one-to-one mapping to HW instructions
will not be accepted.

Q: BPF 32-bit subregister requirements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: BPF 32-bit subregisters have a requirement to zero the upper 32 bits of
BPF registers, which makes BPF an inefficient virtual machine for 32-bit
CPU architectures and 32-bit HW accelerators. Can true 32-bit registers
be added to BPF in the future?

A: NO.

But some optimizations on zero-ing the upper 32 bits for BPF registers are
available, and can be leveraged to improve the performance of JITed BPF
programs for 32-bit architectures.

Starting with version 7, LLVM is able to generate instructions that operate
on 32-bit subregisters, provided the option -mattr=+alu32 is passed for
compiling a program. Furthermore, the verifier can now mark the
instructions for which zero-ing the upper bits of the destination register
is required, and insert an explicit zero-extension (zext) instruction
(a mov32 variant). This means that for architectures without zext hardware
support, the JIT back-ends do not need to clear the upper bits for
subregisters written by alu32 instructions or narrow loads. Instead, the
back-ends simply need to support code generation for that mov32 variant,
and to override bpf_jit_needs_zext() to make it return "true" (in order to
enable zext insertion in the verifier).
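
The zero-extension requirement itself can be illustrated in user-space C.
The sketch below models a 32-bit add on a 64-bit register: the arithmetic is
done on the low 32 bits only, and the upper 32 bits of the result are
cleared. ``alu32_add_model()`` is a hypothetical name for illustration, not
a kernel function.

```c
#include <stdint.h>

/*
 * Hypothetical model of a 32-bit ALU add on a 64-bit BPF register.
 * Only the low 32 bits take part; the result is zero-extended to 64 bits.
 */
static uint64_t alu32_add_model(uint64_t dst, uint64_t src)
{
	uint32_t lo = (uint32_t)dst + (uint32_t)src;	/* wraps modulo 2^32 */

	return (uint64_t)lo;	/* upper 32 bits are zero */
}
```

On hardware whose native 32-bit operations leave the upper half untouched,
the final step is the extra work that the explicit zext instruction
represents.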

Note that it is possible for a JIT back-end to have partial hardware
support for zext. In that case, if verifier zext insertion is enabled,
it could lead to the insertion of unnecessary zext instructions. Such
instructions could be removed by creating a simple peephole inside the JIT
back-end: if one instruction has hardware support for zext and if the next
instruction is an explicit zext, then the latter can be skipped when doing
the code generation.
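
The peephole described above could look roughly like the following. This is
a user-space sketch, not code from any JIT: ``struct insn_model``, its
``hw_zext`` and ``is_zext`` flags, and ``count_emitted()`` are all
hypothetical stand-ins for whatever a real back-end uses to classify
instructions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one instruction as seen by a JIT. */
struct insn_model {
	bool hw_zext;	/* native code for this insn already clears the upper bits */
	bool is_zext;	/* this insn is an explicit zext inserted by the verifier */
};

/*
 * Walk the program and count the instructions that need code generation.
 * An explicit zext directly after an insn with hardware zext is skipped.
 */
static size_t count_emitted(const struct insn_model *prog, size_t len)
{
	size_t emitted = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		if (prog[i].is_zext && i > 0 && prog[i - 1].hw_zext)
			continue;	/* redundant: the previous insn did the zext */
		emitted++;
	}
	return emitted;
}
```

The model leaves out register tracking; a real peephole would also need to
confirm that both instructions target the same register.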

Q: Does BPF have a stable ABI?
------------------------------
A: YES. BPF instructions, arguments to BPF programs, set of helper
functions and their arguments, recognized return codes are all part
of ABI. However there is one specific exception for tracing programs
which are using helpers like bpf_probe_read() to walk kernel internal
data structures and compile with kernel internal headers. Both of these
kernel internals are subject to change and can break with newer kernels
such that the program needs to be adapted accordingly.

Q: How much stack space a BPF program uses?
-------------------------------------------
A: Currently all program types are limited to 512 bytes of stack
space, but the verifier computes the actual amount of stack used
and both the interpreter and most JITed code consume only the
necessary amount.

Q: Can BPF be offloaded to HW?
------------------------------
A: YES. BPF HW offload is supported by the NFP driver.

Q: Does classic BPF interpreter still exist?
--------------------------------------------
A: NO. Classic BPF programs are converted into extended BPF instructions.

Q: Can BPF call arbitrary kernel functions?
-------------------------------------------
A: NO. BPF programs can only call a set of helper functions which
is defined for every program type.

Q: Can BPF overwrite arbitrary kernel memory?
---------------------------------------------
A: NO.

Tracing bpf programs can *read* arbitrary memory with bpf_probe_read()
and bpf_probe_read_str() helpers. Networking programs cannot read
arbitrary memory, since they don't have access to these helpers.
Programs can never read or write arbitrary memory directly.

Q: Can BPF overwrite arbitrary user memory?
-------------------------------------------
A: Sort-of.

Tracing BPF programs can overwrite the user memory
of the current task with bpf_probe_write_user(). Every time such a
program is loaded the kernel will print a warning message, so
this helper is only useful for experiments and prototypes.
Tracing BPF programs are root only.

Q: New functionality via kernel modules?
----------------------------------------
Q: Can BPF functionality such as new program or map types, new
helpers, etc. be added from kernel module code?

A: NO.