xref: /OK3568_Linux_fs/kernel/Documentation/networking/fib_trie.rst (revision 4882a59341e53eb6f0b4789bf948001014eff981)
1*4882a593Smuzhiyun.. SPDX-License-Identifier: GPL-2.0
2*4882a593Smuzhiyun
3*4882a593Smuzhiyun============================
4*4882a593SmuzhiyunLC-trie implementation notes
5*4882a593Smuzhiyun============================
6*4882a593Smuzhiyun
7*4882a593SmuzhiyunNode types
8*4882a593Smuzhiyun----------
9*4882a593Smuzhiyunleaf
10*4882a593Smuzhiyun	An end node with data. This has a copy of the relevant key, along
11*4882a593Smuzhiyun	with 'hlist' with routing table entries sorted by prefix length.
12*4882a593Smuzhiyun	See struct leaf and struct leaf_info.
13*4882a593Smuzhiyun
14*4882a593Smuzhiyuntrie node or tnode
15*4882a593Smuzhiyun	An internal node, holding an array of child (leaf or tnode) pointers,
16*4882a593Smuzhiyun	indexed	through a subset of the key. See Level Compression.
17*4882a593Smuzhiyun
18*4882a593SmuzhiyunA few concepts explained
19*4882a593Smuzhiyun------------------------
20*4882a593SmuzhiyunBits (tnode)
21*4882a593Smuzhiyun	The number of bits in the key segment used for indexing into the
22*4882a593Smuzhiyun	child array - the "child index". See Level Compression.
23*4882a593Smuzhiyun
24*4882a593SmuzhiyunPos (tnode)
25*4882a593Smuzhiyun	The position (in the key) of the key segment used for indexing into
26*4882a593Smuzhiyun	the child array. See Path Compression.
27*4882a593Smuzhiyun
28*4882a593SmuzhiyunPath Compression / skipped bits
29*4882a593Smuzhiyun	Any given tnode is linked to from the child array of its parent, using
30*4882a593Smuzhiyun	a segment of the key specified by the parent's "pos" and "bits"
31*4882a593Smuzhiyun	In certain cases, this tnode's own "pos" will not be immediately
32*4882a593Smuzhiyun	adjacent to the parent (pos+bits), but there will be some bits
33*4882a593Smuzhiyun	in the key skipped over because they represent a single path with no
34*4882a593Smuzhiyun	deviations. These "skipped bits" constitute Path Compression.
35*4882a593Smuzhiyun	Note that the search algorithm will simply skip over these bits when
36*4882a593Smuzhiyun	searching, making it necessary to save the keys in the leaves to
37*4882a593Smuzhiyun	verify that they actually do match the key we are searching for.
38*4882a593Smuzhiyun
39*4882a593SmuzhiyunLevel Compression / child arrays
40*4882a593Smuzhiyun	the trie is kept level balanced moving, under certain conditions, the
41*4882a593Smuzhiyun	children of a full child (see "full_children") up one level, so that
42*4882a593Smuzhiyun	instead of a pure binary tree, each internal node ("tnode") may
43*4882a593Smuzhiyun	contain an arbitrarily large array of links to several children.
44*4882a593Smuzhiyun	Conversely, a tnode with a mostly empty	child array (see empty_children)
45*4882a593Smuzhiyun	may be "halved", having some of its children moved downwards one level,
46*4882a593Smuzhiyun	in order to avoid ever-increasing child arrays.
47*4882a593Smuzhiyun
48*4882a593Smuzhiyunempty_children
49*4882a593Smuzhiyun	the number of positions in the child array of a given tnode that are
50*4882a593Smuzhiyun	NULL.
51*4882a593Smuzhiyun
52*4882a593Smuzhiyunfull_children
53*4882a593Smuzhiyun	the number of children of a given tnode that aren't path compressed.
54*4882a593Smuzhiyun	(in other words, they aren't NULL or leaves and their "pos" is equal
55*4882a593Smuzhiyun	to this	tnode's "pos"+"bits").
56*4882a593Smuzhiyun
57*4882a593Smuzhiyun	(The word "full" here is used more in the sense of "complete" than
58*4882a593Smuzhiyun	as the opposite of "empty", which might be a tad confusing.)
59*4882a593Smuzhiyun
60*4882a593SmuzhiyunComments
61*4882a593Smuzhiyun---------
62*4882a593Smuzhiyun
63*4882a593SmuzhiyunWe have tried to keep the structure of the code as close to fib_hash as
64*4882a593Smuzhiyunpossible to allow verification and help up reviewing.
65*4882a593Smuzhiyun
66*4882a593Smuzhiyunfib_find_node()
67*4882a593Smuzhiyun	A good start for understanding this code. This function implements a
68*4882a593Smuzhiyun	straightforward trie lookup.
69*4882a593Smuzhiyun
70*4882a593Smuzhiyunfib_insert_node()
71*4882a593Smuzhiyun	Inserts a new leaf node in the trie. This is bit more complicated than
72*4882a593Smuzhiyun	fib_find_node(). Inserting a new node means we might have to run the
73*4882a593Smuzhiyun	level compression algorithm on part of the trie.
74*4882a593Smuzhiyun
75*4882a593Smuzhiyuntrie_leaf_remove()
76*4882a593Smuzhiyun	Looks up a key, deletes it and runs the level compression algorithm.
77*4882a593Smuzhiyun
78*4882a593Smuzhiyuntrie_rebalance()
79*4882a593Smuzhiyun	The key function for the dynamic trie after any change in the trie
80*4882a593Smuzhiyun	it is run to optimize and reorganize. It will walk the trie upwards
81*4882a593Smuzhiyun	towards the root from a given tnode, doing a resize() at each step
82*4882a593Smuzhiyun	to implement level compression.
83*4882a593Smuzhiyun
84*4882a593Smuzhiyunresize()
85*4882a593Smuzhiyun	Analyzes a tnode and optimizes the child array size by either inflating
86*4882a593Smuzhiyun	or shrinking it repeatedly until it fulfills the criteria for optimal
87*4882a593Smuzhiyun	level compression. This part follows the original paper pretty closely
88*4882a593Smuzhiyun	and there may be some room for experimentation here.
89*4882a593Smuzhiyun
90*4882a593Smuzhiyuninflate()
91*4882a593Smuzhiyun	Doubles the size of the child array within a tnode. Used by resize().
92*4882a593Smuzhiyun
93*4882a593Smuzhiyunhalve()
94*4882a593Smuzhiyun	Halves the size of the child array within a tnode - the inverse of
95*4882a593Smuzhiyun	inflate(). Used by resize();
96*4882a593Smuzhiyun
97*4882a593Smuzhiyunfn_trie_insert(), fn_trie_delete(), fn_trie_select_default()
98*4882a593Smuzhiyun	The route manipulation functions. Should conform pretty closely to the
99*4882a593Smuzhiyun	corresponding functions in fib_hash.
100*4882a593Smuzhiyun
101*4882a593Smuzhiyunfn_trie_flush()
102*4882a593Smuzhiyun	This walks the full trie (using nextleaf()) and searches for empty
103*4882a593Smuzhiyun	leaves which have to be removed.
104*4882a593Smuzhiyun
105*4882a593Smuzhiyunfn_trie_dump()
106*4882a593Smuzhiyun	Dumps the routing table ordered by prefix length. This is somewhat
107*4882a593Smuzhiyun	slower than the corresponding fib_hash function, as we have to walk the
108*4882a593Smuzhiyun	entire trie for each prefix length. In comparison, fib_hash is organized
109*4882a593Smuzhiyun	as one "zone"/hash per prefix length.
110*4882a593Smuzhiyun
111*4882a593SmuzhiyunLocking
112*4882a593Smuzhiyun-------
113*4882a593Smuzhiyun
114*4882a593Smuzhiyunfib_lock is used for an RW-lock in the same way that this is done in fib_hash.
115*4882a593SmuzhiyunHowever, the functions are somewhat separated for other possible locking
116*4882a593Smuzhiyunscenarios. It might conceivably be possible to run trie_rebalance via RCU
117*4882a593Smuzhiyunto avoid read_lock in the fn_trie_lookup() function.
118*4882a593Smuzhiyun
119*4882a593SmuzhiyunMain lookup mechanism
120*4882a593Smuzhiyun---------------------
121*4882a593Smuzhiyunfn_trie_lookup() is the main lookup function.
122*4882a593Smuzhiyun
123*4882a593SmuzhiyunThe lookup is in its simplest form just like fib_find_node(). We descend the
124*4882a593Smuzhiyuntrie, key segment by key segment, until we find a leaf. check_leaf() does
125*4882a593Smuzhiyunthe fib_semantic_match in the leaf's sorted prefix hlist.
126*4882a593Smuzhiyun
127*4882a593SmuzhiyunIf we find a match, we are done.
128*4882a593Smuzhiyun
129*4882a593SmuzhiyunIf we don't find a match, we enter prefix matching mode. The prefix length,
130*4882a593Smuzhiyunstarting out at the same as the key length, is reduced one step at a time,
131*4882a593Smuzhiyunand we backtrack upwards through the trie trying to find a longest matching
132*4882a593Smuzhiyunprefix. The goal is always to reach a leaf and get a positive result from the
133*4882a593Smuzhiyunfib_semantic_match mechanism.
134*4882a593Smuzhiyun
135*4882a593SmuzhiyunInside each tnode, the search for longest matching prefix consists of searching
136*4882a593Smuzhiyunthrough the child array, chopping off (zeroing) the least significant "1" of
137*4882a593Smuzhiyunthe child index until we find a match or the child index consists of nothing but
138*4882a593Smuzhiyunzeros.
139*4882a593Smuzhiyun
140*4882a593SmuzhiyunAt this point we backtrack (t->stats.backtrack++) up the trie, continuing to
141*4882a593Smuzhiyunchop off part of the key in order to find the longest matching prefix.
142*4882a593Smuzhiyun
143*4882a593SmuzhiyunAt this point we will repeatedly descend subtries to look for a match, and there
144*4882a593Smuzhiyunare some optimizations available that can provide us with "shortcuts" to avoid
145*4882a593Smuzhiyundescending into dead ends. Look for "HL_OPTIMIZE" sections in the code.
146*4882a593Smuzhiyun
147*4882a593SmuzhiyunTo alleviate any doubts about the correctness of the route selection process,
148*4882a593Smuzhiyuna new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which
149*4882a593Smuzhiyungives userland access to fib_lookup().
150