.. SPDX-License-Identifier: GPL-2.0

============
ORC unwinder
============

Overview
========

The kernel CONFIG_UNWINDER_ORC option enables the ORC unwinder, which is
similar in concept to a DWARF unwinder.  The difference is that the
format of the ORC data is much simpler than DWARF, which in turn allows
the ORC unwinder to be much simpler and faster.

The ORC data consists of unwind tables which are generated by objtool.
They contain out-of-band data which is used by the in-kernel ORC
unwinder.  Objtool generates the ORC data by first doing compile-time
stack metadata validation (CONFIG_STACK_VALIDATION).  After analyzing
all the code paths of a .o file, it determines information about the
stack state at each instruction address in the file and outputs that
information to the .orc_unwind and .orc_unwind_ip sections.

The per-object ORC sections are combined at link time and are sorted and
post-processed at boot time.  The unwinder uses the resulting data to
correlate instruction addresses with their stack states at run time.
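
To make "stack state" more concrete, here is a minimal sketch of a single
unwind step on x86-64.  It is an illustration only, not the kernel's
code: the struct layouts, the ``toy_`` names and the simulated stack
array are all assumptions made for this example.  Given the stack state
recorded for the current instruction address, the step recovers the
caller's stack pointer, return address and (if it was saved) frame
pointer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stack-state record; not the kernel's orc_entry. */
struct toy_stack_state {
	int16_t sp_offset;	/* caller's SP = current SP + sp_offset */
	int16_t bp_offset;	/* saved frame pointer is at caller's SP + bp_offset */
	bool bp_saved;		/* false: the frame pointer register is untouched */
};

/* Register values the toy unwinder tracks from frame to frame. */
struct toy_regs {
	uint64_t ip, sp, bp;
};

/* Read a 64-bit word from the simulated stack; stack[0] lives at stack_base. */
static uint64_t toy_read(const uint64_t *stack, uint64_t stack_base,
			 uint64_t addr)
{
	return stack[(addr - stack_base) / sizeof(uint64_t)];
}

/* Advance regs from the current frame to its caller's frame. */
void toy_unwind_step(struct toy_regs *regs, const struct toy_stack_state *st,
		     const uint64_t *stack, uint64_t stack_base)
{
	uint64_t prev_sp = regs->sp + st->sp_offset;

	/* On x86-64 the return address sits just below the caller's SP. */
	regs->ip = toy_read(stack, stack_base, prev_sp - 8);

	/* Only reload the frame pointer if this code path saved it. */
	if (st->bp_saved)
		regs->bp = toy_read(stack, stack_base, prev_sp + st->bp_offset);

	regs->sp = prev_sp;
}
```

For example, right after a ``push %rbp`` in a function prologue the
state would be ``sp_offset = 16`` and ``bp_offset = -16``: the saved
frame pointer is at the current stack pointer and the return address is
one word above it.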


ORC vs frame pointers
=====================

With frame pointers enabled, GCC adds instrumentation code to every
function in the kernel.  The kernel's .text size increases by about
3.2%, resulting in a broad kernel-wide slowdown.  Measurements by Mel
Gorman [1]_ have shown a slowdown of 5-10% for some workloads.

In contrast, the ORC unwinder has no effect on text size or runtime
performance, because the debuginfo is out of band.  So if you disable
frame pointers and enable the ORC unwinder, you get a nice performance
improvement across the board, and still have reliable stack traces.

Ingo Molnar says:

  "Note that it's not just a performance improvement, but also an
  instruction cache locality improvement: 3.2% .text savings almost
  directly transform into a similarly sized reduction in cache
  footprint. That can transform to even higher speedups for workloads
  whose cache locality is borderline."

Another benefit of ORC compared to frame pointers is that it can
reliably unwind across interrupts and exceptions.  Frame pointer based
unwinds can sometimes skip the caller of the interrupted function, if it
was a leaf function or if the interrupt hit before the frame pointer was
saved.

The main disadvantage of the ORC unwinder compared to frame pointers is
that it needs more memory to store the ORC unwind tables: roughly 2-4MB
depending on the kernel config.


ORC vs DWARF
============

ORC debuginfo's advantage over DWARF itself is that it's much simpler.
It gets rid of the complex DWARF CFI state machine and also gets rid of
the tracking of unnecessary registers.  This allows the unwinder to be
much simpler, meaning fewer bugs, which is especially important for
mission critical oops code.

The simpler debuginfo format also enables the unwinder to be much faster
than DWARF, which is important for perf and lockdep.  In a basic
performance test by Jiri Slaby [2]_, the ORC unwinder was about 20x
faster than an out-of-tree DWARF unwinder.  (Note: That measurement was
taken before some performance tweaks were added, which doubled
performance, so the speedup over DWARF may be closer to 40x.)

The ORC data format does have a few downsides compared to DWARF.  ORC
unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel)
than DWARF-based eh_frame tables.

Another potential downside is that, as GCC evolves, it's conceivable
that the ORC data may end up being *too* simple to describe the state of
the stack for certain optimizations.  But IMO this is unlikely because
GCC saves the frame pointer for any unusual stack adjustments it does,
so I suspect we'll really only ever need to keep track of the stack
pointer and the frame pointer between call frames.  But even if we do
end up having to track all the registers DWARF tracks, at least we will
still be able to control the format, e.g. no complex state machines.


ORC unwind table generation
===========================

The ORC data is generated by objtool.  With the existing compile-time
stack metadata validation feature, objtool already follows all code
paths, and so it already has all the information it needs to be able to
generate ORC data from scratch.  So it's an easy step to go from stack
validation to ORC data generation.

It should be possible to instead generate the ORC data with a simple
tool which converts DWARF to ORC data.  However, such a solution would
be incomplete due to the kernel's extensive use of asm, inline asm, and
special sections like exception tables.

That could be rectified by manually annotating those special code paths
using GNU assembler .cfi annotations in .S files, and homegrown
annotations for inline asm in .c files.  But asm annotations were tried
in the past and were found to be unmaintainable.  They were often
incorrect/incomplete and made the code harder to read and keep updated.
And based on looking at glibc code, annotating inline asm in .c files
might be even worse.

Objtool still needs a few annotations, but only in code which does
unusual things to the stack like entry code.  And even then, far fewer
annotations are needed than what DWARF would need, so they're much more
maintainable than DWARF CFI annotations.

So the advantages of using objtool to generate ORC data are that it
gives more accurate debuginfo, with very few annotations.  It also
insulates the kernel from toolchain bugs which can be very painful to
deal with in the kernel since we often have to work around issues in
older versions of the toolchain for years.

The downside is that the unwinder now becomes dependent on objtool's
ability to reverse engineer GCC code flow.  If GCC optimizations become
too complicated for objtool to follow, the ORC data generation might
stop working or become incomplete.  (It's worth noting that livepatch
already has such a dependency on objtool's ability to follow GCC code
flow.)

If newer versions of GCC come up with some optimizations which break
objtool, we may need to revisit the current implementation.  Some
possible solutions would be asking GCC to make the optimizations more
palatable, or having objtool use DWARF as an additional input, or
creating a GCC plugin to assist objtool with its analysis.  But for now,
objtool follows GCC code quite well.


Unwinder implementation details
===============================

Objtool generates the ORC data by integrating with the compile-time
stack metadata validation feature, which is described in detail in
tools/objtool/Documentation/stack-validation.txt.  After analyzing all
the code paths of a .o file, it creates an array of orc_entry structs,
and a parallel array of instruction addresses associated with those
structs, and writes them to the .orc_unwind and .orc_unwind_ip sections
respectively.
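
The parallel-array layout can be sketched roughly as follows.  This is a
simplified model, not the kernel's implementation: ``toy_orc_entry`` and
both functions are hypothetical names, the entry carries only two of
the fields a real entry would need, and absolute 64-bit addresses are
used purely for readability.  The point is that only the address array
is searched, and the resulting index then selects the matching entry.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's orc_entry. */
struct toy_orc_entry {
	int16_t sp_offset;
	int16_t bp_offset;
};

/*
 * Return the index of the last address in ip_table[] that is <= ip,
 * or -1 if ip lies before the first address.  ip_table[] must be
 * sorted in ascending order.
 */
long toy_orc_find_index(const uint64_t *ip_table, size_t num, uint64_t ip)
{
	size_t lo = 0, hi = num;	/* search the half-open range [lo, hi) */
	long found = -1;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ip_table[mid] <= ip) {
			found = (long)mid;
			lo = mid + 1;
		} else {
			hi = mid;
		}
	}
	return found;
}

/*
 * Only the compact address array is searched; the index found there
 * selects the entry in the parallel array.
 */
const struct toy_orc_entry *toy_orc_find(const uint64_t *ip_table,
					 const struct toy_orc_entry *entries,
					 size_t num, uint64_t ip)
{
	long idx = toy_orc_find_index(ip_table, num, ip);

	return idx < 0 ? NULL : &entries[idx];
}
```

Each entry describes every address from its own instruction address up
to, but not including, the next one in the table, which is why the
search looks for the last address that is less than or equal to the
target rather than for an exact match.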

The ORC data is split into the two arrays for performance reasons, to
make the searchable part of the data (.orc_unwind_ip) more compact.  The
arrays are sorted in parallel at boot time.

Performance is further improved by the use of a fast lookup table which
is created at runtime.  The fast lookup table associates a given address
with a range of indices for the .orc_unwind table, so that only a small
subset of the table needs to be searched.
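
One way such a fast lookup table could work is sketched below.  This is
an assumption-laden illustration rather than the kernel's code: the
256-byte block size, the ``toy_`` names and the use of absolute
addresses are all choices made for this example.  The text is divided
into fixed-size blocks, and for each block boundary the table records
the index of the entry covering that address, so a lookup only has to
search between two neighbouring table values.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_BLOCK_SIZE 256u	/* assumed block granularity, in bytes */

/*
 * Index of the last ip_table[] element within [first, last] that is
 * <= ip, falling back to first.  ip_table[] must be sorted ascending.
 */
size_t toy_find_in_range(const uint64_t *ip_table, size_t first, size_t last,
			 uint64_t ip)
{
	size_t found = first;
	size_t lo = first, hi = last + 1;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ip_table[mid] <= ip) {
			found = mid;
			lo = mid + 1;
		} else {
			hi = mid;
		}
	}
	return found;
}

/*
 * Fill lookup[0..num_blocks] (num_blocks + 1 slots): lookup[b] is the
 * index of the entry covering the first byte of block b.  Assumes
 * num >= 1 and ip_table[0] <= text_start.
 */
void toy_build_lookup(const uint64_t *ip_table, size_t num,
		      uint64_t text_start, size_t *lookup, size_t num_blocks)
{
	for (size_t b = 0; b <= num_blocks; b++)
		lookup[b] = toy_find_in_range(ip_table, 0, num - 1,
					      text_start + (uint64_t)b * TOY_BLOCK_SIZE);
}

/*
 * Look up ip by searching only the indices between two neighbouring
 * lookup values.  Assumes ip lies inside one of the num_blocks blocks.
 */
size_t toy_fast_find(const uint64_t *ip_table, const size_t *lookup,
		     uint64_t text_start, uint64_t ip)
{
	size_t b = (size_t)((ip - text_start) / TOY_BLOCK_SIZE);

	return toy_find_in_range(ip_table, lookup[b], lookup[b + 1], ip);
}
```

The table is built once, and after that each lookup costs one division
plus a binary search over the few entries that fall within a single
block, rather than over the whole address array.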


Etymology
=========

Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
enemies.  Similarly, the ORC unwinder was created in opposition to the
complexity and slowness of DWARF.

"Although Orcs rarely consider multiple solutions to a problem, they do
excel at getting things done because they are creatures of action, not
thought." [3]_  Similarly, unlike the esoteric DWARF unwinder, the
veracious ORC unwinder wastes no time or siliconic effort decoding
variable-length zero-extended unsigned-integer byte-coded
state-machine-based debug information entries.

Similar to how Orcs frequently unravel the well-intentioned plans of
their adversaries, the ORC unwinder frequently unravels stacks with
brutal, unyielding efficiency.

ORC stands for Oops Rewind Capability.


.. [1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
.. [2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
.. [3] http://dustin.wikidot.com/half-orcs-and-orcs