============================
Transactional Memory support
============================

POWER kernel support for this feature is currently limited to supporting
its use by user programs.  It is not currently used by the kernel itself.

This file aims to sum up how it is supported by Linux and what behaviour you
can expect from your user programs.


Basic overview
==============

Hardware Transactional Memory is supported on POWER8 processors, and is a
feature that enables a different form of atomic memory access.  Several new
instructions are presented to delimit transactions; transactions are
guaranteed to either complete atomically or roll back and undo any partial
changes.

A simple transaction looks like this::

  begin_move_money:
    tbegin
    beq   abort_handler

    ld    r4, SAVINGS_ACCT(r3)
    ld    r5, CURRENT_ACCT(r3)
    subi  r5, r5, 1
    addi  r4, r4, 1
    std   r4, SAVINGS_ACCT(r3)
    std   r5, CURRENT_ACCT(r3)

    tend

    b     continue

  abort_handler:
    ... test for odd failures ...

    /* Retry the transaction if it failed because it conflicted with
     * someone else: */
    b     begin_move_money


The 'tbegin' instruction denotes the start point, and 'tend' the end point.
Between these points the processor is in 'Transactional' state; any memory
references will complete in one go if there are no conflicts with other
transactional or non-transactional accesses within the system.  In this
example, the transaction completes as though it were normal straight-line code
if no other processor has touched SAVINGS_ACCT(r3) or CURRENT_ACCT(r3); an
atomic move of money from the current account to the savings account has been
performed.  Even though the normal ld/std instructions are used (note no
lwarx/stwcx), either *both* SAVINGS_ACCT(r3) and CURRENT_ACCT(r3) will be
updated, or neither will be updated.

If, in the meantime, there is a conflict with the locations accessed by the
transaction, the transaction will be aborted by the CPU.  Register and memory
state will roll back to that at the 'tbegin', and control will continue from
'tbegin+4'.  The branch to abort_handler will be taken this second time; the
abort handler can check the cause of the failure, and retry.

Checkpointed registers include all GPRs, FPRs, VRs/VSRs, LR, CCR/CR, CTR, FPSCR
and a few other status/flag regs; see the ISA for details.

Causes of transaction aborts
============================

- Conflicts with cache lines used by other processors
- Signals
- Context switches

See the ISA for full documentation of everything that will abort transactions.


Syscalls
========

Syscalls made from within an active transaction will not be performed and the
transaction will be doomed by the kernel with the failure code
TM_CAUSE_SYSCALL | TM_CAUSE_PERSISTENT.

Syscalls made from within a suspended transaction are performed as normal and
the transaction is not explicitly doomed by the kernel.  However, what the
kernel does to perform the syscall may result in the transaction being doomed
by the hardware.  The syscall is performed in suspended mode so any side
effects will be persistent, independent of transaction success or failure.  No
guarantees are provided by the kernel about which syscalls will affect
transaction success.

Care must be taken when relying on syscalls to abort during active transactions
if the calls are made via a library.  Libraries may cache values (which may
give the appearance of success) or perform operations that cause transaction
failure before entering the kernel (which may produce different failure codes).
Examples are glibc's getpid() and lazy symbol resolution.


Signals
=======

Delivery of signals (both sync and async) during transactions provides a second
thread state (ucontext/mcontext) to represent the second transactional register
state.  Signal delivery 'treclaim's to capture both register states, so signals
abort transactions.  The usual ucontext_t passed to the signal handler
represents the checkpointed/original register state; the signal appears to have
arisen at 'tbegin+4'.

If the sighandler ucontext has uc_link set, a second ucontext has been
delivered.  For future compatibility, the MSR.TS field should be checked to
determine the transactional state; if it shows that a transaction was in
progress, the second ucontext in uc->uc_link represents the active
transactional registers at the point of the signal.

For 64-bit processes, uc->uc_mcontext.regs->msr is a full 64-bit MSR and its TS
field shows the transactional mode.

For 32-bit processes, the mcontext's MSR register is only 32 bits; the top 32
bits are stored in the MSR of the second ucontext, i.e. in
uc->uc_link->uc_mcontext.regs->msr.  The top word contains the transactional
state TS.

However, basic signal handlers don't need to be aware of transactions;
simply returning from the handler will deal with things correctly.

Transaction-aware signal handlers can read the transactional register state
from the second ucontext.  This will be necessary for crash handlers to
determine, for example, the address of the instruction causing the SIGSEGV.

Example signal handler::

    void crash_handler(int sig, siginfo_t *si, void *uc)
    {
      ucontext_t *ucp = uc;
      ucontext_t *transactional_ucp = ucp->uc_link;

      if (transactional_ucp) {
        u64 msr = ucp->uc_mcontext.regs->msr;
        /* May have transactional ucontext! */
  #ifndef __powerpc64__
        /* 32-bit: the top MSR word lives in the second ucontext. */
        msr |= ((u64)transactional_ucp->uc_mcontext.regs->msr) << 32;
  #endif
        if (MSR_TM_ACTIVE(msr)) {
          /* Yes, we crashed during a transaction.  Oops. */
          fprintf(stderr, "Transaction to be restarted at 0x%llx, but "
                          "crashy instruction was at 0x%llx\n",
                          ucp->uc_mcontext.regs->nip,
                          transactional_ucp->uc_mcontext.regs->nip);
        }
      }

      /* The faulting data address is part of the saved register state. */
      fix_the_problem(ucp->uc_mcontext.regs->dar);
    }
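
The MSR_TM_ACTIVE() test used in the handler above is a kernel macro, so a
userspace handler would probably need its own equivalent.  The following is a
minimal sketch of such a check, not the kernel's definition.  It assumes that
TS occupies bits 33:34 counting from the least significant bit (MSR 29:30 in
ISA numbering); every name in it is hypothetical and the bit positions should
be verified against the ISA:

```c
#include <stdint.h>

/* Assumed layout, LSB-0 numbering: bit 33 = suspended, bit 34 = transactional.
 * The SKETCH_ names are illustrative and do not come from any header. */
#define SKETCH_MSR_TS_SHIFT 33
#define SKETCH_MSR_TS_MASK  (UINT64_C(3) << SKETCH_MSR_TS_SHIFT)

enum sketch_tm_state {
    SKETCH_TM_NONE = 0,          /* TS = 0b00 */
    SKETCH_TM_SUSPENDED = 1,     /* TS = 0b01 */
    SKETCH_TM_TRANSACTIONAL = 2, /* TS = 0b10 */
    SKETCH_TM_RESERVED = 3       /* TS = 0b11 */
};

/* 32-bit processes: join the low word (first ucontext) with the
 * high word (second ucontext) into one 64-bit MSR value. */
static inline uint64_t sketch_msr_from_halves(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}

/* Extract the two TS bits as a state value. */
static inline enum sketch_tm_state sketch_msr_tm_state(uint64_t msr)
{
    return (enum sketch_tm_state)((msr & SKETCH_MSR_TS_MASK) >> SKETCH_MSR_TS_SHIFT);
}

/* Non-zero when the thread was transactional or suspended. */
static inline int sketch_msr_tm_active(uint64_t msr)
{
    return (msr & SKETCH_MSR_TS_MASK) != 0;
}
```

A 32-bit handler would pass the two MSR words to sketch_msr_from_halves()
before testing the result; a 64-bit handler can test the MSR directly.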

When in an active transaction that takes a signal, we need to be careful with
the stack.  It's possible that the stack has moved back up after the tbegin.
The obvious case here is when the tbegin is called inside a function that
returns before a tend.  In this case, the stack is part of the checkpointed
transactional memory state.  If we write over this non-transactionally or in
suspend, we are in trouble because if we get a TM abort, the program counter and
stack pointer will be back at the tbegin but our in-memory stack won't be valid
anymore.

To avoid this, when taking a signal in an active transaction, we need to use
the stack pointer from the checkpointed state, rather than the speculated
state.  This ensures that the signal context (written while TM is suspended)
will be written below the stack required for the rollback.  The transaction is
aborted because of the treclaim, so any memory written between the tbegin and
the signal will be rolled back anyway.

For signals taken in non-TM or suspended mode, we use the
normal/non-checkpointed stack pointer.
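
The stack pointer rule above can be sketched as a small selection function.
This only illustrates the rule as described here and is not the kernel's
code; the structure and all names are hypothetical:

```c
#include <stdint.h>

/* Hypothetical view of the two register states that exist during a
 * transaction; only the stack pointer (r1) matters for this sketch. */
struct sketch_tm_thread {
    uint64_t live_sp;         /* r1 in the speculative (current) state */
    uint64_t checkpointed_sp; /* r1 as it was at the tbegin */
    unsigned ts;              /* MSR.TS: 0 = none, 1 = suspended, 2 = transactional */
};

/* Pick the stack pointer below which the signal frame is written. */
static inline uint64_t sketch_signal_frame_sp(const struct sketch_tm_thread *t)
{
    /* Only an active transaction (TS = 0b10) uses the checkpointed value,
     * so the frame cannot overwrite stack needed after a rollback. */
    if (t->ts == 2)
        return t->checkpointed_sp;

    /* Non-TM and suspended mode use the normal stack pointer. */
    return t->live_sp;
}
```

In the problem case described above, the function containing the tbegin has
already returned, so live_sp sits above checkpointed_sp and a frame written
below live_sp would land on stack that the rollback still needs.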

Any transaction initiated inside a sighandler and suspended on return
from the sighandler to the kernel will get reclaimed and discarded.

Failure cause codes used by kernel
==================================

These are defined in <asm/reg.h>, and distinguish different reasons why the
kernel aborted a transaction:

 ====================== ================================
 TM_CAUSE_RESCHED       Thread was rescheduled.
 TM_CAUSE_TLBI          Software TLB invalidation.
 TM_CAUSE_FAC_UNAV      FP/VEC/VSX unavailable trap.
 TM_CAUSE_SYSCALL       Syscall from active transaction.
 TM_CAUSE_SIGNAL        Signal delivered.
 TM_CAUSE_MISC          Currently unused.
 TM_CAUSE_ALIGNMENT     Alignment fault.
 TM_CAUSE_EMULATE       Emulation that touched memory.
 ====================== ================================

These can be checked by the user program's abort handler as TEXASR[0:7].  If
bit 7 is set, it indicates that the error is considered persistent.  For example
a TM_CAUSE_ALIGNMENT will be persistent while a TM_CAUSE_RESCHED will not.
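
As a rough illustration of how an abort handler might interpret these bits,
the sketch below decodes a TEXASR value that has already been read into a
64-bit integer.  TEXASR[0:7] uses the ISA's numbering, where bit 0 is the
most significant bit, so the failure code is the top byte and bit 7 is that
byte's lowest bit.  The numeric values are assumptions believed to match
<asm/reg.h> and should be verified against your kernel headers; the SKETCH_
names and helper functions are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; check <asm/reg.h> before relying on them. */
#define SKETCH_TM_CAUSE_PERSISTENT 0x01
#define SKETCH_TM_CAUSE_RESCHED    0xde
#define SKETCH_TM_CAUSE_SYSCALL    0xd8
#define SKETCH_TM_CAUSE_ALIGNMENT  0xd2

/* TEXASR[0:7]: the most significant byte of the 64-bit register. */
static inline uint8_t sketch_texasr_failure_code(uint64_t texasr)
{
    return (uint8_t)(texasr >> 56);
}

/* The cause with the persistent flag (bit 7) masked off. */
static inline uint8_t sketch_texasr_cause(uint64_t texasr)
{
    return sketch_texasr_failure_code(texasr) & (uint8_t)~SKETCH_TM_CAUSE_PERSISTENT;
}

/* True when bit 7 marks the failure as persistent. */
static inline bool sketch_texasr_persistent(uint64_t texasr)
{
    return (sketch_texasr_failure_code(texasr) & SKETCH_TM_CAUSE_PERSISTENT) != 0;
}

/* One possible policy: only retry failures not marked persistent. */
static inline bool sketch_should_retry(uint64_t texasr)
{
    return !sketch_texasr_persistent(texasr);
}
```

Under this policy a transaction doomed by a syscall
(TM_CAUSE_SYSCALL | TM_CAUSE_PERSISTENT) would fall back to a
non-transactional path, while one aborted by a reschedule would be retried.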

GDB
===

GDB and ptrace are not currently TM-aware.  If one stops during a transaction,
it looks like the transaction has just started (the checkpointed state is
presented).  The transaction cannot then be continued and will take the failure
handler route.  Furthermore, the transactional second register state will be
inaccessible.  GDB can currently be used on programs using TM, but not sensibly
in parts within transactions.

POWER9
======

TM on POWER9 has issues with storing the complete register state. This
is described in this commit::

    commit 4bb3c7a0208fc13ca70598efd109901a7cd45ae7
    Author: Paul Mackerras <paulus@ozlabs.org>
    Date:   Wed Mar 21 21:32:01 2018 +1100
    KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9

To account for this, different POWER9 chips have TM enabled in
different ways.

On POWER9N DD2.01 and below, TM is disabled, i.e.
HWCAP2[PPC_FEATURE2_HTM] is not set.

On POWER9N DD2.1, TM is configured by firmware to always abort a
transaction when TM suspend occurs. So tsuspend will cause a
transaction to be aborted and rolled back. Kernel exceptions will also
cause the transaction to be aborted and rolled back and the exception
will not occur. If userspace constructs a sigcontext that enables TM
suspend, the sigcontext will be rejected by the kernel. This mode is
advertised to users with HWCAP2[PPC_FEATURE2_HTM_NO_SUSPEND] set.
HWCAP2[PPC_FEATURE2_HTM] is not set in this mode.

On POWER9N DD2.2 and above, KVM and POWERVM emulate TM for guests (as
described in commit 4bb3c7a0208f), hence TM is enabled for guests,
i.e. HWCAP2[PPC_FEATURE2_HTM] is set for guest userspace. Guests that
make heavy use of TM suspend (tsuspend or kernel suspend) will trap
into the hypervisor and hence will suffer a performance
degradation. Host userspace has TM disabled,
i.e. HWCAP2[PPC_FEATURE2_HTM] is not set (although we may enable it
at some point in the future if we bring the emulation into host
userspace context switching).

POWER9C DD1.2 and above are only available with POWERVM and hence
Linux only runs as a guest. On these systems TM is emulated like on
POWER9N DD2.2.

Guest migration from POWER8 to POWER9 will work with POWER9N DD2.2 and
POWER9C DD1.2. Since earlier POWER9 processors don't support TM
emulation, migration from POWER8 to POWER9 is not supported there.
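
A user program could tell these modes apart from its HWCAP2 value, roughly
as sketched below.  The two bit values are assumptions believed to match the
powerpc <asm/cputable.h> header and should be checked there; the SKETCH_
names, the enum and the function are hypothetical:

```c
/* Assumed HWCAP2 bit values; verify against <asm/cputable.h>. */
#define SKETCH_PPC_FEATURE2_HTM            0x40000000UL
#define SKETCH_PPC_FEATURE2_HTM_NO_SUSPEND 0x00080000UL

enum sketch_htm_mode {
    SKETCH_HTM_UNAVAILABLE, /* neither bit set: TM is disabled */
    SKETCH_HTM_NO_SUSPEND,  /* TM usable, but any suspend aborts the transaction */
    SKETCH_HTM_FULL         /* TM with suspend available (possibly emulated) */
};

/* Classify a HWCAP2 value according to the modes described above. */
static inline enum sketch_htm_mode sketch_htm_mode_from_hwcap2(unsigned long hwcap2)
{
    if (hwcap2 & SKETCH_PPC_FEATURE2_HTM)
        return SKETCH_HTM_FULL;
    if (hwcap2 & SKETCH_PPC_FEATURE2_HTM_NO_SUSPEND)
        return SKETCH_HTM_NO_SUSPEND;
    return SKETCH_HTM_UNAVAILABLE;
}
```

On Linux the HWCAP2 value can be obtained with getauxval(AT_HWCAP2) from
<sys/auxv.h> and passed to this function.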

Kernel implementation
=====================

h/rfid mtmsrd quirk
-------------------

As defined in the ISA, rfid has a quirk which is useful in early
exception handling. When in a userspace transaction and we enter the
kernel via some exception, MSR will end up as TM=0 and TS=01 (i.e. TM
off but TM suspended). Regularly the kernel will want to change bits in
the MSR and will perform an rfid to do this. In this case rfid can
have SRR1 TM=0 and TS=00 (i.e. TM off and non-transactional) and the
resulting MSR will retain TM=0 and TS=01 from before (i.e. stay in
suspend). This is a quirk in the architecture as this would normally
be a transition from TS=01 to TS=00 (i.e. suspend -> non-transactional)
which is an illegal transition.

This quirk is described in the architecture in the definition of rfid
with these lines::

  if (MSR 29:31 ¬= 0b010 | SRR1 29:31 ¬= 0b000) then
     MSR 29:31 <- SRR1 29:31

hrfid and mtmsrd have the same quirk.
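
The rule quoted above can be modelled in a few lines of C.  This is only a
sketch of that one conditional assignment, treating MSR 29:31 as a 3-bit
value made of TS (two bits) followed by TM (one bit); it does not model any
other checks the hardware performs, and the function name is hypothetical:

```c
/* Model of the quoted rule for MSR 29:31.  Both arguments are 3-bit
 * values laid out as TS (2 bits) then TM (1 bit), so 0x2 is 0b010,
 * i.e. TS=01 (suspended) with TM=0. */
static inline unsigned sketch_rfid_ts_tm(unsigned msr_ts_tm, unsigned srr1_ts_tm)
{
    /* Normal case: the field is loaded from SRR1. */
    if (msr_ts_tm != 0x2 || srr1_ts_tm != 0x0)
        return srr1_ts_tm & 0x7;

    /* Quirk: suspended with TM off, and SRR1 asks for TS=00, TM=0.
     * The MSR field is left alone, so the thread stays suspended. */
    return msr_ts_tm;
}
```

With the MSR field at 0b010 and the SRR1 field at 0b000 the function
returns 0b010, which is the "stay in suspend" behaviour described above.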

The Linux kernel uses this quirk in its early exception handling.