============================
Transactional Memory support
============================

POWER kernel support for this feature is currently limited to supporting
its use by user programs. It is not currently used by the kernel itself.

This file aims to sum up how it is supported by Linux and what behaviour you
can expect from your user programs.


Basic overview
==============

Hardware Transactional Memory is supported on POWER8 processors, and is a
feature that enables a different form of atomic memory access. Several new
instructions are provided to delimit transactions; transactions are
guaranteed to either complete atomically or roll back and undo any partial
changes.

A simple transaction looks like this::

  begin_move_money:
    tbegin
    beq   abort_handler

    ld    r4, SAVINGS_ACCT(r3)
    ld    r5, CURRENT_ACCT(r3)
    subi  r5, r5, 1
    addi  r4, r4, 1
    std   r4, SAVINGS_ACCT(r3)
    std   r5, CURRENT_ACCT(r3)

    tend

    b     continue

  abort_handler:
    ... test for odd failures ...

    /* Retry the transaction if it failed because it conflicted with
     * someone else: */
    b     begin_move_money


The 'tbegin' instruction denotes the start point, and 'tend' the end point.
Between these points the processor is in 'Transactional' state; any memory
references will complete in one go if there are no conflicts with other
transactional or non-transactional accesses within the system. In this
example, the transaction completes as though it were normal straight-line code
IF no other processor has touched SAVINGS_ACCT(r3) or CURRENT_ACCT(r3); an
atomic move of money from the current account to the savings account has been
performed. Even though the normal ld/std instructions are used (note no
lwarx/stwcx), either *both* SAVINGS_ACCT(r3) and CURRENT_ACCT(r3) will be
updated, or neither will be updated.

If, in the meantime, there is a conflict with the locations accessed by the
transaction, the transaction will be aborted by the CPU. Register and memory
state will roll back to that at the 'tbegin', and control will continue from
'tbegin+4'. The branch to abort_handler will be taken this second time; the
abort handler can check the cause of the failure, and retry.

Checkpointed registers include all GPRs, FPRs, VRs/VSRs, LR, CCR/CR, CTR, FPSCR
and a few other status/flag regs; see the ISA for details.
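
The begin/retry structure described above can also be sketched in C. This is
a minimal sketch, not a reference implementation: it assumes GCC's PowerPC HTM
built-ins (``__builtin_tbegin``, ``__builtin_tend``, ``__builtin_tabort``,
available when building with ``-mhtm``), and the ``TX_*`` macros,
``struct accounts``, ``fallback_lock`` and ``move_money()`` are hypothetical
names introduced only for illustration. On builds without HTM every
transactional attempt is treated as failed, so only the lock-based fallback
path runs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#if defined(__powerpc64__) && defined(__HTM__)
#include <htmintrin.h>
#define TX_BEGIN() __builtin_tbegin(0)        /* nonzero once the transaction has started */
#define TX_END()   ((void)__builtin_tend(0))
#define TX_ABORT() ((void)__builtin_tabort(0))
#else
/* Assumption: no HTM in this build, so every attempt counts as failed. */
#define TX_BEGIN() 0
#define TX_END()   ((void)0)
#define TX_ABORT() ((void)0)
#endif

struct accounts {
    long savings;
    long current;
};

/* Hypothetical fallback lock: 0 = free, 1 = held. */
static atomic_int fallback_lock;

/*
 * Move 'amount' from the current account to the savings account.
 * Returns 1 if the update committed as a transaction, 0 if the
 * fallback lock had to be used.
 */
int move_money(struct accounts *a, long amount, int max_attempts)
{
    assert(a != NULL && max_attempts >= 0);

    for (int i = 0; i < max_attempts; i++) {
        if (TX_BEGIN()) {
            /* Reading the lock makes this transaction conflict with
             * any thread that takes the fallback path meanwhile. */
            if (atomic_load_explicit(&fallback_lock, memory_order_relaxed))
                TX_ABORT();
            a->current -= amount;
            a->savings += amount;
            TX_END();
            return 1;
        }
        /* Aborted: state was rolled back and execution resumed after
         * the tbegin with a "failed" result, so try again. */
    }

    /* Too many failed attempts: take the lock and update normally. */
    while (atomic_exchange_explicit(&fallback_lock, 1, memory_order_acquire))
        ;
    a->current -= amount;
    a->savings += amount;
    atomic_store_explicit(&fallback_lock, 0, memory_order_release);
    return 0;
}
```

A caller would invoke, for example, ``move_money(&acct, 1, 3)``. Bounding the
number of attempts is a design choice of this sketch; a real abort handler
would normally also inspect the failure cause before retrying.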

Causes of transaction aborts
============================

- Conflicts with cache lines used by other processors
- Signals
- Context switches
- See the ISA for full documentation of everything that will abort transactions.


Syscalls
========

Syscalls made from within an active transaction will not be performed and the
transaction will be doomed by the kernel with the failure code TM_CAUSE_SYSCALL
| TM_CAUSE_PERSISTENT.

Syscalls made from within a suspended transaction are performed as normal and
the transaction is not explicitly doomed by the kernel. However, what the
kernel does to perform the syscall may result in the transaction being doomed
by the hardware. The syscall is performed in suspended mode so any side
effects will be persistent, independent of transaction success or failure. No
guarantees are provided by the kernel about which syscalls will affect
transaction success.

Care must be taken when relying on syscalls to abort during active transactions
if the calls are made via a library. Libraries may cache values (which may
give the appearance of success) or perform operations that cause transaction
failure before entering the kernel (which may produce different failure codes).
Examples are glibc's getpid() and lazy symbol resolution.
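
An abort handler can recognise the TM_CAUSE_SYSCALL | TM_CAUSE_PERSISTENT
failure code by decoding TEXASR[0:7], which in IBM bit numbering is the most
significant byte of the 64-bit register. The following is a minimal sketch
under stated assumptions: the numeric values of the ``TM_CAUSE_*`` constants
are believed to match the kernel headers but should be checked against the
headers you build with, and the helper names and the retry policy are
hypothetical. How the TEXASR value is obtained is outside the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; verify against the kernel headers you build with. */
#define TM_CAUSE_PERSISTENT 0x01u
#define TM_CAUSE_RESCHED    0xdeu
#define TM_CAUSE_SYSCALL    0xd8u

/* TEXASR[0:7] (IBM numbering) is the top byte of the register. */
unsigned tm_failure_code(uint64_t texasr)
{
    return (unsigned)(texasr >> 56);
}

/* Bit 7 of the failure code, the least significant bit of that byte. */
int tm_is_persistent(unsigned code)
{
    return (code & TM_CAUSE_PERSISTENT) != 0;
}

/* The cause with the persistent bit masked off. */
unsigned tm_cause(unsigned code)
{
    return code & 0xfeu;
}

/*
 * Hypothetical retry policy: retry only while attempts remain and the
 * failure is not marked persistent.
 */
int tm_should_retry(uint64_t texasr, int attempts_left)
{
    assert(attempts_left >= 0);
    return attempts_left > 0 && !tm_is_persistent(tm_failure_code(texasr));
}
```

With this policy, a transaction doomed by a syscall is not retried, because
retrying a persistent failure would be expected to fail in the same way.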


Signals
=======

Delivery of signals (both sync and async) during transactions provides a second
thread state (ucontext/mcontext) to represent the second transactional register
state. Signal delivery 'treclaim's to capture both register states, so signals
abort transactions. The usual ucontext_t passed to the signal handler
represents the checkpointed/original register state; the signal appears to have
arisen at 'tbegin+4'.

If the sighandler ucontext has uc_link set, a second ucontext has been
delivered. For future compatibility the MSR.TS field should be checked to
determine the transactional state; if it shows that a transaction was in
progress, the second ucontext in uc->uc_link represents the active
transactional registers at the point of the signal.

For 64-bit processes, uc->uc_mcontext.regs->msr is a full 64-bit MSR and its TS
field shows the transactional mode.

For 32-bit processes, the mcontext's MSR register is only 32 bits; the top 32
bits are stored in the MSR of the second ucontext, i.e. in
uc->uc_link->uc_mcontext.regs->msr. The top word contains the transactional
state TS.

However, basic signal handlers don't need to be aware of transactions
and simply returning from the handler will deal with things correctly.

Transaction-aware signal handlers can read the transactional register state
from the second ucontext.
This will be necessary for crash handlers to
determine, for example, the address of the instruction causing the SIGSEGV.

Example signal handler::

    void crash_handler(int sig, siginfo_t *si, void *uc)
    {
      ucontext_t *ucp = uc;
      ucontext_t *transactional_ucp = ucp->uc_link;

      /* uc_link is only set when a second ucontext was delivered. */
      if (transactional_ucp) {
        u64 msr = ucp->uc_mcontext.regs->msr;
        /* May have transactional ucontext! */
    #ifndef __powerpc64__
        /* 32-bit: the top word of the MSR lives in the second ucontext. */
        msr |= ((u64)transactional_ucp->uc_mcontext.regs->msr) << 32;
    #endif
        if (MSR_TM_ACTIVE(msr)) {
           /* Yes, we crashed during a transaction. Oops. */
           fprintf(stderr, "Transaction to be restarted at 0x%llx, but "
                           "crashy instruction was at 0x%llx\n",
                           ucp->uc_mcontext.regs->nip,
                           transactional_ucp->uc_mcontext.regs->nip);
        }
      }

      fix_the_problem(ucp->uc_mcontext.regs->dar);
    }

When in an active transaction that takes a signal, we need to be careful with
the stack. It's possible that the stack has moved back up after the tbegin.
The obvious case here is when the tbegin is called inside a function that
returns before a tend. In this case, the stack is part of the checkpointed
transactional memory state.
If we write over this non-transactionally or in
suspend, we are in trouble because if we get a TM abort, the program counter and
stack pointer will be back at the tbegin but our in-memory stack won't be valid
anymore.

To avoid this, when taking a signal in an active transaction, we need to use
the stack pointer from the checkpointed state, rather than the speculated
state. This ensures that the signal context (written TM suspended) will be
written below the stack required for the rollback. The transaction is aborted
because of the treclaim, so any memory written between the tbegin and the
signal will be rolled back anyway.

For signals taken in non-TM or suspended mode, we use the
normal/non-checkpointed stack pointer.

Any transaction initiated inside a sighandler and suspended on return
from the sighandler to the kernel will get reclaimed and discarded.

Failure cause codes used by kernel
==================================

These are defined in <asm/reg.h>, and distinguish different reasons why the
kernel aborted a transaction:

 ====================== ================================
 TM_CAUSE_RESCHED       Thread was rescheduled.
 TM_CAUSE_TLBI          Software TLB invalidation.
 TM_CAUSE_FAC_UNAV      FP/VEC/VSX unavailable trap.
 TM_CAUSE_SYSCALL       Syscall from active transaction.
 TM_CAUSE_SIGNAL        Signal delivered.
 TM_CAUSE_MISC          Currently unused.
 TM_CAUSE_ALIGNMENT     Alignment fault.
 TM_CAUSE_EMULATE       Emulation that touched memory.
 ====================== ================================

These can be checked by the user program's abort handler as TEXASR[0:7]. If
bit 7 is set, it indicates that the error is considered persistent. For example
a TM_CAUSE_ALIGNMENT will be persistent while a TM_CAUSE_RESCHED will not.

GDB
===

GDB and ptrace are not currently TM-aware. If one stops during a transaction,
it looks like the transaction has just started (the checkpointed state is
presented). The transaction cannot then be continued and will take the failure
handler route. Furthermore, the transactional 2nd register state will be
inaccessible. GDB can currently be used on programs using TM, but not sensibly
in parts within transactions.

POWER9
======

TM on POWER9 has issues with storing the complete register state. This
is described in this commit::

    commit 4bb3c7a0208fc13ca70598efd109901a7cd45ae7
    Author: Paul Mackerras <paulus@ozlabs.org>
    Date:   Wed Mar 21 21:32:01 2018 +1100
    KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9

To account for this, different POWER9 chips have TM enabled in
different ways.

On POWER9N DD2.01 and below, TM is disabled, i.e.
HWCAP2[PPC_FEATURE2_HTM] is not set.

On POWER9N DD2.1 TM is configured by firmware to always abort a
transaction when TM suspend occurs. So tsuspend will cause a
transaction to be aborted and rolled back. Kernel exceptions will also
cause the transaction to be aborted and rolled back and the exception
will not occur. If userspace constructs a sigcontext that enables TM
suspend, the sigcontext will be rejected by the kernel. This mode is
advertised to users with HWCAP2[PPC_FEATURE2_HTM_NO_SUSPEND] set.
HWCAP2[PPC_FEATURE2_HTM] is not set in this mode.

On POWER9N DD2.2 and above, KVM and POWERVM emulate TM for guests (as
described in commit 4bb3c7a0208f), hence TM is enabled for guests,
i.e. HWCAP2[PPC_FEATURE2_HTM] is set for guest userspace. Guests that
make heavy use of TM suspend (tsuspend or kernel suspend) will cause
traps into the hypervisor and hence will suffer a performance
degradation. Host userspace has TM disabled,
i.e. HWCAP2[PPC_FEATURE2_HTM] is not set (although we may enable it
at some point in the future if we bring the emulation into host
userspace context switching).

POWER9C DD1.2 and above are only available with POWERVM and hence
Linux only runs as a guest. On these systems TM is emulated like on
POWER9N DD2.2.

Guest migration from POWER8 to POWER9 will work with POWER9N DD2.2 and
POWER9C DD1.2. Since earlier POWER9 processors don't support TM
emulation, migration from POWER8 to POWER9 is not supported there.
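
A user program can tell these modes apart from the HWCAP2 bits named above.
The following is a minimal sketch, assuming glibc's ``getauxval()`` with
``AT_HWCAP2`` on Linux/powerpc; the fallback bit values are assumptions to be
checked against your system headers, and ``enum tm_mode`` and the helper
functions are hypothetical names. On other platforms the query helper simply
reports that TM is unavailable.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__linux__) && defined(__powerpc__)
#include <sys/auxv.h>
#endif

/* Assumed bit values, used only if the system headers do not define them. */
#ifndef PPC_FEATURE2_HTM
#define PPC_FEATURE2_HTM            0x40000000
#endif
#ifndef PPC_FEATURE2_HTM_NO_SUSPEND
#define PPC_FEATURE2_HTM_NO_SUSPEND 0x00080000
#endif

enum tm_mode { TM_NONE, TM_NO_SUSPEND, TM_FULL };

/* Map an HWCAP2 word onto the modes described in the text. */
enum tm_mode classify_tm(unsigned long hwcap2)
{
    if (hwcap2 & PPC_FEATURE2_HTM)
        return TM_FULL;
    if (hwcap2 & PPC_FEATURE2_HTM_NO_SUSPEND)
        return TM_NO_SUSPEND;
    return TM_NONE;
}

/* Query the running system; reports TM_NONE off Linux/powerpc. */
enum tm_mode current_tm_mode(void)
{
#if defined(__linux__) && defined(__powerpc__)
    return classify_tm(getauxval(AT_HWCAP2));
#else
    return TM_NONE;
#endif
}

const char *tm_mode_name(enum tm_mode mode)
{
    switch (mode) {
    case TM_FULL:       return "TM available";
    case TM_NO_SUSPEND: return "TM available, suspend aborts the transaction";
    case TM_NONE:       return "TM not available";
    }
    assert(!"unknown tm_mode");
    return NULL;
}
```

A program would typically call ``current_tm_mode()`` once at start-up and
select a non-transactional code path when it returns ``TM_NONE``.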

Kernel implementation
=====================

h/rfid mtmsrd quirk
-------------------

As defined in the ISA, rfid has a quirk which is useful in early
exception handling. When in a userspace transaction and we enter the
kernel via some exception, MSR will end up as TM=0 and TS=01 (i.e. TM
off but TM suspended). Regularly the kernel will want to change bits in
the MSR and will perform an rfid to do this. In this case rfid can
have SRR1 TM=0 and TS=00 (i.e. TM off and non-transactional) and the
resulting MSR will retain TM=0 and TS=01 from before (i.e. stay in
suspend). This is a quirk in the architecture as this would normally
be a transition from TS=01 to TS=00 (i.e. suspend -> non-transactional)
which is an illegal transition.

This quirk is described in the architecture in the definition of rfid
with these lines::

  if (MSR 29:31 ¬= 0b010 | SRR1 29:31 ¬= 0b000) then
      MSR 29:31 <- SRR1 29:31

hrfid and mtmsrd have the same quirk.

The Linux kernel uses this quirk in its early exception handling.
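
The effect of the two quoted lines can be checked with a small C model. This
is an illustrative sketch only: it models nothing of rfid beyond the quoted
rule, and the function names are hypothetical. MSR bits 29:31 (IBM numbering)
are the two TS bits followed by TM, so the three-bit value 0b010 (0x2) means
TS=01 with TM=0.

```c
#include <assert.h>
#include <stdint.h>

/* Extract MSR bits 29:31 (IBM numbering): TS in the upper two bits, TM in
 * the lowest. In conventional numbering these are bits 34..32. */
unsigned msr_ts_tm(uint64_t msr)
{
    return (unsigned)((msr >> 32) & 0x7);
}

/*
 * Model of the quoted rule only. Takes the current MSR[29:31] and the
 * SRR1[29:31] value, returns the resulting MSR[29:31].
 */
unsigned rfid_ts_tm(unsigned msr_bits, unsigned srr1_bits)
{
    assert(msr_bits <= 0x7 && srr1_bits <= 0x7);

    /* if (MSR 29:31 != 0b010 | SRR1 29:31 != 0b000) then copy from SRR1 */
    if (msr_bits != 0x2 || srr1_bits != 0x0)
        return srr1_bits;

    /* Suspended with TM off, and SRR1 asks for 0b000: MSR keeps 0b010. */
    return msr_bits;
}
```

In this model, ``rfid_ts_tm(0x2, 0x0)`` leaves the thread suspended, while
every other combination takes the TS and TM values from SRR1.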