.. SPDX-License-Identifier: GPL-2.0

============================
Glock internal locking rules
============================

This documents the basic principles of the glock state machine
internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
has two main (internal) locks:

 1. A spinlock (gl_lockref.lock) which protects the internal state such
    as gl_state, gl_target and the list of holders (gl_holders)
 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
    threads from making calls to the DLM, etc. at the same time. If a
    thread takes this lock, it must then call run_queue (usually via the
    workqueue) when it releases it in order to ensure any pending tasks
    are completed.

The gl_holders list contains all the queued lock requests (not
just the holders) associated with the glock. If there are any
held locks, then they will be contiguous entries at the head
of the list. Locks are granted in strictly the order that they
are queued, except for those marked LM_FLAG_PRIORITY which are
used only during recovery, and even then only for journal locks.

There are three lock states that users of the glock layer can request,
namely shared (SH), deferred (DF) and exclusive (EX).
Those translate to the following DLM lock modes:

==========  ====== =====================================================
Glock mode  DLM    lock mode
==========  ====== =====================================================
    UN      IV/NL  Unlocked (no DLM lock associated with glock) or NL
    SH      PR     (Protected read)
    DF      CW     (Concurrent write)
    EX      EX     (Exclusive)
==========  ====== =====================================================

Thus DF is basically a shared mode which is incompatible with the "normal"
shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. The glocks are basically a lock plus some routines which deal
with cache management. The following rules apply for the cache:

==========  ==========  ==============  ==========  ==============
Glock mode  Cache data  Cache Metadata  Dirty Data  Dirty Metadata
==========  ==========  ==============  ==========  ==============
UN          No          No              No          No
SH          Yes         Yes             No          No
DF          No          Yes             No          No
EX          Yes         Yes             Yes         Yes
==========  ==========  ==============  ==========  ==============

These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
all the modes. Only inode glocks use the DF mode for example.
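
As an illustration only, the two tables above can be written down as lookup
tables. The sketch below is not GFS2 code: every Python name in it
(``DLM_MODE``, ``CACHE_RULES``, ``CachePolicy``, ``lost_permissions``) is
hypothetical. It simply shows which cache permissions disappear when a glock
moves from one mode to another, which is the situation the glock operations
have to handle by syncing or invalidating.

```python
# Illustrative transcription of the mode and cache tables above.
# None of these names exist in the GFS2 source; they are assumptions
# made for this sketch.
from typing import List, NamedTuple


class CachePolicy(NamedTuple):
    cache_data: bool
    cache_metadata: bool
    dirty_data: bool
    dirty_metadata: bool


# Glock mode -> DLM lock mode, as listed in the first table.
DLM_MODE = {"UN": "IV/NL", "SH": "PR", "DF": "CW", "EX": "EX"}

# Glock mode -> what may be cached or dirtied, as listed in the second table.
CACHE_RULES = {
    "UN": CachePolicy(False, False, False, False),
    "SH": CachePolicy(True, True, False, False),
    "DF": CachePolicy(False, True, False, False),
    "EX": CachePolicy(True, True, True, True),
}


def may_dirty(mode: str) -> bool:
    """Return True if the mode permits dirty data or dirty metadata."""
    policy = CACHE_RULES[mode]
    return policy.dirty_data or policy.dirty_metadata


def lost_permissions(old: str, new: str) -> List[str]:
    """List the permissions allowed in mode `old` but not in mode `new`."""
    before, after = CACHE_RULES[old], CACHE_RULES[new]
    return [
        field
        for field in CachePolicy._fields
        if getattr(before, field) and not getattr(after, field)
    ]
```

For example, ``lost_permissions("EX", "SH")`` reports that dirty data and
dirty metadata are no longer permitted, while ``lost_permissions("SH", "DF")``
reports that cached data is no longer permitted.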

Table of glock operations and per type constants:

=============  =============================================================
Field          Purpose
=============  =============================================================
go_xmote_th    Called before remote state change (e.g. to sync dirty data)
go_xmote_bh    Called after remote state change (e.g. to refill cache)
go_inval       Called if remote state change requires invalidating the cache
go_demote_ok   Returns boolean value of whether it's OK to demote a glock
               (e.g. checks timeout, and that there is no cached data)
go_lock        Called for the first local holder of a lock
go_unlock      Called on the final local unlock of a lock
go_dump        Called to print content of object for debugfs file, or on
               error to dump glock to the log.
go_type        The type of the glock, ``LM_TYPE_*``
go_callback    Called if the DLM sends a callback to drop this lock
go_flags       GLOF_ASPACE is set, if the glock has an address space
               associated with it
=============  =============================================================

The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
prevent a situation where locks are being bounced around the cluster
from node to node with none of the nodes making any progress. This
tends to show up most with shared mmapped files which are being written
to by multiple nodes.
Delaying the demotion in response to a remote callback gives the
userspace program time to make some progress before the pages are
unmapped.

There is a plan to try and remove the go_lock and go_unlock callbacks
if possible, in order to try and speed up the fast path through the locking.
Also, eventually we hope to make the glock "EX" mode locally shared
such that any local locking will be done with the i_mutex as required
rather than via the glock.

Locking rules for glock operations:

=============  ======================  =============================
Operation      GLF_LOCK bit lock held  gl_lockref.lock spinlock held
=============  ======================  =============================
go_xmote_th    Yes                     No
go_xmote_bh    Yes                     No
go_inval       Yes                     No
go_demote_ok   Sometimes               Yes
go_lock        Yes                     No
go_unlock      Yes                     No
go_dump        Sometimes               Yes
go_callback    Sometimes (N/A)         Yes
=============  ======================  =============================

.. Note::

   Operations must not drop either the bit lock or the spinlock
   if it is held on entry. go_dump and go_demote_ok must never block.
   Note that go_dump will only be called if the glock's state
   indicates that it is caching uptodate data.

Glock locking order within GFS2:

 1. i_rwsem (if required)
 2. Rename glock (for rename only)
 3. Inode glock(s)
    (Parents before children, inodes at "same level" with same parent in
    lock number order)
 4. Rgrp glock(s) (for (de)allocation operations)
 5. Transaction glock (via gfs2_trans_begin) for non-read operations
 6. i_rw_mutex (if required)
 7. Page lock (always last, very important!)

There are two glocks per inode. One deals with access to the inode
itself (locking order as above), and the other, known as the iopen
glock, is used in conjunction with the i_nlink field in the inode to
determine the lifetime of the inode in question. Locking of inodes
is on a per-inode basis. Locking of rgrps is on a per rgrp basis.
In general we prefer to lock local locks prior to cluster locks.

Glock Statistics
----------------

The stats are divided into two sets: those relating to the
super block and those relating to an individual glock. The
super block stats are done on a per cpu basis in order to
try and reduce the overhead of gathering them. They are also
further divided by glock type. All timings are in nanoseconds.

In the case of both the super block and glock statistics,
the same information is gathered in each case. The super
block timing statistics are used to provide default values for
the glock timing statistics, so that newly created glocks
should have, as far as possible, a sensible starting point.
The per-glock counters are initialised to zero when the
glock is created. The per-glock statistics are lost when
the glock is ejected from memory.

The statistics are divided into three pairs of mean and
variance, plus two counters. The mean/variance pairs are
smoothed exponential estimates and the algorithm used is
one which will be very familiar to those used to calculating
round trip times in network code. See "TCP/IP Illustrated,
Volume 1", W. Richard Stevens, sect 21.3, "Round-Trip Time Measurement",
p. 299 and onwards. Also, Volume 2, Sect. 25.10, p. 838 and onwards.
Unlike the TCP/IP Illustrated case, the mean and variance are
not scaled, but are in units of integer nanoseconds.

The three pairs of mean/variance measure the following
things:

 1. DLM lock time (non-blocking requests)
 2. DLM lock time (blocking requests)
 3. Inter-request time (again to the DLM)

A non-blocking request is one which will complete right
away, whatever the state of the DLM lock in question. That
currently means any requests when (a) the current state of
the lock is exclusive, i.e. a lock demotion (b) the requested
state is either null or unlocked (again, a demotion) or (c) the
"try lock" flag is set. A blocking request covers all the other
lock requests.

There are two counters.
The first is there primarily to show
how many lock requests have been made, and thus how much data
has gone into the mean/variance calculations. The other counter
is counting queuing of holders at the top layer of the glock
code. Hopefully that number will be a lot larger than the number
of dlm lock requests issued.

So why gather these statistics? There are several reasons
we'd like to get a better idea of these timings:

1. To be able to better set the glock "min hold time"
2. To spot performance issues more easily
3. To improve the algorithm for selecting resource groups for
   allocation (to base it on lock wait time, rather than blindly
   using a "try lock")

Due to the smoothing action of the updates, a step change in
some input quantity being sampled will only fully be taken
into account after 8 samples (or 4 for the variance) and this
needs to be carefully considered when interpreting the
results.

Knowing both the time it takes a lock request to complete and
the average time between lock requests for a glock means we
can compute the total percentage of the time for which the
node is able to use a glock vs. time that the rest of the
cluster has its share. That will be very useful when setting
the lock min hold time.
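
The smoothed mean/variance update described above can be sketched as
follows. This is a minimal illustration, not the GFS2 implementation: it
assumes gains of 1/8 for the mean and 1/4 for the variance (consistent with
the 8 and 4 sample figures quoted above), it tracks the variance as a
smoothed absolute deviation in the manner of the TCP round trip time
estimator, and the names ``SmoothedStat`` and ``update`` are hypothetical.

```python
# Sketch of an unscaled, integer smoothed mean/variance estimator.
# Assumptions (not taken from the GFS2 source): gain 1/8 for the mean,
# gain 1/4 for the variance, all values in integer nanoseconds.
from dataclasses import dataclass


@dataclass
class SmoothedStat:
    mean: int = 0  # smoothed mean, in nanoseconds
    var: int = 0   # smoothed deviation estimate, in nanoseconds

    def update(self, sample_ns: int) -> None:
        """Fold one new timing sample into the running estimates."""
        delta = sample_ns - self.mean
        # Move the mean one eighth of the way towards the sample.
        self.mean += delta >> 3
        # Move the variance one quarter of the way towards |delta|.
        self.var += (abs(delta) - self.var) >> 2


# Usage: feed a constant 800 ns sample into a fresh estimator and watch
# the mean approach the new level gradually rather than in one step.
stat = SmoothedStat()
for _ in range(8):
    stat.update(800)
```

Because each update only moves the estimates part of the way towards the
latest sample, a single unusually slow request has a limited effect, at the
cost of a delayed response to genuine step changes.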

Great care has been taken to ensure that we
measure exactly the quantities that we want, as accurately
as possible. There are always inaccuracies in any
measuring system, but I hope this is as accurate as we
can reasonably make it.

Per sb stats can be found here::

    /sys/kernel/debug/gfs2/<fsname>/sbstats

Per glock stats can be found here::

    /sys/kernel/debug/gfs2/<fsname>/glstats

This assumes that debugfs is mounted on /sys/kernel/debug and
that <fsname> is replaced with the name of the gfs2 filesystem
in question.

The abbreviations used in the output are as follows:

=========  ================================================================
srtt       Smoothed round trip time for non blocking dlm requests
srttvar    Variance estimate for srtt
srttb      Smoothed round trip time for (potentially) blocking dlm requests
srttvarb   Variance estimate for srttb
sirt       Smoothed inter request time (for dlm requests)
sirtvar    Variance estimate for sirt
dlm        Number of dlm requests made (dcnt in glstats file)
queue      Number of glock requests queued (qcnt in glstats file)
=========  ================================================================

The sbstats file contains a set of these stats for each glock type (so 8 lines
for each type) and for each cpu (one column per cpu).
The glstats file contains
a set of these stats for each glock in a similar format to the glocks file, but
using the format mean/variance for each of the timing stats.

The gfs2_glock_lock_time tracepoint prints out the current values of the stats
for the glock in question, along with some additional information on each dlm
reply that is received:

======  =======================================
status  The status of the dlm request
flags   The dlm request flags
tdiff   The time taken by this specific request
======  =======================================

(remaining fields as per above list)