.. SPDX-License-Identifier: GPL-2.0

===================================
File management in the Linux kernel
===================================

This document describes how locking for files (struct file)
and the file descriptor table (struct files_struct) works.

Up until 2.6.12, the file descriptor table had been protected
with a lock (files->file_lock) and a reference count (files->count).
->file_lock protected accesses to all the file-related fields
of the table. ->count was used for sharing the file descriptor
table between tasks cloned with the CLONE_FILES flag. Typically
this would be the case for POSIX threads. As with the common
refcounting model in the kernel, the last task doing
a put_files_struct() frees the file descriptor (fd) table.
The files (struct file) themselves are protected using a
reference count (->f_count).

In the new lock-free model of file descriptor management,
the reference counting is similar, but the locking is
based on RCU. The file descriptor table contains multiple
elements: the fd sets (open_fds and close_on_exec), the
array of file pointers, the sizes of the sets and the array,
etc. In order for the updates to appear atomic to
a lock-free reader, all the elements of the file descriptor
table are in a separate structure, struct fdtable.
files_struct contains a pointer to struct fdtable through
which the actual fd table is accessed.
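
To illustrate why gathering every fd-table field into one separately
reachable structure makes updates look atomic, here is a user-space C11
sketch. It assumes C11 ``<stdatomic.h>``; struct demo_fdtable, struct
demo_files and the demo_*() helpers are hypothetical simplifications, not
kernel APIs. An acquire load and a release store stand in for
rcu_dereference() and rcu_assign_pointer(), and RCU-deferred freeing of the
old table is deliberately not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct fdtable: every field a
 * lock-free reader needs lives in one separately reachable object. */
struct demo_fdtable {
	unsigned int max_fds;	/* number of slots in fd[] */
	void **fd;		/* slot i holds the "file" for descriptor i */
};

/* Hypothetical stand-in for files_struct. Readers only follow fdt. */
struct demo_files {
	_Atomic(struct demo_fdtable *) fdt;
	struct demo_fdtable fdtab;	/* small table embedded at creation */
	void *fd_array[2];		/* backing storage for fdtab */
};

static void demo_files_init(struct demo_files *files)
{
	memset(files->fd_array, 0, sizeof(files->fd_array));
	files->fdtab.max_fds = 2;
	files->fdtab.fd = files->fd_array;
	atomic_init(&files->fdt, &files->fdtab);
}

/* Reader: one acquire load (roughly the guarantee rcu_dereference()
 * gives) yields a table whose max_fds and fd[] always belong together. */
static void *demo_lookup(struct demo_files *files, unsigned int fd)
{
	struct demo_fdtable *fdt =
		atomic_load_explicit(&files->fdt, memory_order_acquire);

	return fd < fdt->max_fds ? fdt->fd[fd] : NULL;
}

/* Writer, assumed to be serialized by a lock such as files->file_lock:
 * build the bigger table completely, then publish it with one release
 * store (roughly the guarantee rcu_assign_pointer() gives). Returns the
 * old table, or NULL if allocation fails. */
static struct demo_fdtable *demo_expand(struct demo_files *files,
					unsigned int new_max)
{
	struct demo_fdtable *old =
		atomic_load_explicit(&files->fdt, memory_order_relaxed);
	struct demo_fdtable *nt;

	assert(new_max > old->max_fds);
	nt = malloc(sizeof(*nt));
	if (!nt)
		return NULL;
	nt->fd = calloc(new_max, sizeof(*nt->fd));
	if (!nt->fd) {
		free(nt);
		return NULL;
	}
	memcpy(nt->fd, old->fd, old->max_fds * sizeof(*nt->fd));
	nt->max_fds = new_max;

	atomic_store_explicit(&files->fdt, nt, memory_order_release);
	return old;
}

/* Single-threaded teardown: free the current table unless it is embedded. */
static void demo_files_destroy(struct demo_files *files)
{
	struct demo_fdtable *fdt = atomic_load(&files->fdt);

	if (fdt != &files->fdtab) {
		free(fdt->fd);
		free(fdt);
	}
}
```

Because a reader loads the table pointer once and then uses only that table,
it can never pair the old max_fds with the new fd array or vice versa. What
the sketch leaves out is when the old table may safely be reclaimed while
readers might still hold it; that is the part RCU supplies in the kernel.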

Initially the fdtable is embedded in files_struct itself. On a subsequent
expansion of the fdtable, a new fdtable structure is allocated
and files->fdt is updated to point to the new structure. The old fdtable
structure is freed with RCU, and lock-free readers see either
the old fdtable or the new fdtable, making the update
appear atomic. Here are the locking rules for
the fdtable structure:

1. All references to the fdtable must be done through
   the files_fdtable() macro::

      struct fdtable *fdt;

      rcu_read_lock();

      fdt = files_fdtable(files);
      ....
      if (n <= fdt->max_fds)
          ....
      ...
      rcu_read_unlock();

   files_fdtable() uses the rcu_dereference() macro, which takes care of
   the memory barrier requirements for lock-free dereference.
   The fdtable pointer must be read within the read-side
   critical section.

2. Reading of the fdtable as described above must be protected
   by rcu_read_lock()/rcu_read_unlock().

3. For any update to the fd table, files->file_lock must
   be held.

4. To look up the file structure given an fd, a reader
   must use either the fcheck() or the fcheck_files() API. These
   take care of barrier requirements due to lock-free lookup.

   An example::

      struct file *file;

      rcu_read_lock();
      file = fcheck(fd);
      if (file) {
          ...
      }
      ....
      rcu_read_unlock();

5. Handling of the file structures is special. Since the look-up
   of the fd (fget()/fget_light()) is lock-free, it is possible
   that the look-up may race with the last put() operation on the
   file structure. This is avoided using atomic_long_inc_not_zero()
   on ->f_count::

      rcu_read_lock();
      file = fcheck_files(files, fd);
      if (file) {
          if (atomic_long_inc_not_zero(&file->f_count))
              *fput_needed = 1;
          else
              /* Didn't get the reference, someone's freed */
              file = NULL;
      }
      rcu_read_unlock();
      ....
      return file;

   atomic_long_inc_not_zero() detects if the refcount is already zero or
   goes to zero during the increment. If it does, we fail
   fget()/fget_light().

6. Since both fdtable and file structures can be looked up
   lock-free, they must be installed using the rcu_assign_pointer()
   API. If they are looked up lock-free, rcu_dereference()
   must be used. However, it is advisable to use files_fdtable()
   and fcheck()/fcheck_files(), which take care of these issues.

7. While updating, the fdtable pointer must be looked up while
   holding files->file_lock. If ->file_lock is dropped, then
   another thread can expand the files, thereby creating a new
   fdtable and making the earlier fdtable pointer stale.

   For example::

      spin_lock(&files->file_lock);
      fd = locate_fd(files, file, start);
      if (fd >= 0) {
          /* locate_fd() may have expanded fdtable, load the ptr */
          fdt = files_fdtable(files);
          __set_open_fd(fd, fdt);
          __clear_close_on_exec(fd, fdt);
          spin_unlock(&files->file_lock);
          .....
      }

   Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
   the fdtable pointer (fdt) must be loaded after locate_fd().
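
Rule 5 hinges on atomic_long_inc_not_zero() never reviving a reference count
that has already reached zero. The following user-space C11 sketch models
that behaviour with a compare-and-exchange loop. struct demo_file,
demo_inc_not_zero(), demo_get() and demo_put() are hypothetical names, not
kernel APIs, and the RCU read-side protection that keeps the object's memory
valid during the lookup is assumed rather than modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file: only the refcount matters here. */
struct demo_file {
	atomic_long f_count;
};

/* Sketch of the semantics of atomic_long_inc_not_zero(): add one to *v
 * unless it is zero, and report whether the increment happened. */
static bool demo_inc_not_zero(atomic_long *v)
{
	long old = atomic_load_explicit(v, memory_order_relaxed);

	do {
		if (old == 0)
			return false;	/* last reference already dropped */
		/* On failure, old is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak_explicit(v, &old, old + 1,
							 memory_order_seq_cst,
							 memory_order_relaxed));
	return true;
}

/* Mirrors the shape of the rule 5 lookup: take a reference only if the
 * object is still live, otherwise behave as if the fd were not found. */
static struct demo_file *demo_get(struct demo_file *file)
{
	if (file && demo_inc_not_zero(&file->f_count))
		return file;
	return NULL;
}

/* Drop a reference; returns true when the caller dropped the last one
 * and would therefore be responsible for freeing the object. */
static bool demo_put(struct demo_file *file)
{
	long old = atomic_fetch_sub_explicit(&file->f_count, 1,
					     memory_order_acq_rel);

	assert(old > 0);	/* putting an already-dead object is a bug */
	return old == 1;
}
```

A plain atomic increment would not work here: it would move a count of zero
back to one and hand out a reference to a file that is already being torn
down. The compare-and-exchange only succeeds if the count did not change
between the load and the update, so a racing final put() makes the loop
re-read zero and give up, which corresponds to the fget()/fget_light()
failure described in rule 5.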