Direct Access for files
-----------------------

Motivation
----------

The page cache is usually used to buffer reads and writes to files.
It is also used to provide the pages which are mapped into userspace
by a call to mmap.

For block devices that are memory-like, the page cache pages would be
unnecessary copies of the original storage.  The DAX code removes the
extra copy by performing reads and writes directly to the storage device.
For file mappings, the storage device is mapped directly into userspace.


Usage
-----

If you have a block device which supports DAX, you can make a filesystem
on it as usual.  The DAX code currently only supports files with a block
size equal to your kernel's PAGE_SIZE, so you may need to specify a block
size when creating the filesystem.

Currently 3 filesystems support DAX: ext2, ext4 and xfs.  Enabling DAX on them
is different.

Enabling DAX on ext2
--------------------

When mounting the filesystem, use the "-o dax" option on the command line or
add 'dax' to the options in /etc/fstab.  This works to enable DAX on all files
within the filesystem.  It is equivalent to the '-o dax=always' behavior below.


Enabling DAX on xfs and ext4
----------------------------

Summary
-------

 1. There exists an in-kernel file access mode flag S_DAX that corresponds to
    the statx flag STATX_ATTR_DAX.  See the manpage for statx(2) for details
    about this access mode.

 2. There exists a persistent flag FS_XFLAG_DAX that can be applied to regular
    files and directories.  This advisory flag can be set or cleared at any
    time, but doing so does not immediately affect the S_DAX state.

 3. If the persistent FS_XFLAG_DAX flag is set on a directory, this flag will
    be inherited by all regular files and subdirectories that are subsequently
    created in this directory.  Files and subdirectories that exist at the time
    this flag is set or cleared on the parent directory are not modified by
    this modification of the parent directory.

 4. There exist dax mount options which can override FS_XFLAG_DAX in the
    setting of the S_DAX flag.  Given underlying storage which supports DAX the
    following hold:

    "-o dax=inode"  means "follow FS_XFLAG_DAX" and is the default.

    "-o dax=never"  means "never set S_DAX, ignore FS_XFLAG_DAX."

    "-o dax=always" means "always set S_DAX, ignore FS_XFLAG_DAX."

    "-o dax"        is a legacy option which is an alias for "dax=always".
                    This may be removed in the future so "-o dax=always" is
                    the preferred method for specifying this behavior.

    NOTE: Modifications to and the inheritance behavior of FS_XFLAG_DAX remain
    the same even when the filesystem is mounted with a dax option.  However,
    in-core inode state (S_DAX) will be overridden until the filesystem is
    remounted with dax=inode and the inode is evicted from kernel memory.

 5. The S_DAX policy can be changed via:

    a) Setting the parent directory FS_XFLAG_DAX as needed before files are
       created

    b) Setting the appropriate dax="foo" mount option

    c) Changing the FS_XFLAG_DAX flag on existing regular files and
       directories.  This has runtime constraints and limitations that are
       described in 6) below.

 6. When changing the S_DAX policy via toggling the persistent FS_XFLAG_DAX flag,
    the change in behaviour for existing regular files may not occur
    immediately.  If the change must take effect immediately, the administrator
    needs to:

    a) stop the application so there are no active references to the data set
       the policy change will affect

    b) evict the data set from kernel caches so it will be re-instantiated when
       the application is restarted.  This can be achieved by:

       i.   drop-caches
       ii.  a filesystem unmount and mount cycle
       iii. a system reboot


Details
-------

There are 2 per-file dax flags.  One is a persistent inode setting (FS_XFLAG_DAX)
and the other is a volatile flag indicating the active state of the feature
(S_DAX).

FS_XFLAG_DAX is preserved within the filesystem.  This persistent config
setting can be set, cleared and/or queried using the FS_IOC_FS[GS]ETXATTR ioctl
(see ioctl_xfs_fsgetxattr(2)) or a utility such as 'xfs_io'.

New files and directories automatically inherit FS_XFLAG_DAX from
their parent directory _when_ _created_.  Therefore, setting FS_XFLAG_DAX at
directory creation time can be used to set a default behavior for an entire
sub-tree.
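
As a minimal userspace sketch of the FS_IOC_FS[GS]ETXATTR interface mentioned
above, the following C fragment queries and toggles FS_XFLAG_DAX on an open
file descriptor.  The helper names (dax_xflags_apply, dax_xflag_get,
dax_xflag_set) are made up for illustration and are not part of any kernel or
library API; only struct fsxattr, the two ioctl numbers and FS_XFLAG_DAX come
from <linux/fs.h>.  The set helper does a read-modify-write so that unrelated
xflags are preserved.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR, FS_IOC_FSSETXATTR */

#ifndef FS_XFLAG_DAX
#define FS_XFLAG_DAX 0x00008000   /* fallback for older UAPI headers */
#endif

/* Hypothetical helper: return xflags with FS_XFLAG_DAX set or cleared. */
uint32_t dax_xflags_apply(uint32_t xflags, int enable)
{
    if (enable)
        return xflags | FS_XFLAG_DAX;
    return xflags & ~(uint32_t)FS_XFLAG_DAX;
}

/* Hypothetical helper: 1 if FS_XFLAG_DAX is set, 0 if clear, -1 on error. */
int dax_xflag_get(int fd)
{
    struct fsxattr fsx;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        return -1;
    return (fsx.fsx_xflags & FS_XFLAG_DAX) ? 1 : 0;
}

/*
 * Hypothetical helper: set or clear FS_XFLAG_DAX, keeping the other
 * xflags as they are.  Returns 0 on success, -1 on error.
 */
int dax_xflag_set(int fd, int enable)
{
    struct fsxattr fsx;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        return -1;
    fsx.fsx_xflags = dax_xflags_apply(fsx.fsx_xflags, enable);
    if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
        return -1;
    return 0;
}
```

Calling dax_xflag_set(fd, 1) on a directory before populating it corresponds to
the "set a default for a sub-tree" case.  Changing the persistent flag this way
does not by itself change the S_DAX state of an inode that is already in memory.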

To clarify inheritance, here are 3 examples:

Example A:

mkdir -p a/b/c
xfs_io -c 'chattr +x' a
mkdir a/b/c/d
mkdir a/e

        dax: a,e
        no dax: b,c,d

Example B:

mkdir a
xfs_io -c 'chattr +x' a
mkdir -p a/b/c/d

        dax: a,b,c,d
        no dax:

Example C:

mkdir -p a/b/c
xfs_io -c 'chattr +x' a/b/c
mkdir a/b/c/d

        dax: c,d
        no dax: a,b


The current enabled state (S_DAX) is set when a file inode is instantiated in
memory by the kernel.  It is set based on the underlying media support, the
value of FS_XFLAG_DAX and the filesystem's dax mount option.

statx can be used to query S_DAX.  NOTE that only regular files will ever have
S_DAX set and therefore statx will never indicate that S_DAX is set on
directories.

Setting the FS_XFLAG_DAX flag (specifically or through inheritance) occurs even
if the underlying media does not support dax and/or the filesystem is
overridden with a mount option.
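
The rules above can be sketched in C as follows.  The first function is only a
userspace model of how S_DAX is derived from media support, the mount option
and FS_XFLAG_DAX; it is not the kernel's code, and the enum and function names
are invented for this example.  The second function queries the actual state of
a path through statx.  It issues the raw system call and takes struct statx
from the kernel UAPI header, on the assumption that the C library may not
provide a statx() wrapper.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* makes syscall() visible under strict -std= modes */
#endif
#include <fcntl.h>            /* AT_FDCWD */
#include <unistd.h>           /* syscall() */
#include <sys/syscall.h>      /* SYS_statx */
#include <linux/stat.h>       /* struct statx, STATX_BASIC_STATS */

#ifndef STATX_ATTR_DAX
#define STATX_ATTR_DAX 0x00200000   /* fallback for older UAPI headers */
#endif

/* Hypothetical names for the three dax mount modes described above. */
enum dax_mount_mode { DAX_MOUNT_INODE, DAX_MOUNT_NEVER, DAX_MOUNT_ALWAYS };

/*
 * Model only: the S_DAX value one would expect for a freshly instantiated
 * inode.  Directories and media without DAX support never get S_DAX.
 */
int expected_s_dax(int is_regular_file, int media_supports_dax,
                   enum dax_mount_mode mode, int xflag_dax)
{
    if (!is_regular_file || !media_supports_dax)
        return 0;
    switch (mode) {
    case DAX_MOUNT_ALWAYS:
        return 1;
    case DAX_MOUNT_NEVER:
        return 0;
    default:                  /* dax=inode: follow FS_XFLAG_DAX */
        return xflag_dax ? 1 : 0;
    }
}

/* Returns 1 if statx reports STATX_ATTR_DAX for path, 0 if not, -1 on error. */
int file_is_dax(const char *path)
{
    struct statx stx;

    if (syscall(SYS_statx, AT_FDCWD, path, 0, STATX_BASIC_STATS, &stx) < 0)
        return -1;
    return (stx.stx_attributes & STATX_ATTR_DAX) ? 1 : 0;
}
```

Comparing file_is_dax() with expected_s_dax() after changing FS_XFLAG_DAX is one
way to notice that an inode is still cached with its old S_DAX state.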


Implementation Tips for Block Driver Writers
--------------------------------------------

To support DAX in your block driver, implement the 'direct_access'
block device operation.  It is used to translate the sector number
(expressed in units of 512-byte sectors) to a page frame number (pfn)
that identifies the physical page for the memory.  It also returns a
kernel virtual address that can be used to access the memory.

The direct_access method takes a 'size' parameter that indicates the
number of bytes being requested.  The function should return the number
of bytes that can be contiguously accessed at that offset.  It may also
return a negative errno if an error occurs.

In order to support this method, the storage must be byte-accessible by
the CPU at all times.  If your device uses paging techniques to expose
a large amount of memory through a smaller window, then you cannot
implement direct_access.  Equally, if your device can occasionally
stall the CPU for an extended period, you should also not attempt to
implement direct_access.

These block devices may be used for inspiration:
- brd: RAM backed block device driver
- dcssblk: s390 dcss block device driver
- pmem: NVDIMM persistent memory driver


Implementation Tips for Filesystem Writers
------------------------------------------

Filesystem support consists of
- adding support to mark inodes as being DAX by setting the S_DAX flag in
  i_flags
- implementing ->read_iter and ->write_iter operations which use dax_iomap_rw()
  when inode has S_DAX flag set
- implementing an mmap file operation for DAX files which sets the
  VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
  include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite.  These
  handlers should probably call dax_iomap_fault() passing the appropriate
  fault size and iomap operations.
- calling iomap_zero_range() passing appropriate iomap operations instead of
  block_truncate_page() for DAX files
- ensuring that there is sufficient locking between reads, writes,
  truncates and page faults

The iomap handlers for allocating blocks must make sure that allocated blocks
are zeroed out and converted to written extents before being returned to avoid
exposure of uninitialized data through mmap.

These filesystems may be used for inspiration:
- ext2: see Documentation/filesystems/ext2.rst
- ext4: see Documentation/filesystems/ext4/
- xfs:  see Documentation/admin-guide/xfs.rst


Handling Media Errors
---------------------

The libnvdimm subsystem stores a record of known media error locations for
each pmem block device (in gendisk->badblocks).  If we fault at such a location,
or one with a latent error not yet discovered, the application can expect
to receive a SIGBUS.  Libnvdimm also allows clearing of these errors by simply
writing the affected sectors (through the pmem driver, and if the underlying
NVDIMM supports the clear_poison DSM defined by ACPI).

Since DAX IO normally doesn't go through the driver/bio path, applications or
sysadmins have an option to restore the lost data from a prior backup/inbuilt
redundancy in the following ways:

1. Delete the affected file, and restore from a backup (sysadmin route):
   This will free the filesystem blocks that were being used by the file,
   and the next time they're allocated, they will be zeroed first, which
   happens through the driver, and will clear bad sectors.

2. Truncate or hole-punch the part of the file that has a bad-block (at least
   an entire aligned sector has to be hole-punched, but not necessarily an
   entire filesystem block).
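
The hole-punch route in option 2 can be sketched from userspace with
fallocate(2).  This is only an illustration under stated assumptions: the
helper names are invented, the 512-byte sector size is taken from the sector
unit used elsewhere in this document, and the range is widened outward to
sector boundaries so that at least one whole aligned sector is punched.  Any
data inside the widened range is discarded and has to be restored by the
application afterwards.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* exposes fallocate() in <fcntl.h> */
#endif
#include <fcntl.h>            /* fallocate() */
#include <linux/falloc.h>     /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <sys/types.h>        /* off_t */

#define DAX_SECTOR_SIZE 512   /* assumption: 512-byte sectors */

/*
 * Hypothetical helper: widen [off, off + len) outward to sector
 * boundaries.  Returns 0 on success, -1 for an invalid range.
 */
int sector_align_range(off_t off, off_t len, off_t *aoff, off_t *alen)
{
    off_t start, end;

    if (off < 0 || len <= 0)
        return -1;
    start = off - (off % DAX_SECTOR_SIZE);              /* round down */
    end = off + len;
    end += (DAX_SECTOR_SIZE - (end % DAX_SECTOR_SIZE))
           % DAX_SECTOR_SIZE;                           /* round up */
    *aoff = start;
    *alen = end - start;
    return 0;
}

/*
 * Hypothetical helper: punch a hole over the sectors covering the bad
 * range, keeping the file size unchanged.  Returns 0 or -1 (errno set).
 */
int punch_bad_range(int fd, off_t off, off_t len)
{
    off_t aoff, alen;

    if (sector_align_range(off, len, &aoff, &alen) < 0)
        return -1;
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     aoff, alen);
}
```

For example, a bad range of 100 bytes at offset 700 is widened to the single
sector starting at offset 512.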

These are the two basic paths that allow DAX filesystems to continue operating
in the presence of media errors.  More robust error recovery mechanisms can be
built on top of this in the future, for example, involving redundancy/mirroring
provided at the block layer through DM, or additionally, at the filesystem
level.  These would have to rely on the above two tenets, that error clearing
can happen either by sending an IO through the driver, or zeroing (also through
the driver).


Shortcomings
------------

Even if the kernel or its modules are stored on a filesystem that supports
DAX on a block device that supports DAX, they will still be copied into RAM.

The DAX code does not work correctly on architectures which have virtually
mapped caches such as ARM, MIPS and SPARC.

Calling get_user_pages() on a range of user memory that has been mmapped
from a DAX file will fail when there is no 'struct page' to describe
those pages.  This problem has been addressed in some device drivers
by adding optional struct page support for pages under the control of
the driver (see CONFIG_NVDIMM_PFN in drivers/nvdimm for an example of
how to do this).  In the non struct page cases O_DIRECT reads/writes to
those memory ranges from a non-DAX file will fail (note that O_DIRECT
reads/writes _of a DAX file_ do work, it is the memory that is being
accessed that is key here).  Other things that will not work in the
non struct page case include RDMA, sendfile() and splice().