.. SPDX-License-Identifier: GPL-2.0

=====================================
Asynchronous Transfers/Transforms API
=====================================

.. Contents

  1. INTRODUCTION

  2. GENEALOGY

  3. USAGE
  3.1 General format of the API
  3.2 Supported operations
  3.3 Descriptor management
  3.4 When does the operation execute?
  3.5 When does the operation complete?
  3.6 Constraints
  3.7 Example

  4. DMAENGINE DRIVER DEVELOPER NOTES
  4.1 Conformance points
  4.2 "My application needs exclusive control of hardware channels"

  5. SOURCE

1. Introduction
===============

The async_tx API provides methods for describing a chain of asynchronous
bulk memory transfers/transforms with support for inter-transactional
dependencies. It is implemented as a dmaengine client that smooths over
the details of different hardware offload engine implementations. Code
that is written to the API can optimize for asynchronous operation and
the API will fit the chain of operations to the available offload
resources.

2. Genealogy
============

The API was initially designed to offload the memory copy and
xor-parity-calculations of the md-raid5 driver using the offload engines
present in the Intel(R) Xscale series of I/O processors. It also built
on the 'dmaengine' layer developed for offloading memory copies in the
network stack using Intel(R) I/OAT engines. The following design
features surfaced as a result:

1. implicit synchronous path: users of the API do not need to know if
   the platform they are running on has offload capabilities. The
   operation will be offloaded when an engine is available and carried
   out in software otherwise.
2. cross channel dependency chains: the API allows a chain of dependent
   operations to be submitted, like xor->copy->xor in the raid5 case. The
   API automatically handles cases where the transition from one operation
   to another implies a hardware channel switch.
3. dmaengine extensions to support multiple clients and operation types
   beyond 'memcpy'.

3. Usage
========

3.1 General format of the API
-----------------------------

::

  struct dma_async_tx_descriptor *
  async_<operation>(<op specific parameters>, struct async_submit_ctl *submit)

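For example, a plain memory copy maps onto this template roughly as
sketched below. This is a minimal illustration, not a definitive recipe:
the helper name do_one_copy is hypothetical and the caller is assumed to
supply the source and destination pages. The routines it uses
(init_async_submit, async_memcpy, async_tx_issue_pending_all and
dma_wait_for_async_tx) are described in the sections that follow::

  #include <linux/async_tx.h>

  /* hypothetical helper: copy 'len' bytes from one page to another */
  static void do_one_copy(struct page *dest, struct page *src, size_t len)
  {
          struct async_submit_ctl submit;
          struct dma_async_tx_descriptor *tx;

          /* no dependency, no callback; set ASYNC_TX_ACK since nothing
           * will be chained to this descriptor (see section 3.3), and no
           * addr_conv scratch area is needed for a single copy
           */
          init_async_submit(&submit, ASYNC_TX_ACK, NULL, NULL, NULL, NULL);
          tx = async_memcpy(dest, src, 0, 0, len, &submit);

          /* hand any batched descriptors to the hardware (section 3.4) */
          async_tx_issue_pending_all();

          /* poll for completion of the copy (section 3.5) */
          dma_wait_for_async_tx(tx);
  }
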
3.2 Supported operations
------------------------

======== ======================================================================
memcpy   memory copy between a source and a destination buffer
memset   fill a destination buffer with a byte value
xor      xor a series of source buffers and write the result to a
         destination buffer
xor_val  xor a series of source buffers and set a flag if the
         result is zero. The implementation attempts to prevent
         writes to memory
pq       generate the p+q (raid6 syndrome) from a series of source buffers
pq_val   validate that p and/or q buffers are in sync with a given series of
         sources
datap    (raid6_datap_recov) recover a raid6 data block and the p block
         from the given sources
2data    (raid6_2data_recov) recover 2 raid6 data blocks from the given
         sources
======== ======================================================================

3.3 Descriptor management
-------------------------

The return value is non-NULL and points to a 'descriptor' when the operation
has been queued to execute asynchronously. Descriptors are recycled
resources, under control of the offload engine driver, to be reused as
operations complete. When an application needs to submit a chain of
operations it must guarantee that the descriptor is not automatically recycled
before the dependency is submitted. This requires that all descriptors be
acknowledged by the application before the offload engine driver is allowed to
recycle (or free) the descriptor. A descriptor can be acknowledged by one of
the following methods (a sketch of these rules in practice follows the list):

1. setting the ASYNC_TX_ACK flag if no child operations are to be submitted.
2. submitting an unacknowledged descriptor as a dependency to another
   async_tx call, which implicitly sets the acknowledged state.
3. calling async_tx_ack() on the descriptor.

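For illustration, the sketch below chains a xor to a copy and notes where
each acknowledgement rule applies. It is a hedged example only: the helper
name xor_then_copy and its parameters are hypothetical, and error handling
is omitted. The flags and calls are the same ones used by the example in
section 3.7::

  #include <linux/async_tx.h>

  /* hypothetical helper: xor a set of sources into xor_dest, then copy
   * the result to copy_dest, applying the acknowledgement rules above
   */
  static void xor_then_copy(struct page *xor_dest, struct page **xor_srcs,
                            int src_cnt, struct page *copy_dest, size_t len,
                            addr_conv_t *addr_conv)
  {
          struct async_submit_ctl submit;
          struct dma_async_tx_descriptor *tx;

          /* rule 1 does not apply here: leave the xor descriptor
           * unacknowledged because a dependent operation follows
           */
          init_async_submit(&submit, ASYNC_TX_XOR_DROP_DST, NULL, NULL, NULL,
                            addr_conv);
          tx = async_xor(xor_dest, xor_srcs, 0, src_cnt, len, &submit);

          /* rule 2: submitting tx as a dependency implicitly acks it */
          init_async_submit(&submit, 0, tx, NULL, NULL, addr_conv);
          tx = async_memcpy(copy_dest, xor_dest, 0, 0, len, &submit);

          /* rule 3: nothing depends on the copy, so ack it explicitly
           * (passing ASYNC_TX_ACK above would have had the same effect)
           */
          async_tx_ack(tx);

          async_tx_issue_pending_all();
  }
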
3.4 When does the operation execute?
------------------------------------

Operations do not immediately issue after return from the
async_<operation> call. Offload engine drivers batch operations to
improve performance by reducing the number of MMIO cycles needed to
manage the channel. Once a driver-specific threshold is met the driver
automatically issues pending operations. An application can force this
event by calling async_tx_issue_pending_all(). This operates on all
channels since the application has no knowledge of the channel-to-operation
mapping.

3.5 When does the operation complete?
-------------------------------------

There are two methods for an application to learn about the completion
of an operation.

1. Call dma_wait_for_async_tx(). This call causes the CPU to spin while
   it polls for the completion of the operation. It handles dependency
   chains and issuing pending operations.
2. Specify a completion callback. The callback routine runs in tasklet
   context if the offload engine driver supports interrupts, or it is
   called in application context if the operation is carried out
   synchronously in software. The callback can be set in the call to
   async_<operation>, or when the application needs to submit a chain of
   unknown length it can use the async_trigger_callback() routine to set a
   completion interrupt/callback at the end of the chain (a sketch follows
   this list).

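A rough sketch of the second method, for a chain whose length is not
known up front, is shown below. It assumes 'tx' is the last descriptor
returned while building the chain and reuses the callback()/struct
completion pattern from the example in section 3.7; the helper name
wait_for_chain is hypothetical::

  #include <linux/async_tx.h>
  #include <linux/completion.h>

  /* hypothetical helper: arrange a completion callback at the end of a
   * chain whose last descriptor is 'tx', then wait for it
   */
  static void wait_for_chain(struct dma_async_tx_descriptor *tx)
  {
          struct async_submit_ctl submit;
          struct completion cmp;

          init_completion(&cmp);

          /* append an interrupt/callback descriptor that depends on tx */
          init_async_submit(&submit, ASYNC_TX_ACK, tx, callback, &cmp, NULL);
          async_trigger_callback(&submit);

          async_tx_issue_pending_all();
          wait_for_completion(&cmp);
  }
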
3.6 Constraints
---------------

1. Calls to async_<operation> are not permitted in IRQ context. Other
   contexts are permitted provided constraint #2 is not violated.
2. Completion callback routines cannot submit new operations. This
   results in recursion in the synchronous case and spin_locks being
   acquired twice in the asynchronous case.

3.7 Example
-----------

Perform a xor->copy->xor operation where each operation depends on the
result from the previous operation::

  void callback(void *param)
  {
          struct completion *cmp = param;

          complete(cmp);
  }

  void run_xor_copy_xor(struct page **xor_srcs,
                        int xor_src_cnt,
                        struct page *xor_dest,
                        size_t xor_len,
                        struct page *copy_src,
                        struct page *copy_dest,
                        size_t copy_len)
  {
          struct dma_async_tx_descriptor *tx;
          struct async_submit_ctl submit;
          addr_conv_t addr_conv[xor_src_cnt];
          struct completion cmp;

          /* first xor: no dependency; leave the descriptor unacked so
           * the copy can be chained to it
           */
          init_async_submit(&submit, ASYNC_TX_XOR_DROP_DST, NULL, NULL, NULL,
                            addr_conv);
          tx = async_xor(xor_dest, xor_srcs, 0, xor_src_cnt, xor_len, &submit);

          /* chain the copy to the first xor */
          submit.depend_tx = tx;
          tx = async_memcpy(copy_dest, copy_src, 0, 0, copy_len, &submit);

          /* final xor: depends on the copy, acks the chain and signals
           * completion via callback()
           */
          init_completion(&cmp);
          init_async_submit(&submit, ASYNC_TX_XOR_DROP_DST | ASYNC_TX_ACK, tx,
                            callback, &cmp, addr_conv);
          tx = async_xor(xor_dest, xor_srcs, 0, xor_src_cnt, xor_len, &submit);

          async_tx_issue_pending_all();

          wait_for_completion(&cmp);
  }

See include/linux/async_tx.h for more information on the flags. See the
ops_run_* and ops_complete_* routines in drivers/md/raid5.c for more
implementation examples.

4. Driver Development Notes
===========================

4.1 Conformance points
----------------------

There are a few conformance points required in dmaengine drivers to
accommodate assumptions made by applications using the async_tx API:

1. Completion callbacks are expected to happen in tasklet context.
2. dma_async_tx_descriptor fields are never manipulated in IRQ context.
3. Use async_tx_run_dependencies() in the descriptor cleanup path to
   handle submission of dependent operations.

4.2 "My application needs exclusive control of hardware channels"
------------------------------------------------------------------

Primarily this requirement arises from cases where a DMA engine driver
is being used to support device-to-memory operations. A channel that is
performing these operations cannot, for many platform specific reasons,
be shared. For these cases the dma_request_channel() interface is
provided.

The interface is::

  struct dma_chan *dma_request_channel(dma_cap_mask_t mask,
                                       dma_filter_fn filter_fn,
                                       void *filter_param);

Where dma_filter_fn is defined as::

  typedef bool (*dma_filter_fn)(struct dma_chan *chan, void *filter_param);

When the optional 'filter_fn' parameter is set to NULL,
dma_request_channel simply returns the first channel that satisfies the
capability mask. Otherwise, when the mask parameter is insufficient for
specifying the necessary channel, the filter_fn routine can be used to
select from the available channels in the system. The filter_fn routine
is called once for each free channel in the system. Upon seeing a
suitable channel, filter_fn returns 'true', which flags that channel to
be the return value from dma_request_channel. A channel allocated via
this interface is exclusive to the caller until dma_release_channel()
is called.

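As a rough sketch of this interface, the fragment below requests an
exclusive DMA_MEMCPY-capable channel using a hypothetical filter that
only accepts channels belonging to a given device; the helper names
my_filter and grab_private_chan are illustrative only::

  #include <linux/dmaengine.h>

  /* hypothetical filter: accept only channels on the device passed in
   * as filter_param
   */
  static bool my_filter(struct dma_chan *chan, void *filter_param)
  {
          return chan->device->dev == filter_param;
  }

  static struct dma_chan *grab_private_chan(struct device *dev)
  {
          dma_cap_mask_t mask;
          struct dma_chan *chan;

          dma_cap_zero(mask);
          dma_cap_set(DMA_MEMCPY, mask);

          /* exclusive to the caller until dma_release_channel(chan) */
          chan = dma_request_channel(mask, my_filter, dev);

          return chan;    /* NULL if no suitable channel was found */
  }

The channel, when one is found, must eventually be returned to the system
with dma_release_channel().
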
The DMA_PRIVATE capability flag is used to tag dma devices that should
not be used by the general-purpose allocator. It can be set at
initialization time if it is known that a channel will always be
private. Alternatively, it is set when dma_request_channel() finds an
unused "public" channel.

A couple of caveats to note when implementing a driver and consumer:

1. Once a channel has been privately allocated it will no longer be
   considered by the general-purpose allocator even after a call to
   dma_release_channel().
2. Since capabilities are specified at the device level a dma_device
   with multiple channels will either have all channels public, or all
   channels private.

5. Source
=========

include/linux/dmaengine.h:
    core header file for DMA drivers and api users
drivers/dma/dmaengine.c:
    offload engine channel management routines
drivers/dma/:
    location for offload engine drivers
include/linux/async_tx.h:
    core header file for the async_tx api
crypto/async_tx/async_tx.c:
    async_tx interface to dmaengine and common code
crypto/async_tx/async_memcpy.c:
    copy offload
crypto/async_tx/async_xor.c:
    xor and xor zero sum offload