=========
dm-switch
=========

The device-mapper switch target creates a device that supports an
arbitrary mapping of fixed-size regions of I/O across a fixed set of
paths.  The path used for any specific region can be switched
dynamically by sending the target a message.

It maps I/O to underlying block devices efficiently when there is a large
number of fixed-size address regions but there is no simple pattern
that would allow for a compact representation of the mapping such as
dm-stripe.

Background
----------

Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture.  In this architecture, the storage group
consists of a number of distinct storage arrays ("members") each having
independent controllers, disk storage and network adapters.  When a LUN
is created it is spread across multiple members.  The details of the
spreading are hidden from initiators connected to this storage system.
The storage group exposes a single target discovery portal, no matter
how many members are being used.  When iSCSI sessions are created, each
session is connected to an eth port on a single member.  Data to a LUN
can be sent on any iSCSI session, and if the blocks being accessed are
stored on another member the I/O will be forwarded as required.  This
forwarding is invisible to the initiator.
The storage layout is also
dynamic, and the blocks stored on disk may be moved from member to
member as needed to balance the load.

This architecture simplifies the management and configuration of both
the storage group and initiators.  In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth.  An initiator could use a simple round
robin algorithm to send I/O across all paths and let the storage array
members forward it as necessary, but there is a performance advantage to
sending data directly to the correct member.

A device-mapper table already lets you map different regions of a
device onto different targets.  However, in this architecture the LUN is
spread with an address region size on the order of tens of megabytes,
which means the resulting table could have more than a million entries
and consume far too much memory.

Using this device-mapper switch target we can now build a two-layer
device hierarchy:

    Upper Tier - Determine which array member the I/O should be sent to.
    Lower Tier - Load balance amongst paths to a particular member.

The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths.  We also build a
non-preferred priority group containing paths to other array members for
failover reasons.

The upper tier consists of a single dm-switch device.  This device uses
a bitmap to look up the location of the I/O and choose the appropriate
lower tier device to route the I/O.  By using a bitmap we are able to
use 4 bits for each address range in a 16 member group (which is very
large for us).  This is a much denser representation than the dm table
b-tree can achieve.

Construction Parameters
=======================

    <num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+
        <num_paths>
            The number of paths across which to distribute the I/O.

        <region_size>
            The number of 512-byte sectors in a region. Each region can be redirected
            to any of the available paths.

        <num_optional_args>
            The number of optional arguments. Currently, no optional arguments
            are supported and so this must be zero.

        <dev_path>
            The block device that represents a specific path to the device.

        <offset>
            The offset of the start of data on the specific <dev_path> (in units
            of 512-byte sectors).
            This number is added to the sector number when
            forwarding the request to the specific path. Typically it is zero.

Messages
========

set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...

Modify the region table by specifying which regions are redirected to
which paths.

<index>
    The region number (region size was specified in constructor parameters).
    If index is omitted, the next region (previous index + 1) is used.
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

<path_nr>
    The path number in the range 0 ... (<num_paths> - 1).
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

R<n>,<m>
    This parameter allows repetitive patterns to be loaded quickly. <n> and <m>
    are hexadecimal numbers. The last <n> mappings are repeated in the next <m>
    slots.

Status
======

No status line is reported.

Example
=======

Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with
the same size.

Create a switch device with 64kB region size::

    dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0`
        switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"

Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1::

    dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1

Set repetitive mapping. This command::

    dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10

is equivalent to::

    dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
        :1 :2 :1 :2 :1 :2 :1 :2 :1 :2
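
The equivalence of the two commands above can be cross-checked with a
short Python sketch. It is illustrative only: it follows the message
description in this document rather than the kernel implementation, the
names ``parse_hex`` and ``expand_region_mappings`` are hypothetical, and
requiring an explicit index on the first mapping is an assumption of the
sketch.

```python
import string


def parse_hex(text):
    """Parse an unprefixed hexadecimal number (a leading 0x is rejected)."""
    if not text or any(ch not in string.hexdigits for ch in text):
        raise ValueError(f"not an unprefixed hex number: {text!r}")
    return int(text, 16)


def expand_region_mappings(args):
    """Expand set_region_mappings arguments into {region_index: path_nr}.

    Hypothetical helper: it models only the argument syntax described
    in this document and does not talk to device-mapper.
    """
    table = {}
    index = None  # no region has been written yet
    for arg in args:
        if arg.startswith("R"):
            # R<n>,<m>: repeat the last <n> mappings in the next <m> slots.
            n_text, m_text = arg[1:].split(",")
            n, m = parse_hex(n_text), parse_hex(m_text)
            for _ in range(m):
                if index is None or n == 0 or (index + 1 - n) not in table:
                    raise ValueError(f"nothing to repeat for {arg!r}")
                index += 1
                table[index] = table[index - n]
            continue
        index_text, path_text = arg.split(":")
        if index_text:
            index = parse_hex(index_text)
        elif index is None:
            # Assumption of this sketch: the first mapping names its region.
            raise ValueError("first mapping needs an explicit index")
        else:
            index += 1  # omitted index means previous index + 1
        table[index] = parse_hex(path_text)
    return table


short_form = expand_region_mappings("1000:1 :2 R2,10".split())
long_form = expand_region_mappings(
    "1000:1 :2 :1 :2 :1 :2 :1 :2 :1 :2 :1 :2 :1 :2 :1 :2 :1 :2".split()
)
```

Both argument lists cover the same 18 regions, 0x1000 through 0x1011,
with paths 1 and 2 alternating, so ``short_form`` and ``long_form``
compare equal.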