=========
dm-switch
=========

The device-mapper switch target creates a device that supports an
arbitrary mapping of fixed-size regions of I/O across a fixed set of
paths.  The path used for any specific region can be switched
dynamically by sending the target a message.

It maps I/O to underlying block devices efficiently when there is a large
number of fixed-size address regions but no simple pattern that would
allow a compact representation of the mapping, such as the one
dm-stripe provides.

Background
----------

Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture.  In this architecture, the storage group
consists of a number of distinct storage arrays ("members"), each having
independent controllers, disk storage and network adapters.  When a LUN
is created, it is spread across multiple members.  The details of the
spreading are hidden from initiators connected to this storage system.
The storage group exposes a single target discovery portal, no matter
how many members are being used.  When iSCSI sessions are created, each
session is connected to an Ethernet port on a single member.  Data to a
LUN can be sent on any iSCSI session, and if the blocks being accessed
are stored on another member, the I/O will be forwarded as required.
This forwarding is invisible to the initiator.  The storage layout is
also dynamic, and the blocks stored on disk may be moved from member to
member as needed to balance the load.

This architecture simplifies the management and configuration of both
the storage group and initiators.  In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth.  An initiator could use a simple
round-robin algorithm to send I/O across all paths and let the storage
array members forward it as necessary, but there is a performance
advantage to sending data directly to the correct member.

A device-mapper table already lets you map different regions of a
device onto different targets.  However, in this architecture the LUN is
spread with an address region size on the order of tens of megabytes,
which means the resulting table could have more than a million entries
and consume far too much memory.
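
To make the scale concrete, the entry count can be estimated with shell
arithmetic.  This is only a sketch: the 16 TiB LUN size and 16 MiB region
size are assumed values chosen for illustration, not figures from any
particular array.

```shell
# Hypothetical sizes, chosen only to illustrate the scale of the problem.
lun_bytes=$((16 * 1024 * 1024 * 1024 * 1024))   # assumed 16 TiB LUN
region_bytes=$((16 * 1024 * 1024))              # assumed 16 MiB region

# A plain dm table would need one line per region.
region_count=$((lun_bytes / region_bytes))
echo "$region_count"
```

With these assumed sizes the table would need 1048576 lines, one per
region.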

Using this device-mapper switch target we can now build a two-layer
device hierarchy:

    Upper Tier - Determine which array member the I/O should be sent to.
    Lower Tier - Load balance amongst paths to a particular member.

The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths.  We also build a
non-preferred priority group containing paths to other array members,
which is used for failover.

The upper tier consists of a single dm-switch device.  This device uses
a bitmap to look up the location of the I/O and choose the appropriate
lower tier device to route the I/O.  By using a bitmap we need only
4 bits for each address range in a 16-member group (which is very
large for us).  This is a much denser representation than the dm table
b-tree can achieve.
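
The density claim can be checked with a rough size estimate.  The sketch
below assumes that entries are packed back to back with no padding, which
may not match the in-kernel layout exactly; the helper names
``bits_per_region`` and ``table_bytes`` and the region count are
hypothetical.

```shell
# Sketch only: assumes entries are packed back to back with no padding.
# The in-kernel layout may round differently.

# bits_per_region NUM_PATHS: smallest entry width (at least 1 bit) that can
# hold every path number in 0 .. NUM_PATHS - 1.
bits_per_region() {
    n=$1
    bits=1
    while [ $((1 << bits)) -lt "$n" ]; do
        bits=$((bits + 1))
    done
    echo "$bits"
}

# table_bytes NUM_REGIONS NUM_PATHS: bytes needed for the packed table,
# rounded up to a whole byte.
table_bytes() {
    echo $(( ($1 * $(bits_per_region "$2") + 7) / 8 ))
}

# Hypothetical volume: 1048576 regions spread over a 16-member group.
table_bytes 1048576 16
```

Under these assumptions a million-region volume needs 524288 bytes
(512 KiB) of mapping data.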

Construction Parameters
=======================

    <num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+
	<num_paths>
	    The number of paths across which to distribute the I/O.

	<region_size>
	    The number of 512-byte sectors in a region. Each region can be redirected
	    to any of the available paths.

	<num_optional_args>
	    The number of optional arguments. Currently, no optional arguments
	    are supported and so this must be zero.

	<dev_path>
	    The block device that represents a specific path to the device.

	<offset>
	    The offset of the start of data on the specific <dev_path> (in units
	    of 512-byte sectors). This number is added to the sector number when
	    forwarding the request to the specific path. Typically it is zero.
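
The parameters above can be assembled into a table line with a small
helper.  This is a minimal sketch, not part of dmsetup: the functions
``kib_to_sectors`` and ``switch_table`` and the device length of 2097152
sectors are hypothetical, and every path is given an offset of zero.

```shell
# kib_to_sectors KIB: convert a region size in KiB to 512-byte sectors.
kib_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}

# switch_table LENGTH REGION_SECTORS DEV...: print a dm-switch table line.
# num_paths is taken from the number of devices, num_optional_args is 0
# and every path gets offset 0.
switch_table() {
    length=$1
    region=$2
    shift 2
    line="0 $length switch $# $region 0"
    for dev in "$@"; do
        line="$line $dev 0"
    done
    echo "$line"
}

# Hypothetical length of 2097152 sectors (1 GiB), 64 KiB regions.
switch_table 2097152 "$(kib_to_sectors 64)" \
    /dev/vg1/switch0 /dev/vg1/switch1 /dev/vg1/switch2
```

The printed line can be passed to ``dmsetup create`` through the
``--table`` option.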

Messages
========

set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...

Modify the region table by specifying which regions are redirected to
which paths.

<index>
    The region number (region size was specified in the construction
    parameters). If index is omitted, the next region (previous index + 1)
    is used. Expressed in hexadecimal (WITHOUT any prefix like 0x).

<path_nr>
    The path number in the range 0 ... (<num_paths> - 1).
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

R<n>,<m>
    This parameter allows repetitive patterns to be loaded quickly. <n> and <m>
    are hexadecimal numbers. The last <n> mappings are repeated in the next <m>
    slots.
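
The hexadecimal encoding and the R<n>,<m> shorthand can be illustrated
with two user-space helpers.  This is a sketch of the semantics described
above, not the kernel's parser: the names ``region_index`` and
``expand_mappings`` are hypothetical, and no input validation is done.

```shell
# region_index SECTOR REGION_SECTORS: region number that contains SECTOR,
# printed in hexadecimal without a 0x prefix, as the message expects.
region_index() {
    printf '%x\n' $(( $1 / $2 ))
}

# expand_mappings ARG...: print the argument list with every R<n>,<m>
# replaced by the explicit mappings it stands for.  Each of the next <m>
# slots receives the path that was set <n> slots earlier.
expand_mappings() {
    out=""
    hist=""                 # path numbers assigned so far, in order
    for arg in "$@"; do
        case $arg in
        R*)
            spec=${arg#R}
            n=$((0x${spec%,*}))
            m=$((0x${spec#*,}))
            i=0
            while [ "$i" -lt "$m" ]; do
                set -- $hist
                shift $(( $# - n ))   # $1 is now the path n slots back
                out="$out :$1"
                hist="$hist $1"
                i=$((i + 1))
            done
            ;;
        *)
            out="${out:+$out }$arg"
            hist="$hist ${arg#*:}"
            ;;
        esac
    done
    echo "$out"
}

region_index 1000000 128
expand_mappings 1000:1 :2 R2,10
```

Because <m> is hexadecimal, ``R2,10`` fills 16 slots, not 10.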

Status
======

No status line is reported.

Example
=======

Assume that you have volumes vg1/switch0, vg1/switch1 and vg1/switch2 of
the same size.

Create a switch device with a 64 KiB region size (128 sectors of 512
bytes)::

    dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0`
	switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"

Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1::

    dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1

Set a repetitive mapping. This command::

    dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10

is equivalent to::

    dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
	:1 :2 :1 :2 :1 :2 :1 :2 :1 :2