.. SPDX-License-Identifier: GPL-2.0+

======================================================
IBM Virtual Management Channel Kernel Driver (IBMVMC)
======================================================

:Authors:
    Dave Engebretsen <engebret@us.ibm.com>,
    Adam Reznechek <adreznec@linux.vnet.ibm.com>,
    Steven Royer <seroyer@linux.vnet.ibm.com>,
    Bryant G. Ly <bryantly@linux.vnet.ibm.com>

Introduction
============

Note: Knowledge of virtualization technology is required to understand
this document.

A good reference document would be:

https://openpowerfoundation.org/wp-content/uploads/2016/05/LoPAPR_DRAFT_v11_24March2016_cmt1.pdf

The Virtual Management Channel (VMC) is a logical device which provides an
interface between the hypervisor and a management partition. This interface
is like a message passing interface. The management partition is intended
to provide an alternative to Hardware Management Console (HMC) based
system management.

The primary hardware management solution developed by IBM relies on an
appliance server named the Hardware Management Console (HMC), packaged as
an external tower or rack-mounted personal computer. In a Power Systems
environment, a single HMC can manage multiple POWER processor-based
systems.

Management Application
----------------------

In the management partition, a management application exists which enables
a system administrator to configure the system's partitioning
characteristics via a command-line interface (CLI) or Representational
State Transfer (REST) APIs.

The management application runs on a Linux logical partition on a
POWER8 or newer processor-based server that is virtualized by PowerVM.
System configuration, maintenance, and control functions which
traditionally require an HMC can be implemented in the management
application using a combination of HMC-to-hypervisor interfaces and
existing operating system methods. This tool provides a subset of the
functions implemented by the HMC and enables basic partition configuration.
The set of HMC-to-hypervisor messages supported by the management
application component is passed to the hypervisor over a VMC interface,
which is defined below.

The VMC enables the management partition to provide basic partitioning
functions:

- Logical Partitioning Configuration
- Start and stop actions for individual partitions
- Display of partition status
- Management of virtual Ethernet
- Management of virtual storage
- Basic system management

Virtual Management Channel (VMC)
--------------------------------

A logical device, called the Virtual Management Channel (VMC), is defined
for communicating between the management application and the hypervisor. It
provides the pipes that enable virtualization management software. This
device is presented to a designated management partition as a virtual
device.

This communication device uses the Command/Response Queue (CRQ) and
Remote Direct Memory Access (RDMA) interfaces. A three-way handshake is
defined that must take place to establish that both the hypervisor and
management partition sides of the channel are running prior to
sending/receiving any of the protocol messages.

This driver also utilizes Transport Event CRQs. CRQ messages are sent
when the hypervisor detects that one of the peer partitions has abnormally
terminated, or when one side has called H_FREE_CRQ to close its CRQ.
Two new classes of CRQ messages are introduced for the VMC device. VMC
Administrative messages are used by each partition using the VMC to
communicate capabilities to its partner. HMC Interface messages are used
for the actual flow of HMC messages between the management partition and
the hypervisor. As most HMC messages far exceed the size of a CRQ buffer,
a virtual DMA (RDMA) of the HMC message data is done prior to each HMC
Interface CRQ message. Only the management partition drives RDMA
operations; the hypervisor never directly causes the movement of message
data.


Terminology
-----------
RDMA
    Remote Direct Memory Access is a DMA transfer from the server to its
    client or from the server to its partner partition. DMA refers both
    to physical I/O to and from memory operations and to memory-to-memory
    move operations.
CRQ
    Command/Response Queue, a facility used to communicate between
    partner partitions. Transport events signaled from the hypervisor to
    the partition are also reported in this queue.

Example Management Partition VMC Driver Interface
=================================================

This section provides an example of a management application
implementation in which a device driver is used to interface to the VMC
device. The driver provides a new device, for example /dev/ibmvmc, which
supports open, close, read, write, and ioctl operations against the VMC
device.

VMC Interface Initialization
----------------------------

The device driver is responsible for initializing the VMC when the driver
is loaded. It first creates and initializes the CRQ. Next, an exchange of
VMC capabilities is performed to indicate the code version and the number
of resources available in both the management partition and the hypervisor.
Finally, the hypervisor requests that the management partition create an
initial pool of VMC buffers, one buffer for each possible HMC connection,
which will be used for management application session initialization.
Prior to completion of this initialization sequence, the device returns
EBUSY to open() calls. EIO is returned for all other open() failures.

::

        Management Partition              Hypervisor
                        CRQ INIT
        ---------------------------------------->
                   CRQ INIT COMPLETE
        <----------------------------------------
                      CAPABILITIES
        ---------------------------------------->
                 CAPABILITIES RESPONSE
        <----------------------------------------
             ADD BUFFER (HMC IDX=0,1,..)          _
        <----------------------------------------  |
                 ADD BUFFER RESPONSE               | - Perform # HMCs Iterations
        ---------------------------------------->  -

VMC Interface Open
------------------

After the basic VMC channel has been initialized, an HMC session level
connection can be established. The application layer performs an open() of
the VMC device and executes an ioctl() against it, indicating the HMC ID
(32 bytes of data) for this session. If the VMC device is in an invalid
state, EIO will be returned for the ioctl(). The device driver creates a
new HMC session value (ranging from 1 to 255) and HMC index value (starting
at index 0 and ranging to 254) for this HMC ID. The driver then does an
RDMA of the HMC ID to the hypervisor, and then sends an Interface Open
message to the hypervisor to establish the session over the VMC. After the
hypervisor receives this information, it sends Add Buffer messages to the
management partition to seed an initial pool of buffers for the new HMC
connection. Finally, the hypervisor sends an Interface Open Response
message to indicate that it is ready for normal runtime messaging. The
following illustrates this VMC flow:

::

        Management Partition              Hypervisor
                     RDMA HMC ID
        ---------------------------------------->
                   Interface Open
        ---------------------------------------->
                      Add Buffer                   _
        <----------------------------------------  |
                 Add Buffer Response               | - Perform N Iterations
        ---------------------------------------->  -
               Interface Open Response
        <----------------------------------------
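
A minimal userspace sketch of this open()/ioctl() sequence is shown below.
The EBUSY and EIO handling follows the description above; the header
<linux/ibmvmc.h> and the ioctl request name VMC_IOCTL_SETHMCID are
assumptions made for illustration and should be checked against the
driver's actual uapi header.

::

    /* Sketch only: open /dev/ibmvmc and bind it to one HMC session. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/ibmvmc.h>   /* assumed location of VMC_IOCTL_SETHMCID */

    #define HMC_ID_LEN 32       /* the HMC ID is 32 bytes of data */

    static int open_vmc_session(const unsigned char hmc_id[HMC_ID_LEN])
    {
        int fd;

        for (;;) {
            fd = open("/dev/ibmvmc", O_RDWR);
            if (fd >= 0)
                break;
            if (errno != EBUSY) {
                perror("open /dev/ibmvmc");  /* EIO on other failures */
                return -1;
            }
            sleep(1);   /* VMC initialization sequence not yet complete */
        }

        /* Pass the 32-byte HMC ID; EIO means the device is in an invalid state. */
        if (ioctl(fd, VMC_IOCTL_SETHMCID, hmc_id) < 0) {
            perror("ioctl VMC_IOCTL_SETHMCID");
            close(fd);
            return -1;
        }

        return fd;      /* fd now carries all traffic for this HMC session */
    }

Once open_vmc_session() returns a file descriptor, that descriptor is used
for all subsequent read() and write() traffic for the session.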

VMC Interface Runtime
---------------------

During normal runtime, the management application and the hypervisor
exchange HMC messages via the Signal VMC message and RDMA operations. When
sending data to the hypervisor, the management application performs a
write() to the VMC device, and the driver RDMAs the data to the hypervisor
and then sends a Signal Message. If a write() is attempted before VMC
device buffers have been made available by the hypervisor, or if no buffers
are currently available, EBUSY is returned in response to the write(). A
write() will return EIO for all other errors, such as an invalid device
state. When the hypervisor sends a message to the management partition, the
data is put into a VMC buffer and a Signal Message is sent to the VMC
driver in the management partition. The driver RDMAs the buffer into the
partition and passes the data up to the appropriate management application
via a read() of the VMC device. The read() request blocks if there is no
buffer available to read. The management application may use select() to
wait for the VMC device to become ready with data to read.

::

        Management Partition              Hypervisor
                      MSG RDMA
        ---------------------------------------->
                     SIGNAL MSG
        ---------------------------------------->
                     SIGNAL MSG
        <----------------------------------------
                      MSG RDMA
        <----------------------------------------
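
The following is a minimal sketch of the corresponding userspace send and
receive paths, assuming a file descriptor obtained as in the previous
example. The one-second retry interval is illustrative only; the EBUSY,
EIO, and blocking read()/select() behaviour is as described above.

::

    /* Sketch only: exchange HMC messages over an open VMC session fd. */
    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/select.h>

    /* Send one HMC message; the driver performs the RDMA and Signal Message. */
    static int vmc_send(int fd, const void *msg, size_t len)
    {
        for (;;) {
            ssize_t rc = write(fd, msg, len);

            if (rc >= 0)
                return 0;
            if (errno != EBUSY) {
                perror("write /dev/ibmvmc");  /* EIO: invalid state, etc. */
                return -1;
            }
            sleep(1);   /* no VMC buffer currently available; retry */
        }
    }

    /* Receive one HMC message, waiting until the device has data to read. */
    static ssize_t vmc_recv(int fd, void *buf, size_t len)
    {
        fd_set rfds;

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        if (select(fd + 1, &rfds, NULL, NULL, NULL) < 0) {
            perror("select /dev/ibmvmc");
            return -1;
        }

        /* read() would also simply block until a buffer is available. */
        return read(fd, buf, len);
    }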

VMC Interface Close
-------------------

HMC session level connections are closed by the management partition when
the application layer performs a close() against the device. This action
results in an Interface Close message flowing to the hypervisor, which
causes the session to be terminated. The device driver must free any
storage allocated for buffers for this HMC connection.

::

        Management Partition              Hypervisor
                  INTERFACE CLOSE
        ---------------------------------------->
              INTERFACE CLOSE RESPONSE
        <----------------------------------------

Additional Information
======================

For more information on the documentation for CRQ Messages, VMC Messages,
HMC interface buffers, and Signal Messages, please refer to the Linux on
Power Architecture Platform Reference, Section F.