.. SPDX-License-Identifier: GPL-2.0

======================================================
Timekeeping Virtualization for X86-Based Architectures
======================================================

:Author: Zachary Amsden <zamsden@redhat.com>
:Copyright: (c) 2010, Red Hat.  All rights reserved.

.. Contents

   1) Overview
   2) Timing Devices
   3) TSC Hardware
   4) Virtualization Problems

1. Overview
===========

One of the most complicated parts of the X86 platform, and specifically of
virtualizing this platform, is the plethora of timing devices available and
the complexity of emulating those devices.  In addition, virtualization of
time introduces a new set of challenges because it introduces a multiplexed
division of time beyond the control of the guest CPU.

First, we will describe the various timekeeping hardware available, then
present some of the problems which arise and solutions available, giving
specific recommendations for certain classes of KVM guests.

The purpose of this document is to collect data and information relevant to
timekeeping which may be difficult to find elsewhere, specifically,
information relevant to KVM and hardware-based virtualization.

2. Timing Devices
=================

First we discuss the basic hardware devices available.  TSC and the related
KVM clock are special enough to warrant a full exposition and are described in
the following section.

2.1. i8254 - PIT
----------------

One of the first timer devices available is the programmable interval timer,
or PIT.  The PIT has a fixed frequency 1.193182 MHz base clock and three
channels which can be programmed to deliver periodic or one-shot interrupts.
These three channels can be configured in different modes and have individual
counters.  Channels 1 and 2 were not available for general use in the original
IBM PC, and historically were connected to control RAM refresh and the PC
speaker.  Now the PIT is typically integrated as part of an emulated chipset
and a separate physical PIT is not used.

The PIT uses I/O ports 0x40 - 0x43.  Access to the 16-bit counters is done
using single or multiple byte access to the I/O ports.  There are 6 modes
available, but not all modes are available to all timers, as only timer 2
has a connected gate input, required for modes 1 and 5.
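
As a rough illustration of how the fixed base clock and the 16-bit counters
interact, the sketch below computes a reload value for a requested interrupt
rate.  The helper name ``pit_reload_for_hz`` and the round-to-nearest policy
are assumptions made for this example and are not taken from KVM; treating a
reload value of 0 as a count of 65536 reflects how the 8254 behaves in binary
counting mode.

```c
#include <stdint.h>

/* Base clock of the PIT in Hz (1.193182 MHz, as stated in the text). */
#define PIT_BASE_HZ 1193182u

/*
 * Hypothetical helper: return the 16-bit reload value that makes a PIT
 * channel expire roughly `hz` times per second.
 */
static uint16_t pit_reload_for_hz(uint32_t hz)
{
    uint32_t div;

    if (hz == 0)
        return 0;           /* 0 encodes 65536, the slowest rate */

    /* Adding hz / 2 before dividing rounds to the nearest integer. */
    div = (PIT_BASE_HZ + hz / 2) / hz;

    if (div == 0)
        div = 1;            /* requested rate is above the base clock */
    if (div > 65535)
        return 0;           /* too slow for 16 bits: use 65536 */

    return (uint16_t)div;
}
```

For a requested 1000 Hz tick this yields a reload value of 1193, which gives
an actual rate of about 1000.15 Hz; the requested and delivered rates rarely
match exactly.
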
The gate line is controlled by port 61h, bit 0, as illustrated in the
following diagram::

  --------------             ----------------
 |              |           |                |
 |  1.1932 MHz  |---------->| CLOCK      OUT | ---------> IRQ 0
 |    Clock     |   |       |                |
  --------------    |    +->| GATE  TIMER 0  |
                    |        ----------------
                    |
                    |        ----------------
                    |       |                |
                    |------>| CLOCK      OUT | ---------> 66.3 kHz DRAM
                    |       |                |            (aka /dev/null)
                    |    +->| GATE  TIMER 1  |
                    |        ----------------
                    |
                    |        ----------------
                    |       |                |
                    |------>| CLOCK      OUT | ---------> Port 61h, bit 5
                    |       |                |      |
 Port 61h, bit 0 ---------->| GATE  TIMER 2  |       \_.----   ____
                             ----------------         _|    )--|LPF|---Speaker
                                                     / *----   \___/
 Port 61h, bit 1 -----------------------------------/

The timer modes are now described.

Mode 0: Single Timeout.
 This is a one-shot software timeout that counts down
 when the gate is high (always true for timers 0 and 1).  When the count
 reaches zero, the output goes high.

Mode 1: Triggered One-shot.
 The output is initially set high.  When the gate
 line is set high, a countdown is initiated (which does not stop if the gate is
 lowered), during which the output is set low.  When the count reaches zero,
 the output goes high.

Mode 2: Rate Generator.
 The output is initially set high.
 When the countdown
 reaches 1, the output goes low for one count and then returns high.  The value
 is reloaded and the countdown automatically resumes.  If the gate line goes
 low, the count is halted.  If the output is low when the gate is lowered, the
 output automatically goes high (this only affects timer 2).

Mode 3: Square Wave.
 This generates a high / low square wave.  The count
 determines the length of the pulse, which alternates between high and low
 when zero is reached.  The count only proceeds when gate is high and is
 automatically reloaded on reaching zero.  The count is decremented twice at
 each clock to generate a full high / low cycle at the full periodic rate.
 If the count is even, the clock remains high for N/2 counts and low for N/2
 counts; if the count is odd, the clock is high for (N+1)/2 counts and low
 for (N-1)/2 counts.  Only even values are latched by the counter, so odd
 values are not observed when reading.  This is the intended mode for timer 2,
 which generates sine-like tones by low-pass filtering the square wave output.

Mode 4: Software Strobe.
 After programming this mode and loading the counter,
 the output remains high until the counter reaches zero.  Then the output
 goes low for 1 clock cycle and returns high.  The counter is not reloaded.
 Counting only occurs when gate is high.

Mode 5: Hardware Strobe.
 After programming and loading the counter, the
 output remains high.
 When the gate is raised, a countdown is initiated
 (which does not stop if the gate is lowered).  When the counter reaches zero,
 the output goes low for 1 clock cycle and then returns high.  The counter is
 not reloaded.

In addition to normal binary counting, the PIT supports BCD counting.  The
command port, 0x43, is used to set the counter and mode for each of the three
timers.

PIT commands are issued to port 0x43 using the following bit encoding::

  Bit 7-4: Command (See table below)
  Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
  Bit 0  : Binary (0) / BCD (1)

Command table::

  0000 - Latch Timer 0 count for port 0x40
       sample and hold the count to be read in port 0x40;
       additional commands ignored until counter is read;
       mode bits ignored.

  0001 - Set Timer 0 LSB mode for port 0x40
       set timer to read LSB only and force MSB to zero;
       mode bits set timer mode

  0010 - Set Timer 0 MSB mode for port 0x40
       set timer to read MSB only and force LSB to zero;
       mode bits set timer mode

  0011 - Set Timer 0 16-bit mode for port 0x40
       set timer to read / write LSB first, then MSB;
       mode bits set timer mode

  0100 - Latch Timer 1 count for port 0x41 - as described above
  0101 - Set Timer 1 LSB mode for port 0x41 - as described above
  0110 - Set Timer 1 MSB mode for port 0x41 - as described above
  0111 - Set Timer 1 16-bit mode for port 0x41 - as described above

  1000 - Latch Timer 2 count for port 0x42 - as described above
  1001 - Set Timer 2 LSB mode for port 0x42 - as described above
  1010 - Set Timer 2 MSB mode for port 0x42 - as described above
  1011 - Set Timer 2 16-bit mode for port 0x42 - as described above

  1101 - General counter latch
       Latch combination of counters into corresponding ports
         Bit 3 = Counter 2
         Bit 2 = Counter 1
         Bit 1 = Counter 0
         Bit 0 = Unused

  1110 - Latch timer status
       Latch combination of counter mode into corresponding ports
         Bit 3 = Counter 2
         Bit 2 = Counter 1
         Bit 1 = Counter 0

       The output of ports 0x40-0x42 following this command will be:

       Bit 7 = Output pin
       Bit 6 = Count loaded (0 if timer has expired)
       Bit 5-4 = Read / Write mode
           01 = LSB only
           10 = MSB only
           11 = LSB / MSB (16-bit)
       Bit 3-1 = Mode
       Bit 0 = Binary (0) / BCD mode (1)

2.2. RTC
--------

The second device which was available in the original PC was the MC146818 real
time clock.  The original device is now obsolete, and usually emulated by the
system chipset, sometimes by an HPET and some frankenstein IRQ routing.

The RTC is accessed through CMOS variables, which use an index register to
control which bytes are read.  Since there is only one index register, reads
of the CMOS and reads of the RTC require lock protection (in addition, it is
dangerous to allow userspace utilities such as hwclock to have direct RTC
access, as they could corrupt kernel reads and writes of CMOS memory).

The RTC generates an interrupt which is usually routed to IRQ 8.  The interrupt
can function as a periodic timer, an additional once a day alarm, and can issue
interrupts after an update of the CMOS registers by the MC146818 is complete.
The type of interrupt is signalled in the RTC status registers.

The RTC will update the current time fields by battery power even while the
system is off.
The current time fields should not be read while an update is
in progress, as indicated in the status register.

The clock uses a 32.768 kHz crystal, so bits 6-4 of register A should be
programmed to a 32 kHz divider if the RTC is to count seconds.

This is the RAM map originally used for the RTC/CMOS::

  Location    Size    Description
  ------------------------------------------
  00h         byte    Current second (BCD)
  01h         byte    Seconds alarm (BCD)
  02h         byte    Current minute (BCD)
  03h         byte    Minutes alarm (BCD)
  04h         byte    Current hour (BCD)
  05h         byte    Hours alarm (BCD)
  06h         byte    Current day of week (BCD)
  07h         byte    Current day of month (BCD)
  08h         byte    Current month (BCD)
  09h         byte    Current year (BCD)
  0Ah         byte    Register A
                       bit 7   = Update in progress
                       bit 6-4 = Divider for clock
                                  000 = 4.194 MHz
                                  001 = 1.049 MHz
                                  010 = 32 kHz
                                  10X = test modes
                                  110 = reset / disable
                                  111 = reset / disable
                       bit 3-0 = Rate selection for periodic interrupt
                                  0000 = periodic timer disabled
                                  0001 = 3.90625 ms
                                  0010 = 7.8125 ms
                                  0011 = .122070 ms
                                  0100 = .244141 ms
                                     ...
                                  1101 = 125 ms
                                  1110 = 250 ms
                                  1111 = 500 ms
  0Bh         byte    Register B
                       bit 7   = Run (0) / Halt (1)
                       bit 6   = Periodic interrupt enable
                       bit 5   = Alarm interrupt enable
                       bit 4   = Update-ended interrupt enable
                       bit 3   = Square wave enable
                       bit 2   = BCD calendar (0) / Binary (1)
                       bit 1   = 12-hour mode (0) / 24-hour mode (1)
                       bit 0   = 0 (DST off) / 1 (DST enabled)
  0Ch         byte    Register C (read only)
                       bit 7   = interrupt request flag (IRQF)
                       bit 6   = periodic interrupt flag (PF)
                       bit 5   = alarm interrupt flag (AF)
                       bit 4   = update interrupt flag (UF)
                       bit 3-0 = reserved
  0Dh         byte    Register D (read only)
                       bit 7   = RTC has power
                       bit 6-0 = reserved
  32h         byte    Current century BCD (*)
  (*) location vendor specific and now determined from ACPI global tables

2.3. APIC
---------

On Pentium and later processors, an on-board timer is available to each CPU
as part of the Advanced Programmable Interrupt Controller.  The APIC is
accessed through memory-mapped registers and provides interrupt service to each
CPU, used for IPIs and local timer interrupts.

Although in theory the APIC is a safe and stable source for local interrupts,
in practice, many bugs and glitches have occurred due to the special nature of
the APIC CPU-local memory-mapped hardware.
Beware that CPU errata may affect
the use of the APIC and that workarounds may be required.  In addition, some of
these workarounds pose unique constraints for virtualization - requiring either
extra overhead incurred from extra reads of memory-mapped I/O or additional
functionality that may be more computationally expensive to implement.

Since the APIC is documented quite well in the Intel and AMD manuals, we will
avoid repetition of the detail here.  It should be pointed out that the APIC
timer is programmed through the LVT (local vector table) timer register, is
capable of one-shot or periodic operation, and is based on the bus clock
divided down by the programmable divider register.

2.4. HPET
---------

HPET is quite complex, and was originally intended to replace the PIT / RTC
support of the X86 PC.  It remains to be seen whether that will be the case, as
the de facto standard of PC hardware is to emulate these older devices.  Some
systems designated as legacy free may support only the HPET as a hardware timer
device.

The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
but allowing implementation freedom to support many more.  It also imposes no
fixed rate on the timer frequency, but does impose some extremal values on
frequency, error and slew.
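
To make the frequency bound concrete, here is a minimal sketch of how a
reported counter period could be checked and converted.  It assumes the main
counter period is reported in femtoseconds and may not exceed 100 ns (that is,
the counter runs at 10 MHz or faster), which is how the HPET specification
expresses its limit; the helper names are hypothetical and register access is
deliberately left out.

```c
#include <stdint.h>

#define FS_PER_SEC          1000000000000000ULL  /* 10^15 fs in a second */
#define HPET_MAX_PERIOD_FS  100000000u           /* 100 ns upper bound   */

/* Hypothetical helper: is the reported period within the allowed range? */
static int hpet_period_valid(uint32_t period_fs)
{
    return period_fs != 0 && period_fs <= HPET_MAX_PERIOD_FS;
}

/*
 * Hypothetical helper: convert a period in femtoseconds to a frequency
 * in Hz.  Returns 0 for a period outside the allowed range.
 */
static uint64_t hpet_freq_hz(uint32_t period_fs)
{
    if (!hpet_period_valid(period_fs))
        return 0;

    return FS_PER_SEC / period_fs;
}
```

A period of 100000000 fs corresponds to exactly 10 MHz, the slowest rate this
check accepts.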

In general, the HPET is recommended as a high precision (compared to PIT / RTC)
time source which is independent of local variation (as there is only one HPET
in any given system).  The HPET is also memory-mapped, and its presence is
indicated through ACPI tables by the BIOS.

Detailed specification of the HPET is beyond the current scope of this
document, as it is also very well documented elsewhere.

2.5. Offboard Timers
--------------------

Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
timing chips built into the cards which may have registers which are accessible
to kernel or user drivers.  To the author's knowledge, using these to generate
a clocksource for a Linux or other kernel has not yet been attempted and is in
general frowned upon as not playing by the agreed rules of the game.  Such a
timer device would require additional support to be virtualized properly and is
not considered important at this time as no known operating system does this.

3. TSC Hardware
===============

The TSC or time stamp counter is relatively simple in theory; it counts
processor clock cycles, which can be used as a measure of
time.  In practice, due to a number of problems, it is the most complicated
timekeeping device to use.
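
The "simple in theory" case can be sketched as a plain unit conversion.  The
example below assumes a TSC that runs at a constant, already known rate given
in kHz; the function name and the way the arithmetic is split are choices made
for this illustration, not the method used by KVM or Linux.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert the difference between two TSC samples
 * into nanoseconds, assuming a constant rate of `tsc_khz` kHz
 * (that is, tsc_khz cycles per millisecond).
 */
static uint64_t tsc_delta_to_ns(uint64_t start, uint64_t end,
                                uint32_t tsc_khz)
{
    uint64_t cycles, ms, rem;

    if (tsc_khz == 0)
        return 0;                   /* rate unknown: nothing to convert */

    /* Unsigned subtraction stays correct across a counter wrap. */
    cycles = end - start;

    /*
     * Split into whole milliseconds and leftover cycles so that the
     * multiplication by 10^6 cannot overflow for realistic intervals.
     */
    ms  = cycles / tsc_khz;
    rem = cycles % tsc_khz;

    return ms * 1000000ULL + (rem * 1000000ULL) / tsc_khz;
}
```

With a 2.4 GHz TSC (``tsc_khz`` of 2400000), a difference of 2400 cycles
converts to 1000 ns.  Everything that follows in this section describes ways
in which the constant-rate assumption made here can fail.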

The TSC is represented internally as a 64-bit MSR which can be read with the
RDMSR, RDTSC, or RDTSCP (when available) instructions.  The TSC has long been
writable, but on older hardware it was generally only possible to write the
low 32 bits of the 64-bit counter, and the upper 32 bits of the counter were
cleared.  Now, however, on Intel processors family 0Fh, for models 3, 4 and 6,
and family 06h, models e and f, this restriction has been lifted and all 64
bits are writable.  On AMD systems, the ability to write the TSC MSR is not an
architectural guarantee.

The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by
means of the CR4.TSD bit, which when enabled, disables CPL > 0 TSC access.

Some vendors have implemented an additional instruction, RDTSCP, which returns
atomically not just the TSC, but an indicator which corresponds to the
processor number.  This can be used to index into an array of TSC variables to
determine offset information in SMP systems where TSCs are not synchronized.
The presence of this instruction must be determined by consulting CPUID feature
bits.

Both VMX and SVM provide extension fields in the virtualization hardware which
allow the guest visible TSC to be offset by a constant.  Newer implementations
promise to allow the TSC to additionally be scaled, but this hardware is not
yet widely available.

3.1. TSC synchronization
------------------------

The TSC is a CPU-local clock in most implementations.  This means, on SMP
platforms, the TSCs of different CPUs may start at different times depending
on when the CPUs are powered on.  Generally, CPUs on the same die will share
the same clock, however, this is not always the case.

The BIOS may attempt to resynchronize the TSCs during the poweron process and
the operating system or other system software may attempt to do this as well.
Several hardware limitations make the problem worse - if it is not possible to
write the full 64 bits of the TSC, it may be impossible to match the TSC in
newly arriving CPUs to that of the rest of the system, resulting in
unsynchronized TSCs.  This may be done by BIOS or system software, but in
practice, getting a perfectly synchronized TSC will not be possible unless all
values are read from the same clock, which generally only is possible on single
socket systems or those with special hardware support.

3.2. TSC and CPU hotplug
------------------------

As touched on already, CPUs which arrive later than the boot time of the system
may not have a TSC value that is synchronized with the rest of the system.
Either system software, BIOS, or SMM code may actually try to establish the TSC
to a value matching the rest of the system, but a perfect match is usually not
a guarantee.
This can have the effect of bringing a system from a state where
TSC is synchronized back to a state where TSC synchronization flaws, however
small, may be exposed to the OS and any virtualization environment.

3.3. TSC and multi-socket / NUMA
--------------------------------

Multi-socket systems, especially large multi-socket systems are likely to have
individual clocksources rather than a single, universally distributed clock.
Since these clocks are driven by different crystals, they will not have
perfectly matched frequency, and temperature and electrical variations will
cause the CPU clocks, and thus the TSCs to drift over time.  Depending on the
exact clock and bus design, the drift may or may not be fixed in absolute
error, and may accumulate over time.

In addition, very large systems may deliberately slew the clocks of individual
cores.  This technique, known as spread-spectrum clocking, reduces EMI at the
clock frequency and harmonics of it, which may be required to pass FCC
standards for telecommunications and computer equipment.

It is recommended not to trust the TSCs to remain synchronized on NUMA or
multiple socket systems for these reasons.

3.4. TSC and C-states
---------------------

C-states, or idling states of the processor, especially C1E and deeper sleep
states may be problematic for TSC as well.
The TSC may stop advancing in such
a state, resulting in a TSC which is behind that of other CPUs when execution
is resumed.  Such CPUs must be detected and flagged by the operating system
based on CPU and chipset identifications.

The TSC in such a case may be corrected by catching it up to a known external
clocksource.

3.5. TSC frequency change / P-states
------------------------------------

To make things slightly more interesting, some CPUs may change frequency.  They
may or may not run the TSC at the same rate, and because the frequency change
may be staggered or slewed, at some points in time, the TSC rate may not be
known other than falling within a range of values.  In this case, the TSC will
not be a stable time source, and must be calibrated against a known, stable,
external clock to be a usable source of time.

Whether the TSC runs at a constant rate or scales with the P-state is model
dependent and must be determined by inspecting CPUID, chipset or vendor
specific MSR fields.

In addition, some vendors have known bugs where the P-state is actually
compensated for properly during normal operation, but when the processor is
inactive, the P-state may be raised temporarily to service cache misses from
other processors.  In such cases, the TSC on halted CPUs could advance faster
than that of non-halted processors.  AMD Turion processors are known to have
this problem.
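
The calibration against a stable external clock mentioned in this section
amounts to a ratio of two tick counts taken over the same interval.  The
following is only a sketch under stated assumptions: the function name is
hypothetical, the two deltas are assumed to have been sampled elsewhere, and
the calibration window is assumed to be short enough that the 64-bit product
does not overflow.

```c
#include <stdint.h>

/*
 * Hypothetical helper: estimate the TSC rate in kHz.
 *
 * tsc_delta - TSC cycles that elapsed during the calibration window
 * ref_ticks - ticks of the reference clock during the same window
 * ref_hz    - nominal rate of the reference clock in Hz
 *
 * tsc_hz = tsc_delta * ref_hz / ref_ticks, reported here in kHz.
 */
static uint32_t tsc_khz_from_reference(uint64_t tsc_delta,
                                       uint64_t ref_ticks,
                                       uint32_t ref_hz)
{
    if (ref_ticks == 0)
        return 0;                   /* no reference interval measured */

    /* Assumes tsc_delta * ref_hz fits in 64 bits (short window). */
    return (uint32_t)((tsc_delta * ref_hz) / (ref_ticks * 1000ULL));
}
```

For example, 240000000 TSC cycles observed across 100 ticks of a 1000 Hz
reference give an estimate of 2400000 kHz.  The estimate is only as good as
the reference clock and the sampling of both counters, and it becomes stale
as soon as the TSC rate changes.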

3.6. TSC and STPCLK / T-states
------------------------------

External signals given to the processor may also have the effect of stopping
the TSC.  This is typically done for thermal emergency power control to prevent
an overheating condition, and typically, there is no way to detect that this
condition has happened.

3.7. TSC virtualization - VMX
-----------------------------

VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
instructions, which is enough for full virtualization of TSC in any manner.  In
addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
field specified in the VMCS.  Special instructions must be used to read and
write the VMCS field.

3.8. TSC virtualization - SVM
-----------------------------

SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
instructions, which is enough for full virtualization of TSC in any manner.  In
addition, SVM allows passing through the host TSC plus an additional offset
field specified in the SVM control block.

3.9. TSC feature bits in Linux
------------------------------

In summary, there is no way to guarantee the TSC remains in perfect
synchronization unless it is explicitly guaranteed by the architecture.
Even
if so, the TSCs in multi-socket or NUMA systems may still run independently
despite being locally consistent.

The following feature bits are used by Linux to signal various TSC attributes,
but they can only be taken to be meaningful for UP or single node systems.

=========================  =======================================
X86_FEATURE_TSC            The TSC is available in hardware
X86_FEATURE_RDTSCP         The RDTSCP instruction is available
X86_FEATURE_CONSTANT_TSC   The TSC rate is unchanged with P-states
X86_FEATURE_NONSTOP_TSC    The TSC does not stop in C-states
X86_FEATURE_TSC_RELIABLE   TSC sync checks are skipped (VMware)
=========================  =======================================

4. Virtualization Problems
==========================

Timekeeping is especially problematic for virtualization because a number of
challenges arise.  The most obvious problem is that time is now shared between
the host and, potentially, a number of virtual machines.  Thus the virtual
operating system does not run with 100% usage of the CPU, despite the fact that
it may very well make that assumption.  It may expect it to remain true to very
exacting bounds when interrupt sources are disabled, but in reality only its
virtual interrupt sources are disabled, and the machine may still be preempted
at any time.
This causes problems as the passage of real time, the injection
of machine interrupts and the associated clock sources are no longer completely
synchronized with real time.

This same problem can occur on native hardware to a degree, as SMM mode may
steal cycles from the operating system on X86 systems when it is used by the
BIOS, but not in such an extreme fashion.  However, the fact that SMM mode may
cause similar problems to virtualization makes it a good justification for
solving many of these problems on bare metal.

4.1. Interrupt clocking
-----------------------

One of the most immediate problems that occurs with legacy operating systems
is that the system timekeeping routines are often designed to keep track of
time by counting periodic interrupts.  These interrupts may come from the PIT
or the RTC, but the problem is the same: the host virtualization engine may not
be able to deliver the proper number of interrupts per second, and so guest
time may fall behind.  This is especially problematic if a high interrupt rate
is selected, such as 1000 Hz, which is unfortunately the default for many Linux
guests.

There are three approaches to solving this problem; first, it may be possible
to simply ignore it.  Guests which have a separate time source for tracking
'wall clock' or 'real time' may not need any adjustment of their interrupts to
maintain proper time.
If this is not sufficient, it may be necessary to inject
additional interrupts into the guest in order to increase the effective
interrupt rate. This approach leads to complications in extreme conditions,
where host load or guest lag is too much to compensate for, and thus another
solution to the problem has arisen: the guest may need to become aware of lost
ticks and compensate for them internally. Although promising in theory, the
implementation of this policy in Linux has been extremely error-prone, and a
number of buggy variants of lost tick compensation are distributed across
commonly used Linux systems.

Windows uses periodic RTC clocking as a means of keeping time internally, and
thus requires interrupt slewing to keep proper time. It does use a low enough
rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
practice.

4.2. TSC sampling and serialization
-----------------------------------

As the highest precision time source available, the cycle counter of the CPU
has aroused much interest from developers. As explained above, this timer has
many problems unique to its nature as a local, potentially unstable and
potentially unsynchronized source. One issue which is not unique to the TSC,
but is highlighted because of its very precise nature, is sampling delay. By
definition, the counter, once read, is already old. However, it is also
possible for the counter to be read ahead of the actual use of the result.
This is a consequence of the superscalar execution of the instruction stream,
which may execute instructions out of order. Such execution is called
non-serialized. Forcing serialized execution is necessary for precise
measurement with the TSC, and requires a serializing instruction, such as CPUID
or an MSR read.

Since CPUID may actually be virtualized by a trap-and-emulate mechanism, this
serialization can pose a performance issue for hardware virtualization. An
accurate time stamp counter reading may therefore not always be available, and
it may be necessary for an implementation to guard against "backwards" reads of
the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
system.

4.3. Timespec aliasing
----------------------

Additionally, this lack of serialization from the TSC poses another challenge
when using results of the TSC when measured against another time source. As
the TSC is much higher precision, many possible values of the TSC may be read
while another clock is still expressing the same value.

That is, you may read (T,T+10) while external clock C maintains the same value.
Due to non-serialized reads, you may actually end up with a range which
fluctuates - from (T-1.. T+10). Thus, any time calculated from a TSC, but
calibrated against an external value, may have a range of valid values.
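
As a rough illustration of this hazard, the following C sketch derives a
nanosecond clock from a TSC sample and a single calibration point, and clamps
successive results so that a caller never observes time stepping backwards.
All names here (``struct calib``, ``tsc_to_ns``, ``clamp_monotonic``) are
hypothetical and chosen for this example only; they are not kernel or KVM
interfaces, and a real implementation would also need to handle concurrency
and arithmetic overflow.

```c
#include <stdint.h>

/*
 * Hypothetical calibration point: when the TSC read base_tsc, the coarse
 * external clock read base_ns. tsc_khz is an assumed, already known rate.
 */
struct calib {
    uint64_t base_tsc;
    uint64_t base_ns;
    uint64_t tsc_khz;
};

/*
 * Convert a TSC sample to nanoseconds relative to one calibration point.
 * Assumes the delta is small enough that delta * 1000000 fits in 64 bits.
 */
static uint64_t tsc_to_ns(const struct calib *c, uint64_t tsc)
{
    uint64_t delta = tsc - c->base_tsc;

    return c->base_ns + delta * 1000000ULL / c->tsc_khz;
}

/*
 * Return now_ns unless it is older than the newest value already handed
 * out, in which case repeat that newest value instead.
 */
static uint64_t clamp_monotonic(uint64_t *last_ns, uint64_t now_ns)
{
    if (now_ns < *last_ns)
        return *last_ns;
    *last_ns = now_ns;
    return now_ns;
}
```

With a 1 GHz TSC, a calibration of (base_tsc 1000, base_ns 5000) maps TSC 4000
to 8000 ns. If a later calibration pairs TSC 4500 with a coarse clock value of
7000 ns, TSC 4600 maps to only 7100 ns, and the clamp repeats 8000 ns rather
than reporting the smaller value.
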
Re-calibrating this computation may actually cause time, as computed after the
calibration, to go backwards, compared with time computed before the
calibration.

This problem is particularly pronounced with an internal time source in Linux,
the kernel time, which is expressed in the theoretically high resolution
timespec - but which advances in much larger granularity intervals, sometimes
at the rate of jiffies, and possibly in catchup modes, at a much larger step.

This aliasing requires care in the computation and recalibration of kvmclock
and any other values derived from TSC computation (such as TSC virtualization
itself).

4.4. Migration
--------------

Migration of a virtual machine raises problems for timekeeping in two ways.
First, the migration itself may take time, during which interrupts cannot be
delivered, and after which, the guest time may need to be caught up. NTP may
be able to help to some degree here, as the clock correction required is
typically small enough to fall in the NTP-correctable window.

An additional concern is that timers based off the TSC (or HPET, if the raw bus
clock is exposed) may now be running at different rates, requiring compensation
in some way in the hypervisor by virtualizing these timers.
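
One way to picture such compensation is a hypervisor that rescales each host
TSC reading to the rate the guest saw before migration and then applies an
offset. The C sketch below is only an assumption-laden illustration: the
function names, the kHz units and the offset parameter are invented for this
example and do not describe the actual KVM implementation.

```c
#include <stdint.h>

/*
 * Scale a host TSC value to the guest's expected rate, i.e. compute
 * floor(host_tsc * guest_khz / host_khz). Splitting the value into
 * quotient and remainder keeps the remainder product within 64 bits
 * because both rates are 32-bit; the result itself is assumed to fit
 * in 64 bits.
 */
static uint64_t scale_tsc(uint64_t host_tsc, uint32_t guest_khz,
                          uint32_t host_khz)
{
    uint64_t whole = host_tsc / host_khz;
    uint64_t rem = host_tsc % host_khz;

    return whole * guest_khz + rem * guest_khz / host_khz;
}

/*
 * Hypothetical guest-visible TSC: the scaled host value plus a signed
 * offset, for example one chosen so the counter continues from its
 * pre-migration value.
 */
static uint64_t guest_tsc(uint64_t host_tsc, uint32_t guest_khz,
                          uint32_t host_khz, int64_t offset)
{
    return scale_tsc(host_tsc, guest_khz, host_khz) + (uint64_t)offset;
}
```

For a guest that started on a 2 GHz host and now runs on a 3 GHz host,
``scale_tsc(3000000000, 2000000, 3000000)`` yields 2000000000, so one second of
host cycles still appears to the guest as one second at its original rate.
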
In addition,
migrating to a faster machine may preclude the use of a passthrough TSC, as a
faster clock cannot be made visible to a guest without the potential of time
advancing faster than usual. A slower clock is less of a problem, as it can
always be caught up to the original rate. KVM clock avoids these problems by
simply storing multipliers and offsets against the TSC for the guest to convert
back into nanosecond resolution values.

4.5. Scheduling
---------------

Since scheduling may be based on precise timing and firing of interrupts, the
scheduling algorithms of an operating system may be adversely affected by
virtualization. In theory, the effect is random and should be uniformly
distributed, but in contrived as well as real scenarios (guest device access,
causes of virtualization exits, possible context switch), this may not always
be the case. The effect of this has not been well studied.

In an attempt to work around this, several implementations have provided a
paravirtualized scheduler clock, which reveals the true amount of CPU time for
which a virtual machine has been running.

4.6. Watchdogs
--------------

Watchdog timers, such as the lockup detector in Linux, may fire accidentally
when running under hardware virtualization due to timer interrupts being
delayed or misinterpretation of the passage of real time.
Usually, these warnings are
spurious and can be ignored, but in some circumstances it may be necessary to
disable such detection.

4.7. Delays and precision timing
--------------------------------

Precise timing and delays may not be possible in a virtualized system. This
can happen if the system is controlling physical hardware, or issuing delays to
compensate for slower I/O to and from devices. The first issue is not solvable
in general for a virtualized system; hardware control software can't be
adequately virtualized without a full real-time operating system, which would
require an RT-aware virtualization platform.

The second issue may cause performance problems, but this is unlikely to be a
significant issue. In many cases these delays may be eliminated through
configuration or paravirtualization.

4.8. Covert channels and leaks
------------------------------

In addition to the above problems, time information will inevitably leak to the
guest about the host in anything but a perfect implementation of virtualized
time. This may allow the guest to infer the presence of a hypervisor (as in a
red-pill type detection), and it may allow information to leak between guests
by using CPU utilization itself as a signalling channel. Preventing such
problems would require completely isolated virtual time which may not track
real time any longer.
This may be useful in certain security or QA contexts,
but in general isn't recommended for real-world deployment scenarios.