.. SPDX-License-Identifier: GPL-2.0

======================================================
Timekeeping Virtualization for X86-Based Architectures
======================================================

:Author: Zachary Amsden <zamsden@redhat.com>
:Copyright: (c) 2010, Red Hat.  All rights reserved.

.. Contents

   1) Overview
   2) Timing Devices
   3) TSC Hardware
   4) Virtualization Problems

1. Overview
===========

One of the most complicated parts of the X86 platform, and specifically of
the virtualization of this platform, is the plethora of timing devices
available and the complexity of emulating those devices.  In addition,
virtualization of time introduces a new set of challenges because it
introduces a multiplexed division of time beyond the control of the guest CPU.

First, we will describe the various timekeeping hardware available, then
present some of the problems which arise and solutions available, giving
specific recommendations for certain classes of KVM guests.

The purpose of this document is to collect data and information relevant to
timekeeping which may be difficult to find elsewhere, specifically,
information relevant to KVM and hardware-based virtualization.

2. Timing Devices
=================

First we discuss the basic hardware devices available.  TSC and the related
KVM clock are special enough to warrant a full exposition and are described in
the following section.

2.1. i8254 - PIT
----------------

One of the first timer devices available is the programmable interval timer,
or PIT.  The PIT has a fixed frequency 1.193182 MHz base clock and three
channels which can be programmed to deliver periodic or one-shot interrupts.
These three channels can be configured in different modes and have individual
counters.  Channels 1 and 2 were not available for general use in the original
IBM PC, and historically were connected to control RAM refresh and the PC
speaker.  Now the PIT is typically integrated as part of an emulated chipset
and a separate physical PIT is not used.
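
As a rough illustration of the base clock arithmetic, the sketch below derives
a 16-bit reload value for a desired periodic rate.  The function names are
hypothetical; the 1.193182 MHz figure comes from the text above, and the
convention that a programmed count of 0 is treated as 65536 is taken from the
8254 datasheet rather than from this document.

```c
#include <stdint.h>

#define PIT_BASE_HZ 1193182u   /* fixed PIT input clock, from the text */

/* Reload value that makes a periodic PIT channel fire at roughly `hz`.
 * The counter is 16 bits wide and a programmed value of 0 counts as
 * 65536 (datasheet convention), so the result is clamped to 1..65536. */
uint32_t pit_reload_for_hz(uint32_t hz)
{
    uint32_t reload;

    if (hz == 0)
        return 65536;                       /* slowest possible rate */
    reload = (PIT_BASE_HZ + hz / 2) / hz;   /* round to nearest */
    if (reload < 1)
        reload = 1;
    if (reload > 65536)
        reload = 65536;
    return reload;
}

/* Rate actually produced by a reload value, in millihertz, to show the
 * rounding error that a guest or emulator has to live with. */
uint64_t pit_actual_millihertz(uint32_t reload)
{
    return ((uint64_t)PIT_BASE_HZ * 1000u) / reload;
}
```

With the maximum reload value the channel ticks at about 18.2 Hz, which is the
slowest rate a single PIT channel can generate.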

The PIT uses I/O ports 0x40 - 0x43.  Access to the 16-bit counters is done
using single or multiple byte access to the I/O ports.  There are 6 modes
available, but not all modes are available to all timers, as only timer 2
has a connected gate input, required for modes 1 and 5.  The gate line is
controlled by port 61h, bit 0, as illustrated in the following diagram::

  --------------             ----------------
  |            |           |                |
  |  1.1932 MHz|---------->| CLOCK      OUT | ---------> IRQ 0
  |    Clock   |   |       |                |
  --------------   |    +->| GATE  TIMER 0  |
                   |        ----------------
                   |
                   |        ----------------
                   |       |                |
                   |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM
                   |       |                |            (aka /dev/null)
                   |    +->| GATE  TIMER 1  |
                   |        ----------------
                   |
                   |        ----------------
                   |       |                |
                   |------>| CLOCK      OUT | ---------> Port 61h, bit 5
                           |                |      |
  Port 61h, bit 0 -------->| GATE  TIMER 2  |       \_.----   ____
                            ----------------         _|    )--|LPF|---Speaker
                                                    / *----   \___/
  Port 61h, bit 1 ---------------------------------/

The timer modes are now described.

Mode 0: Single Timeout.
 This is a one-shot software timeout that counts down
 when the gate is high (always true for timers 0 and 1).  When the count
 reaches zero, the output goes high.

Mode 1: Triggered One-shot.
 The output is initially set high.  When the gate
 line is set high, a countdown is initiated (which does not stop if the gate is
 lowered), during which the output is set low.  When the count reaches zero,
 the output goes high.

Mode 2: Rate Generator.
 The output is initially set high.  When the countdown
 reaches 1, the output goes low for one count and then returns high.  The value
 is reloaded and the countdown automatically resumes.  If the gate line goes
 low, the count is halted.  If the output is low when the gate is lowered, the
 output automatically goes high (this only affects timer 2).

Mode 3: Square Wave.
 This generates a high / low square wave.  The count
 determines the length of the pulse, which alternates between high and low
 when zero is reached.  The count only proceeds when gate is high and is
 automatically reloaded on reaching zero.  The count is decremented twice at
 each clock to generate a full high / low cycle at the full periodic rate.
 If the count is even, the clock remains high for N/2 counts and low for N/2
 counts; if the count is odd, the clock is high for (N+1)/2 counts and low
 for (N-1)/2 counts.  Only even values are latched by the counter, so odd
 values are not observed when reading.  This is the intended mode for timer 2,
 which generates sine-like tones by low-pass filtering the square wave output.

Mode 4: Software Strobe.
 After programming this mode and loading the counter,
 the output remains high until the counter reaches zero.  Then the output
 goes low for 1 clock cycle and returns high.  The counter is not reloaded.
 Counting only occurs when gate is high.

Mode 5: Hardware Strobe.
 After programming and loading the counter, the
 output remains high.  When the gate is raised, a countdown is initiated
 (which does not stop if the gate is lowered).  When the counter reaches zero,
 the output goes low for 1 clock cycle and then returns high.  The counter is
 not reloaded.
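
The even / odd rule given for mode 3 can be sketched as a small helper.  The
struct and function names are hypothetical and exist only for illustration;
the split itself follows the N/2 and (N+1)/2, (N-1)/2 description above.

```c
#include <stdint.h>

/* Length of each half of a mode 3 square wave, in input clock counts. */
struct pit_square_wave {
    uint32_t high_counts;
    uint32_t low_counts;
};

struct pit_square_wave pit_mode3_phases(uint32_t n)
{
    struct pit_square_wave w;

    w.high_counts = (n + 1) / 2;   /* N/2 if N is even, (N+1)/2 if odd */
    w.low_counts  = n / 2;         /* N/2 if N is even, (N-1)/2 if odd */
    return w;
}
```

The two halves always add up to N, so the output period is N input clocks
regardless of whether the duty cycle is exactly 50%.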

In addition to normal binary counting, the PIT supports BCD counting.  The
command port, 0x43, is used to set the counter and mode for each of the three
timers.

PIT commands are issued to port 0x43 using the following bit encoding::

  Bit 7-4: Command (See table below)
  Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
  Bit 0  : Binary (0) / BCD (1)

Command table::

  0000 - Latch Timer 0 count for port 0x40
	sample and hold the count to be read in port 0x40;
	additional commands ignored until counter is read;
	mode bits ignored.

  0001 - Set Timer 0 LSB mode for port 0x40
	set timer to read LSB only and force MSB to zero;
	mode bits set timer mode

  0010 - Set Timer 0 MSB mode for port 0x40
	set timer to read MSB only and force LSB to zero;
	mode bits set timer mode

  0011 - Set Timer 0 16-bit mode for port 0x40
	set timer to read / write LSB first, then MSB;
	mode bits set timer mode

  0100 - Latch Timer 1 count for port 0x41 - as described above
  0101 - Set Timer 1 LSB mode for port 0x41 - as described above
  0110 - Set Timer 1 MSB mode for port 0x41 - as described above
  0111 - Set Timer 1 16-bit mode for port 0x41 - as described above

  1000 - Latch Timer 2 count for port 0x42 - as described above
  1001 - Set Timer 2 LSB mode for port 0x42 - as described above
  1010 - Set Timer 2 MSB mode for port 0x42 - as described above
  1011 - Set Timer 2 16-bit mode for port 0x42 - as described above

  1101 - General counter latch
	Latch combination of counters into corresponding ports
	Bit 3 = Counter 2
	Bit 2 = Counter 1
	Bit 1 = Counter 0
	Bit 0 = Unused

  1110 - Latch timer status
	Latch combination of counter mode into corresponding ports
	Bit 3 = Counter 2
	Bit 2 = Counter 1
	Bit 1 = Counter 0

	The output of ports 0x40-0x42 following this command will be:

	Bit 7 = Output pin
	Bit 6 = Count loaded (0 if timer has expired)
	Bit 5-4 = Read / Write mode
	    01 = LSB only
	    10 = MSB only
	    11 = LSB / MSB (16-bit)
	Bit 3-1 = Mode
	Bit 0 = Binary (0) / BCD mode (1)
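
To make the bit layout concrete, here is a minimal sketch that composes a
command byte and decodes a latched status byte.  All identifiers are
hypothetical.  The code treats the command nibble as a timer select in bits
7-6 plus an access mode in bits 5-4, which is how the per-timer entries of the
command table line up (0011, 0111 and 1011 are the same access mode for timers
0, 1 and 2).

```c
#include <stdbool.h>
#include <stdint.h>

/* Access mode, i.e. the low two bits of the command nibble. */
enum pit_access {
    PIT_LATCH   = 0,   /* xx00: latch count */
    PIT_LSB     = 1,   /* xx01: LSB only */
    PIT_MSB     = 2,   /* xx10: MSB only */
    PIT_LSB_MSB = 3    /* xx11: LSB first, then MSB */
};

/* Build the byte written to port 0x43 for timer 0..2. */
uint8_t pit_command(unsigned timer, enum pit_access access,
                    unsigned mode, bool bcd)
{
    return (uint8_t)(((timer & 3u) << 6) |
                     (((unsigned)access & 3u) << 4) |
                     ((mode & 7u) << 1) |
                     (bcd ? 1u : 0u));
}

/* Fields of the status byte read back after a "latch timer status". */
struct pit_status {
    bool output_high;   /* bit 7 */
    bool bit6;          /* bit 6, "count loaded" in the table */
    unsigned access;    /* bits 5-4 */
    unsigned mode;      /* bits 3-1 */
    bool bcd;           /* bit 0 */
};

struct pit_status pit_decode_status(uint8_t s)
{
    struct pit_status st;

    st.output_high = (s & 0x80u) != 0;
    st.bit6        = (s & 0x40u) != 0;
    st.access      = (s >> 4) & 3u;
    st.mode        = (s >> 1) & 7u;
    st.bcd         = (s & 0x01u) != 0;
    return st;
}
```

For example, timer 0 in 16-bit access with mode 2 and binary counting encodes
as 0x34, and timer 2 in 16-bit access with mode 3 encodes as 0xB6.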

2.2. RTC
--------

The second device which was available in the original PC was the MC146818 real
time clock.  The original device is now obsolete, and usually emulated by the
system chipset, sometimes by an HPET and some frankenstein IRQ routing.

The RTC is accessed through CMOS variables, which use an index register to
control which bytes are read.  Since there is only one index register, reads
of the CMOS and reads of the RTC require lock protection (in addition, it is
dangerous to allow userspace utilities such as hwclock to have direct RTC
access, as they could corrupt kernel reads and writes of CMOS memory).

The RTC generates an interrupt which is usually routed to IRQ 8.  The interrupt
can function as a periodic timer, an additional once a day alarm, and can issue
interrupts after an update of the CMOS registers by the MC146818 is complete.
The type of interrupt is signalled in the RTC status registers.

The RTC will update the current time fields by battery power even while the
system is off.  The current time fields should not be read while an update is
in progress, as indicated in the status register.

The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
programmed to a 32kHz divider if the RTC is to count seconds.

This is the RAM map originally used for the RTC/CMOS::

  Location    Size    Description
  ------------------------------------------
  00h         byte    Current second (BCD)
  01h         byte    Seconds alarm (BCD)
  02h         byte    Current minute (BCD)
  03h         byte    Minutes alarm (BCD)
  04h         byte    Current hour (BCD)
  05h         byte    Hours alarm (BCD)
  06h         byte    Current day of week (BCD)
  07h         byte    Current day of month (BCD)
  08h         byte    Current month (BCD)
  09h         byte    Current year (BCD)
  0Ah         byte    Register A
                       bit 7   = Update in progress
                       bit 6-4 = Divider for clock
                                  000 = 4.194 MHz
                                  001 = 1.049 MHz
                                  010 = 32 kHz
                                  10X = test modes
                                  110 = reset / disable
                                  111 = reset / disable
                       bit 3-0 = Rate selection for periodic interrupt
                                 0000 = periodic timer disabled
                                 0001 = 3.90625 mS
                                 0010 = 7.8125 mS
                                 0011 = .122070 mS
                                 0100 = .244141 mS
                                     ...
                                 1101 = 125 mS
                                 1110 = 250 mS
                                 1111 = 500 mS
  0Bh         byte    Register B
                       bit 7   = Run (0) / Halt (1)
                       bit 6   = Periodic interrupt enable
                       bit 5   = Alarm interrupt enable
                       bit 4   = Update-ended interrupt enable
                       bit 3   = Square wave enable
                       bit 2   = BCD calendar (0) / Binary (1)
                       bit 1   = 12-hour mode (0) / 24-hour mode (1)
                       bit 0   = 0 (DST off) / 1 (DST enabled)
  0Ch         byte    Register C (read only)
                       bit 7   = interrupt request flag (IRQF)
                       bit 6   = periodic interrupt flag (PF)
                       bit 5   = alarm interrupt flag (AF)
                       bit 4   = update interrupt flag (UF)
                       bit 3-0 = reserved
  0Dh         byte    Register D (read only)
                       bit 7   = RTC has power
                       bit 6-0 = reserved
  32h         byte    Current century BCD (*)
  (*) location vendor specific and now determined from ACPI global tables
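
Two pieces of arithmetic recur when working with this map: converting the BCD
calendar fields and turning the 4-bit rate selector of register A into a
periodic interrupt frequency.  The sketch below shows both.  The helper names
are hypothetical, and the rate formula assumes the MC146818 datasheet values
for a 32.768 kHz time base, where selectors 1 and 2 give 256 Hz and 128 Hz
(3.90625 ms and 7.8125 ms) and selectors 3 to 15 halve the rate each step
starting from 8192 Hz.

```c
#include <stdint.h>

/* Convert one packed BCD byte (e.g. 0x59) to its binary value (59). */
unsigned rtc_bcd_to_bin(uint8_t bcd)
{
    return (unsigned)(bcd >> 4) * 10u + (bcd & 0x0Fu);
}

/* Periodic interrupt frequency in Hz for a register A rate selector
 * (bits 3-0), assuming a 32.768 kHz time base.  Returns 0 when the
 * periodic timer is disabled. */
unsigned rtc_periodic_hz(unsigned rate)
{
    rate &= 0x0Fu;
    if (rate == 0)
        return 0;                     /* periodic timer disabled */
    if (rate <= 2)
        return 256u >> (rate - 1);    /* 1 -> 256 Hz, 2 -> 128 Hz */
    return 32768u >> (rate - 1);      /* 3 -> 8192 Hz ... 15 -> 2 Hz */
}
```

A full four-digit year is then `rtc_bcd_to_bin(century) * 100 +
rtc_bcd_to_bin(year)`, with the century byte read from whatever location the
platform reports.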

2.3. APIC
---------

On Pentium and later processors, an on-board timer is available to each CPU
as part of the Advanced Programmable Interrupt Controller.  The APIC is
accessed through memory-mapped registers and provides interrupt service to each
CPU, used for IPIs and local timer interrupts.

Although in theory the APIC is a safe and stable source for local interrupts,
in practice, many bugs and glitches have occurred due to the special nature of
the APIC CPU-local memory-mapped hardware.  Beware that CPU errata may affect
the use of the APIC and that workarounds may be required.  In addition, some of
these workarounds pose unique constraints for virtualization - requiring either
extra overhead incurred from extra reads of memory-mapped I/O or additional
functionality that may be more computationally expensive to implement.

Since the APIC is documented quite well in the Intel and AMD manuals, we will
avoid repetition of the detail here.  It should be pointed out that the APIC
timer is programmed through the LVT (local vector table) timer register, is
capable of one-shot or periodic operation, and is based on the bus clock
divided down by the programmable divider register.
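
As a sketch of the "bus clock divided down" relationship, the helpers below
turn a divider register value and an initial count into a timer period.  The
function names and the example bus frequency are hypothetical.  The divider
encoding (bits 3, 1 and 0 select divide-by 2 through 128, with 111 meaning
divide-by 1) is taken from the Intel manual and should be checked against the
manual for the processor in question.

```c
#include <stdint.h>

/* Decode the divide configuration value: bits 3, 1 and 0 form a 3-bit
 * selector; 0..6 mean divide by 2..128, and 7 means divide by 1. */
unsigned apic_timer_divisor(uint32_t dcr)
{
    unsigned sel = (dcr & 0x3u) | ((dcr >> 1) & 0x4u);

    return sel == 7 ? 1u : 2u << sel;
}

/* Time for the timer to count `initial_count` down to zero, in
 * nanoseconds.  The division is split to avoid 64-bit overflow. */
uint64_t apic_timer_period_ns(uint64_t bus_hz, uint32_t dcr,
                              uint32_t initial_count)
{
    uint64_t ticks;

    if (bus_hz == 0)
        return 0;   /* unknown bus clock: no meaningful answer */
    ticks = (uint64_t)initial_count * apic_timer_divisor(dcr);
    return (ticks / bus_hz) * 1000000000ull +
           (ticks % bus_hz) * 1000000000ull / bus_hz;
}
```

With an assumed 100 MHz bus clock, a divide-by-16 setting and an initial count
of 62500, the period works out to 10 ms.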

2.4. HPET
---------

HPET is quite complex, and was originally intended to replace the PIT / RTC
support of the X86 PC.  It remains to be seen whether that will be the case, as
the de facto standard of PC hardware is to emulate these older devices.  Some
systems designated as legacy free may support only the HPET as a hardware timer
device.

The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
but allowing implementation freedom to support many more.  It also imposes no
fixed rate on the timer frequency, but does impose some extremal values on
frequency, error and slew.

In general, the HPET is recommended as a high precision (compared to PIT / RTC)
time source which is independent of local variation (as there is only one HPET
in any given system).  The HPET is also memory-mapped, and its presence is
indicated through ACPI tables by the BIOS.

Detailed specification of the HPET is beyond the current scope of this
document, as it is also very well documented elsewhere.

2.5. Offboard Timers
--------------------

Several cards, both proprietary (watchdog boards) and commonplace (e1000), have
timing chips built into them, with registers which may be accessible to kernel
or user drivers.  To the author's knowledge, using these to generate
a clocksource for a Linux or other kernel has not yet been attempted and is in
general frowned upon as not playing by the agreed rules of the game.  Such a
timer device would require additional support to be virtualized properly and is
not considered important at this time as no known operating system does this.

3. TSC Hardware
===============

The TSC or time stamp counter is relatively simple in theory; it counts
instruction cycles issued by the processor, which can be used as a measure of
time.  In practice, due to a number of problems, it is the most complicated
timekeeping device to use.

The TSC is represented internally as a 64-bit MSR which can be read with the
RDMSR, RDTSC, or RDTSCP (when available) instructions.  The TSC has long been
writable, but on older hardware it was generally only possible to write the
low 32 bits of the 64-bit counter, and the upper 32 bits of the counter were
cleared.  Now, however, on Intel processors of family 0Fh, models 3, 4 and 6,
and family 06h, models e and f, this restriction has been lifted and all
64 bits are writable.  On AMD systems, the ability to write the TSC MSR is not
an architectural guarantee.

The TSC is accessible from CPL-0 and, conditionally, from CPL > 0 software.
Access is controlled by the CR4.TSD bit which, when set, disables CPL > 0 TSC
access.

Some vendors have implemented an additional instruction, RDTSCP, which returns
atomically not just the TSC, but an indicator which corresponds to the
processor number.  This can be used to index into an array of TSC variables to
determine offset information in SMP systems where TSCs are not synchronized.
The presence of this instruction must be determined by consulting CPUID feature
bits.
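
The array lookup described for RDTSCP can be sketched as follows.  This is an
illustration under stated assumptions, not how any particular kernel does it:
the `tsc_sample` struct stands in for the counter value and processor
indicator returned together by the instruction, and the per-CPU offset table
is hypothetical data that system software would have to measure and maintain
itself.

```c
#include <stdint.h>

/* One atomic reading: the counter and the processor it was read on. */
struct tsc_sample {
    uint64_t tsc;
    uint32_t cpu;
};

/* Apply the per-CPU correction so that readings taken on different
 * CPUs become comparable.  Offsets are signed; the addition relies on
 * unsigned wraparound to handle negative values. */
uint64_t tsc_normalize(struct tsc_sample s, const int64_t *offsets,
                       uint32_t ncpus)
{
    if (s.cpu >= ncpus)
        return s.tsc;   /* unknown CPU: no correction available */
    return s.tsc + (uint64_t)offsets[s.cpu];
}
```

Because the counter and the processor indicator are obtained in a single
instruction, the lookup cannot pair a TSC value with the offset of a different
CPU if the thread migrates between the two reads.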

Both VMX and SVM provide extension fields in the virtualization hardware which
allow the guest-visible TSC to be offset by a constant.  Newer implementations
promise to allow the TSC to additionally be scaled, but this hardware is not
yet widely available.
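
The offset and scaling arithmetic can be sketched in a few lines.  The function
names are hypothetical, and the 32.32 fixed-point ratio is an assumption made
for the example; the text above does not specify how real hardware encodes a
scaling ratio, and the order in which scaling and offset are applied here is
likewise only one plausible choice.

```c
#include <stdint.h>

/* Offset that makes the guest read `guest_now` at the moment the host
 * counter reads `host_now`.  Unsigned wraparound is intentional. */
uint64_t tsc_offset_for(uint64_t host_now, uint64_t guest_now)
{
    return guest_now - host_now;
}

/* Guest-visible TSC with a constant offset only. */
uint64_t guest_tsc(uint64_t host_tsc, uint64_t offset)
{
    return host_tsc + offset;
}

/* Guest-visible TSC with scaling applied before the offset.  `ratio`
 * is an assumed 32.32 fixed-point multiplier (1.0 == 1ull << 32).  The
 * multiplication is split so it stays exact in 64-bit arithmetic. */
uint64_t guest_tsc_scaled(uint64_t host_tsc, uint64_t ratio,
                          uint64_t offset)
{
    uint64_t ratio_int  = ratio >> 32;
    uint64_t ratio_frac = ratio & 0xFFFFFFFFull;
    uint64_t hi = host_tsc >> 32;
    uint64_t lo = host_tsc & 0xFFFFFFFFull;
    uint64_t scaled = host_tsc * ratio_int + hi * ratio_frac +
                      ((lo * ratio_frac) >> 32);

    return scaled + offset;
}
```

A guest that should see its TSC start from zero is given the offset
`tsc_offset_for(host_now, 0)`, which is simply the negated host value.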

3.1. TSC synchronization
------------------------

The TSC is a CPU-local clock in most implementations.  This means, on SMP
platforms, the TSCs of different CPUs may start at different times depending
on when the CPUs are powered on.  Generally, CPUs on the same die will share
the same clock; however, this is not always the case.

The BIOS may attempt to resynchronize the TSCs during the poweron process and
the operating system or other system software may attempt to do this as well.
Several hardware limitations make the problem worse - if it is not possible to
write the full 64 bits of the TSC, it may be impossible to match the TSC in
newly arriving CPUs to that of the rest of the system, resulting in
unsynchronized TSCs.  Resynchronization may be attempted by BIOS or system
software, but in practice, getting a perfectly synchronized TSC will not be
possible unless all values are read from the same clock, which generally is
only possible on single socket systems or those with special hardware support.

3.2. TSC and CPU hotplug
------------------------

As touched on already, CPUs which arrive later than the boot time of the system
may not have a TSC value that is synchronized with the rest of the system.
Either system software, BIOS, or SMM code may actually try to establish the TSC
to a value matching the rest of the system, but a perfect match is usually not
guaranteed.  This can have the effect of bringing a system from a state where
TSC is synchronized back to a state where TSC synchronization flaws, however
small, may be exposed to the OS and any virtualization environment.

3.3. TSC and multi-socket / NUMA
--------------------------------

Multi-socket systems, especially large multi-socket systems, are likely to have
individual clocksources rather than a single, universally distributed clock.
Since these clocks are driven by different crystals, they will not have
perfectly matched frequency, and temperature and electrical variations will
cause the CPU clocks, and thus the TSCs, to drift over time.  Depending on the
exact clock and bus design, the drift may or may not be fixed in absolute
error, and may accumulate over time.
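
To give a feel for how quickly a small frequency mismatch accumulates, the
sketch below converts a relative error in parts per million into a cycle
count.  The function name and the figures in the usage note are hypothetical;
no particular crystal tolerance is implied.

```c
#include <stdint.h>

/* Cycles by which two TSCs diverge after `seconds`, given a nominal
 * frequency and a constant relative frequency error in parts per
 * million.  Assumes nominal_hz * ppm fits in 64 bits. */
uint64_t tsc_drift_cycles(uint64_t nominal_hz, uint32_t ppm,
                          uint64_t seconds)
{
    uint64_t cycles_per_second = nominal_hz * ppm / 1000000u;

    return cycles_per_second * seconds;
}
```

For two nominally 2.4 GHz clocks that differ by 10 ppm, the counters move apart
by 86,400,000 cycles per hour, which corresponds to 36 ms of apparent time.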

In addition, very large systems may deliberately slew the clocks of individual
cores.  This technique, known as spread-spectrum clocking, reduces EMI at the
clock frequency and harmonics of it, which may be required to pass FCC
standards for telecommunications and computer equipment.

It is recommended not to trust the TSCs to remain synchronized on NUMA or
multiple socket systems for these reasons.

3.4. TSC and C-states
---------------------

C-states, or idling states of the processor, especially C1E and deeper sleep
states, may be problematic for TSC as well.  The TSC may stop advancing in such
a state, resulting in a TSC which is behind that of other CPUs when execution
is resumed.  Such CPUs must be detected and flagged by the operating system
based on CPU and chipset identifications.

The TSC in such a case may be corrected by catching it up to a known external
clocksource.

3.5. TSC frequency change / P-states
------------------------------------

To make things slightly more interesting, some CPUs may change frequency.  They
may or may not run the TSC at the same rate, and because the frequency change
may be staggered or slewed, at some points in time, the TSC rate may not be
known other than falling within a range of values.  In this case, the TSC will
not be a stable time source, and must be calibrated against a known, stable,
external clock to be a usable source of time.
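
Calibration against an external clock comes down to counting TSC cycles over
an interval measured by the reference.  The sketch below shows only that
arithmetic; the function names are hypothetical, and how the two deltas are
sampled (for example, around a PIT or HPET interval) is left out.

```c
#include <stdint.h>

/* Estimated TSC frequency in Hz, given the number of TSC cycles and the
 * number of reference clock ticks observed over the same interval.
 * Assumes tsc_delta * ref_hz fits in 64 bits, which holds for short
 * calibration windows. */
uint64_t tsc_hz_from_calibration(uint64_t tsc_delta, uint64_t ref_delta,
                                 uint64_t ref_hz)
{
    if (ref_delta == 0)
        return 0;   /* no reference time elapsed: cannot calibrate */
    return tsc_delta * ref_hz / ref_delta;
}

/* Convert a cycle count to nanoseconds at a calibrated frequency.  The
 * division is split to avoid overflow for large cycle counts. */
uint64_t tsc_cycles_to_ns(uint64_t cycles, uint64_t tsc_hz)
{
    if (tsc_hz == 0)
        return 0;
    return (cycles / tsc_hz) * 1000000000ull +
           (cycles % tsc_hz) * 1000000000ull / tsc_hz;
}
```

Any frequency change during the calibration window makes the result an average
over the window, which is why the calibration has to be repeated or invalidated
when the rate is known to have changed.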
425*4882a593Smuzhiyun
426*4882a593SmuzhiyunWhether the TSC runs at a constant rate or scales with the P-state is model
427*4882a593Smuzhiyundependent and must be determined by inspecting CPUID, chipset or vendor
428*4882a593Smuzhiyunspecific MSR fields.
429*4882a593Smuzhiyun
430*4882a593SmuzhiyunIn addition, some vendors have known bugs where the P-state is actually
431*4882a593Smuzhiyuncompensated for properly during normal operation, but when the processor is
432*4882a593Smuzhiyuninactive, the P-state may be raised temporarily to service cache misses from
433*4882a593Smuzhiyunother processors.  In such cases, the TSC on halted CPUs could advance faster
434*4882a593Smuzhiyunthan that of non-halted processors.  AMD Turion processors are known to have
435*4882a593Smuzhiyunthis problem.
436*4882a593Smuzhiyun
437*4882a593Smuzhiyun3.6. TSC and STPCLK / T-states
438*4882a593Smuzhiyun------------------------------
439*4882a593Smuzhiyun
440*4882a593SmuzhiyunExternal signals given to the processor may also have the effect of stopping
441*4882a593Smuzhiyunthe TSC.  This is typically done for thermal emergency power control to prevent
442*4882a593Smuzhiyunan overheating condition, and typically, there is no way to detect that this
443*4882a593Smuzhiyuncondition has happened.
444*4882a593Smuzhiyun
445*4882a593Smuzhiyun3.7. TSC virtualization - VMX
446*4882a593Smuzhiyun-----------------------------
447*4882a593Smuzhiyun
448*4882a593SmuzhiyunVMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
449*4882a593Smuzhiyuninstructions, which is enough for full virtualization of TSC in any manner.  In
450*4882a593Smuzhiyunaddition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
451*4882a593Smuzhiyunfield specified in the VMCS.  Special instructions must be used to read and
452*4882a593Smuzhiyunwrite the VMCS field.
453*4882a593Smuzhiyun
454*4882a593Smuzhiyun3.8. TSC virtualization - SVM
455*4882a593Smuzhiyun-----------------------------
456*4882a593Smuzhiyun
457*4882a593SmuzhiyunSVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
458*4882a593Smuzhiyuninstructions, which is enough for full virtualization of TSC in any manner.  In
459*4882a593Smuzhiyunaddition, SVM allows passing through the host TSC plus an additional offset
460*4882a593Smuzhiyunfield specified in the SVM control block.
461*4882a593Smuzhiyun
462*4882a593Smuzhiyun3.9. TSC feature bits in Linux
463*4882a593Smuzhiyun------------------------------
464*4882a593Smuzhiyun
465*4882a593SmuzhiyunIn summary, there is no way to guarantee the TSC remains in perfect
466*4882a593Smuzhiyunsynchronization unless it is explicitly guaranteed by the architecture.  Even
467*4882a593Smuzhiyunif so, the TSCs in multi-sockets or NUMA systems may still run independently
468*4882a593Smuzhiyundespite being locally consistent.
469*4882a593Smuzhiyun
470*4882a593SmuzhiyunThe following feature bits are used by Linux to signal various TSC attributes,
471*4882a593Smuzhiyunbut they can only be taken to be meaningful for UP or single node systems.

=========================  =======================================
X86_FEATURE_TSC            The TSC is available in hardware
X86_FEATURE_RDTSCP         The RDTSCP instruction is available
X86_FEATURE_CONSTANT_TSC   The TSC rate is unchanged with P-states
X86_FEATURE_NONSTOP_TSC    The TSC does not stop in C-states
X86_FEATURE_TSC_RELIABLE   TSC sync checks are skipped (VMware)
=========================  =======================================
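
Some of these attributes correspond to CPUID bits that software can inspect
directly.  The sketch below decodes already-fetched CPUID register values
rather than executing CPUID itself; the struct and function names are
hypothetical.  It covers only the hardware-reported bits: the TSC_RELIABLE flag
is set by the kernel itself and has no CPUID bit, and the kernel may set
CONSTANT_TSC and NONSTOP_TSC from the invariant TSC bit or from other
vendor-specific knowledge.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative decoder; struct and function names are hypothetical. */
struct tsc_caps {
	bool has_tsc;        /* CPUID.01H:EDX[4] */
	bool has_rdtscp;     /* CPUID.80000001H:EDX[27] */
	bool invariant_tsc;  /* CPUID.80000007H:EDX[8] */
};

/* Decode the EDX outputs of CPUID leaves 1, 0x80000001 and 0x80000007. */
static struct tsc_caps tsc_caps_decode(uint32_t leaf1_edx,
				       uint32_t ext1_edx,
				       uint32_t ext7_edx)
{
	struct tsc_caps caps = {
		.has_tsc       = (leaf1_edx >> 4) & 1,
		.has_rdtscp    = (ext1_edx >> 27) & 1,
		.invariant_tsc = (ext7_edx >> 8) & 1,
	};
	return caps;
}
```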

4. Virtualization Problems
==========================

Timekeeping is especially problematic for virtualization because a number of
challenges arise.  The most obvious problem is that time is now shared between
the host and, potentially, a number of virtual machines.  Thus the virtual
operating system does not run with 100% usage of the CPU, despite the fact that
it may very well make that assumption.  It may expect that assumption to hold
to very exacting bounds when interrupt sources are disabled, but in reality
only its virtual interrupt sources are disabled, and the machine may still be
preempted at any time.  This causes problems because the guest's view of the
passage of time, the injection of machine interrupts and the associated clock
sources are no longer completely synchronized with real time.

The same problem can occur on native hardware to a degree, as SMM may steal
cycles from the operating system on X86 systems when it is used by the BIOS,
though not in such an extreme fashion.  However, the fact that SMM may cause
problems similar to those of virtualization is a good justification for
solving many of these problems on bare metal.

4.1. Interrupt clocking
-----------------------

One of the most immediate problems that occurs with legacy operating systems
is that the system timekeeping routines are often designed to keep track of
time by counting periodic interrupts.  These interrupts may come from the PIT
or the RTC, but the problem is the same: the host virtualization engine may not
be able to deliver the proper number of interrupts per second, and so guest
time may fall behind.  This is especially problematic if a high interrupt rate
is selected, such as 1000 Hz, which is unfortunately the default for many Linux
guests.

There are three approaches to solving this problem.  First, it may be possible
to simply ignore it.  Guests which have a separate time source for tracking
'wall clock' or 'real time' may not need any adjustment of their interrupts to
maintain proper time.  Second, if this is not sufficient, it may be necessary
to inject additional interrupts into the guest in order to increase the
effective interrupt rate.  This approach leads to complications in extreme
conditions, where host load or guest lag is too much to compensate for, and
thus a third solution has arisen: the guest may need to become aware of lost
ticks and compensate for them internally.  Although promising in theory, the
implementation of this policy in Linux has been extremely error prone, and a
number of buggy variants of lost tick compensation are distributed across
commonly used Linux systems.
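
The second approach, injecting additional interrupts, can be pictured as
simple bookkeeping on the host side.  The following is a sketch under stated
assumptions, not how KVM implements it: the structure, its fields and the
``max_burst`` cap are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one periodic guest timer (not KVM code). */
struct tick_state {
	uint64_t period_ns;  /* programmed interrupt period */
	uint64_t start_ns;   /* host time when the timer was armed */
	uint64_t delivered;  /* interrupts injected so far */
};

/* Ticks the guest should have seen by now_ns but has not yet received. */
static uint64_t ticks_owed(const struct tick_state *t, uint64_t now_ns)
{
	assert(t->period_ns > 0 && now_ns >= t->start_ns);
	uint64_t due = (now_ns - t->start_ns) / t->period_ns;
	return due > t->delivered ? due - t->delivered : 0;
}

/*
 * Inject at most max_burst catch-up ticks per call and return the number
 * injected.  The cap is the policy knob: too small and the guest never
 * catches up, too large and it sees a burst of back-to-back interrupts.
 */
static uint64_t catch_up(struct tick_state *t, uint64_t now_ns,
			 uint64_t max_burst)
{
	uint64_t owed = ticks_owed(t, now_ns);
	uint64_t n = owed < max_burst ? owed : max_burst;

	t->delivered += n;
	return n;
}
```

With a 1 ms period and 5.5 ms elapsed, five ticks are owed; a burst cap of two
delivers two of them and leaves three outstanding for later calls.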

Windows uses periodic RTC clocking as a means of keeping time internally, and
thus requires interrupt slewing to keep proper time.  However, it uses a low
enough rate (ed: is it 18.2 Hz?) that this has not yet been a problem in
practice.

4.2. TSC sampling and serialization
-----------------------------------

As the highest precision time source available, the cycle counter of the CPU
has aroused much interest from developers.  As explained above, this timer has
many problems unique to its nature as a local, potentially unstable and
potentially unsynchronized source.  One issue which is not unique to the TSC,
but is highlighted because of its very precise nature, is sampling delay.  By
definition, the counter, once read, is already old.  However, it is also
possible for the counter to be read ahead of the actual use of the result.
This is a consequence of the superscalar execution of the instruction stream,
which may execute instructions out of order.  Such execution is called
non-serialized.  Forcing serialized execution is necessary for precise
measurement with the TSC, and requires a serializing instruction, such as CPUID
or an MSR write (WRMSR).

Since CPUID may actually be virtualized by a trap-and-emulate mechanism, this
serialization can pose a performance issue for hardware virtualization.  An
accurate time stamp counter reading may therefore not always be available, and
it may be necessary for an implementation to guard against "backwards" reads of
the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
system.
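
One way to guard against "backwards" reads is to remember the newest value
handed out and never return anything older.  The sketch below is a
single-threaded illustration with a hypothetical function name; a real SMP
implementation would need an atomic compare-and-exchange on the shared value,
which is omitted here.

```c
#include <stdint.h>

/*
 * Hypothetical guard (single-threaded sketch): never return a value lower
 * than one already handed out.
 */
static uint64_t tsc_clamp_monotonic(uint64_t *last, uint64_t raw)
{
	if (raw < *last)
		return *last;	/* stale read: report the newest value seen */
	*last = raw;
	return raw;
}
```

The cost of this design is that a stale read repeats the previous value, so
callers see time standing still briefly instead of going backwards.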

4.3. Timespec aliasing
----------------------

Additionally, this lack of serialization from the TSC poses another challenge
when using results of the TSC when measured against another time source.  As
the TSC has much higher precision, many possible values of the TSC may be read
while another clock is still expressing the same value.

That is, you may read any TSC value in the range (T, T+10) while external
clock C maintains the same value.  Due to non-serialized reads, you may
actually end up with a range which fluctuates, such as (T-1, T+10).  Thus, any
time calculated from a TSC, but calibrated against an external value, may have
a range of valid values.  Re-calibrating this computation may actually cause
time, as computed after the calibration, to go backwards, compared with time
computed before the calibration.
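
The backwards step after recalibration can be reproduced with a toy model.
The sketch below is illustrative only: the names are hypothetical, the
external clock is assumed to advance in fixed steps, and one TSC cycle is
assumed to equal one nanosecond to keep the numbers small.

```c
#include <stdint.h>

/* Hypothetical calibration pair: external clock value seen at a TSC reading. */
struct calib {
	uint64_t base_tsc;
	uint64_t base_ns;
};

/* Coarse external clock that only advances in steps of gran_ns. */
static uint64_t coarse_clock(uint64_t true_ns, uint64_t gran_ns)
{
	return true_ns - true_ns % gran_ns;
}

/* Time derived from the TSC; assumes 1 cycle == 1 ns. */
static uint64_t tsc_to_ns(const struct calib *c, uint64_t tsc)
{
	return c->base_ns + (tsc - c->base_tsc);
}
```

Calibrating at TSC 1000 against a clock with 1000 ns granularity and then
reading at TSC 1900 yields 1900 ns.  Recalibrating at TSC 1950, while the
coarse clock still reads 1000, and reading at TSC 1960 yields 1010 ns, which
is earlier than the value computed before the recalibration.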

This problem is particularly pronounced with an internal time source in Linux,
the kernel time, which is expressed in the theoretically high resolution
timespec, but which advances in much coarser intervals, sometimes at the rate
of jiffies and possibly, in catchup modes, in much larger steps.

This aliasing requires care in the computation and recalibration of kvmclock
and any other values derived from TSC computation (such as TSC virtualization
itself).

4.4. Migration
--------------

Migration of a virtual machine raises problems for timekeeping in two ways.
First, the migration itself may take time, during which interrupts cannot be
delivered, and after which, the guest time may need to be caught up.  NTP may
be able to help to some degree here, as the clock correction required is
typically small enough to fall in the NTP-correctable window.

An additional concern is that timers based on the TSC (or HPET, if the raw bus
clock is exposed) may now be running at different rates, requiring compensation
in some way in the hypervisor by virtualizing these timers.  In addition,
migrating to a faster machine may preclude the use of a passthrough TSC, as a
faster clock cannot be made visible to a guest without the potential of time
advancing faster than usual.  A slower clock is less of a problem, as it can
always be caught up to the original rate.  KVM clock avoids these problems by
simply storing multipliers and offsets against the TSC for the guest to convert
back into nanosecond resolution values.
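
The conversion from TSC cycles to nanoseconds using a stored multiplier and
offset can be sketched as follows.  This is a simplified illustration, not the
kvmclock implementation: the field names echo those of the kvmclock ABI, but
the structure layout here is hypothetical and the version/update protocol a
real guest must follow is omitted.

```c
#include <stdint.h>

/*
 * Simplified, hypothetical parameter block.  Field names echo the kvmclock
 * ABI, but the real layout and versioning protocol are omitted.
 */
struct clock_params {
	uint64_t tsc_timestamp;      /* TSC value when the block was written */
	uint64_t system_time_ns;     /* guest time in ns at tsc_timestamp */
	uint32_t tsc_to_system_mul;  /* 32-bit binary fraction multiplier */
	int8_t   tsc_shift;          /* pre-shift applied to the TSC delta */
};

/* (delta * mul) >> 32, computed without 128-bit arithmetic. */
static uint64_t mul_frac32(uint64_t delta, uint32_t mul)
{
	return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Convert a raw TSC reading into nanoseconds of guest time. */
static uint64_t clock_read_ns(const struct clock_params *p, uint64_t tsc)
{
	uint64_t delta = tsc - p->tsc_timestamp;

	if (p->tsc_shift < 0)
		delta >>= -p->tsc_shift;
	else
		delta <<= p->tsc_shift;
	return p->system_time_ns + mul_frac32(delta, p->tsc_to_system_mul);
}
```

After a migration to a host with a different TSC rate, only the parameter
block has to change.  For instance, a multiplier of ``0x80000000`` (one half)
with a shift of 0 suits a 2 GHz TSC, and the same multiplier with a shift of
-1 suits a 4 GHz TSC; the guest keeps computing nanoseconds the same way.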

4.5. Scheduling
---------------

Since scheduling may be based on precise timing and firing of interrupts, the
scheduling algorithms of an operating system may be adversely affected by
virtualization.  In theory, the effect is random and should be uniformly
distributed, but in contrived as well as real scenarios (guest device access,
causes of virtualization exits, possible context switch), this may not always
be the case.  The effect of this has not been well studied.

In an attempt to work around this, several implementations have provided a
paravirtualized scheduler clock, which reveals the true amount of CPU time for
which a virtual machine has been running.
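
One way such a clock can be derived is by subtracting the time the host spent
running something else from the elapsed wall time.  The sketch below assumes
the hypervisor exposes a cumulative "stolen time" counter; that assumption,
the structure and the function name are illustrative and not taken from any
particular implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-vCPU snapshot; names are illustrative only. */
struct cpu_sample {
	uint64_t wall_ns;   /* elapsed real time as seen by the guest clock */
	uint64_t steal_ns;  /* cumulative time the host ran something else */
};

/* CPU time the vCPU actually received between two snapshots. */
static uint64_t ran_ns(const struct cpu_sample *prev,
		       const struct cpu_sample *cur)
{
	uint64_t wall = cur->wall_ns - prev->wall_ns;
	uint64_t steal = cur->steal_ns - prev->steal_ns;

	assert(steal <= wall);
	return wall - steal;
}
```

If 10 ms of wall time pass between two snapshots and 4 ms of that were stolen,
the guest scheduler should charge the running task 6 ms rather than 10 ms.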

4.6. Watchdogs
--------------

Watchdog timers, such as the lockup detector in Linux, may fire accidentally
when running under hardware virtualization due to timer interrupts being
delayed or misinterpretation of the passage of real time.  Usually, these
warnings are spurious and can be ignored, but in some circumstances it may be
necessary to disable such detection.

4.7. Delays and precision timing
--------------------------------

Precise timing and delays may not be possible in a virtualized system.  This
matters if the system is controlling physical hardware, or issues delays to
compensate for slower I/O to and from devices.  The first issue is not solvable
in general for a virtualized system; hardware control software can't be
adequately virtualized without a full real-time operating system, which would
require an RT-aware virtualization platform.

The second issue may cause performance problems, but this is unlikely to be a
significant issue.  In many cases these delays may be eliminated through
configuration or paravirtualization.

4.8. Covert channels and leaks
------------------------------

In addition to the above problems, time information will inevitably leak to the
guest about the host in anything but a perfect implementation of virtualized
time.  This may allow the guest to infer the presence of a hypervisor (as in a
red-pill type detection), and it may allow information to leak between guests
by using CPU utilization itself as a signalling channel.  Preventing such
problems would require completely isolated virtual time which may not track
real time any longer.  This may be useful in certain security or QA contexts,
but in general isn't recommended for real-world deployment scenarios.