Credit-Based CPU Scheduler
The credit scheduler is a proportional fair share CPU scheduler built from the ground up to be work conserving on SMP hosts. It is now the default scheduler in the xen-unstable trunk. The SEDF and BVT schedulers are still optionally available but the plan of record is for them to be phased out and eventually removed.
Each domain (including Host OS) is assigned a weight and a cap.
A domain with a weight of 512 will get twice as much CPU as a domain with a weight of 256 on a contended host. Legal weights range from 1 to 65535 and the default is 256.
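The proportionality can be illustrated with a small calculation. This is only a sketch of the arithmetic, not scheduler code: the function name `cpu_shares` and the domain names are hypothetical, and it assumes every domain is continuously runnable on a contended host.

```python
def cpu_shares(weights):
    """Return each domain's fraction of contended CPU time, proportional to its weight.

    `weights` maps a (hypothetical) domain name to its scheduler weight.
    """
    for name, weight in weights.items():
        # Legal weights range from 1 to 65535.
        if not 1 <= weight <= 65535:
            raise ValueError(f"illegal weight {weight} for {name}")
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


shares = cpu_shares({"dom-a": 512, "dom-b": 256})
for name, share in shares.items():
    print(f"{name}: {share:.1%} of contended CPU time")
```

With weights 512 and 256, the first domain's share is twice the second's, whatever the absolute values are.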
The cap optionally fixes the maximum amount of CPU a domain will be able to consume, even if the host system has idle CPU cycles. The cap is expressed as a percentage of one physical CPU: 100 is 1 physical CPU, 50 is half a CPU, 400 is 4 CPUs, and so on. The default, 0, means there is no upper cap.
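As a rough illustration of how a cap value maps to CPU time, the following sketch converts a cap into a number of physical CPUs and applies it to a domain's demand. The helper names `cap_to_cpus` and `limit_cpu` are hypothetical and exist only for this example.

```python
def cap_to_cpus(cap):
    """Convert a cap (percent of one physical CPU) to a CPU count.

    Returns None when the cap is 0, which means "no upper cap".
    """
    if cap < 0:
        raise ValueError("cap must not be negative")
    return None if cap == 0 else cap / 100.0


def limit_cpu(demand_cpus, cap):
    """Return the CPU time a domain may consume given its demand and its cap."""
    limit = cap_to_cpus(cap)
    return demand_cpus if limit is None else min(demand_cpus, limit)


# A domain that could use 3 CPUs but is capped at 50 gets half a CPU,
# even if the rest of the host is idle.
print(limit_cpu(3.0, 50))
# With the default cap of 0 the demand is not limited.
print(limit_cpu(3.0, 0))
```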
SMP load balancing
The credit scheduler automatically load balances guest VCPUs across all available physical CPUs on an SMP host. The administrator does not need to manually pin VCPUs to load balance the system. However, she can restrict which CPUs a particular VCPU may run on using the generic vcpu-pin interface.
Schedule Rate Limiting (added in Xen 4.2)
Hui Lv over at Intel did some fascinating work analyzing the performance overhead of running SpecVirt inside of Xen. What he discovered was that under certain circumstances, some VMs were waking up, doing a few microseconds worth of work, and going back to sleep, only to be woken up microseconds later. The credit1 scheduler correctly identified that these were probably latency-sensitive applications and gave them priority to run whenever they needed to. The problem was that they were causing thousands of schedules per second — in some cases up to 15,000 schedules per second. This meant that there was a very significant amount of time actually spent in the scheduler switching back and forth between the two tasks, rather than doing the actual work.
The credit1 scheduler was giving these processes microsecond latency; but there are very few network-based workloads that require sub-millisecond latency. Hui worked with the Xen development community to introduce a simple mechanism that would be predictable and robust, and improve the performance for this workload without degrading the performance of other workloads. The result was ratelimit_us. The ratelimit is a value in microseconds. It is a minimum amount of time which a VM is allowed to run without being preempted. The default value is 1000 (that is, 1ms). So if a VM starts running, and another VM with higher priority wakes up, if the first VM has run for less than 1ms, it is allowed to continue to run until its 1ms is up; only after that will the higher-priority VM be allowed to run.
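The preemption rule described above can be sketched as follows. This is a simplified model, not the hypervisor's code: the function names and the `ran_for_us` argument are assumptions made for the illustration.

```python
def may_preempt(ran_for_us, ratelimit_us=1000):
    """Return True if a newly woken, higher-priority VM may preempt the running VM.

    ran_for_us   -- how long the current VM has been running, in microseconds
    ratelimit_us -- minimum uninterrupted run time; a value of 0 makes this
                    check always succeed
    """
    return ran_for_us >= ratelimit_us


def preemption_delay_us(ran_for_us, ratelimit_us=1000):
    """Return how long the higher-priority VM must still wait before it can run."""
    return max(0, ratelimit_us - ran_for_us)


# The running VM has used 300us of its 1ms guarantee: the waker waits 700us.
print(may_preempt(300), preemption_delay_us(300))
# After 1ms or more, the higher-priority VM runs immediately.
print(may_preempt(1200), preemption_delay_us(1200))
```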
One millisecond is not generally too long for network-based workloads to wait; and the effect is to have more “batching”, so the whole system is used more effectively. This caused significant increase in SpecVirt performance in Hui’s tests.
This feature can be disabled by setting the ratelimit to 0. One could imagine, on a computation-heavy workload, setting this to something higher, like 5ms or 10ms; or if you have a particularly latency-sensitive workload, bringing it down to 500us or even 100us.
This value can be set either on the Xen command-line using sched_ratelimit_us (Note no “credit” in this one — it’s meant to be consumed by other schedulers as well) or the xl command-line:
# xl sched-credit -r [n]
Current values of both the timeslice and the rate limit can also be viewed using xl:
# xl sched-credit
Cpupool Pool-0: tslice=30ms ratelimit=1000us
Name                                ID Weight  Cap
Domain-0                             0    256    0
Timeslice (added in Xen 4.2)
The timeslice for the credit1 scheduler by default is fixed at 30ms. This is actually a fairly long time — it’s great for computationally-intensive workloads, but not so good for latency-sensitive workloads, particularly ones involving network traffic or audio.
Xen 4.2 introduces the tslice_ms parameter, which sets the timeslice of the scheduler in milliseconds. This can be set either using the Xen command-line option, sched_credit_tslice_ms, or by using the new scheduling parameter interface to xl sched-credit:
# xl sched-credit -t [n]
Interesting values you might try are 10ms, 5ms, and 1ms. One millisecond might be a good choice for particularly latency-sensitive workloads; but beware that reducing the timeslice also increases the overhead from context switching and reduces the effectiveness of the CPU cache. Values of 5ms or 10ms give a good balance. The default, 30ms, is probably too long; but we’re going to do some more experimentation and probably switch the default in 4.3. If you try any values that turn out to be particularly good or bad, let us know.
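To get a feel for the trade-off, the sketch below computes how many timeslice-driven switches per second a single CPU would see if every VCPU ran for its full timeslice. This is simple arithmetic under that assumption, not a measurement, and the function name is hypothetical.

```python
def full_slice_switches_per_second(tslice_ms):
    """Switches per second on one CPU if every VCPU uses its whole timeslice."""
    if tslice_ms <= 0:
        raise ValueError("timeslice must be positive")
    return 1000.0 / tslice_ms


for tslice_ms in (30, 10, 5, 1):
    rate = full_slice_switches_per_second(tslice_ms)
    print(f"tslice={tslice_ms}ms -> about {rate:.0f} switches per second per CPU")
```

VCPUs that block or yield before their timeslice expires add further scheduling events on top of this figure.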
The xm sched-credit command may be used to tune the per-VM scheduler parameters. With only -d it displays the domain's current weight and cap; -w sets the weight and -c sets the cap.
# xm sched-credit -d <domain>
# xm sched-credit -d <domain> -w <weight>
# xm sched-credit -d <domain> -c <cap>
Each CPU manages a local run queue of runnable VCPUs. This queue is sorted by VCPU priority. A VCPU's priority can be one of two values: over or under, representing whether this VCPU has or hasn't yet exceeded its fair share of CPU resource in the ongoing accounting period. When inserting a VCPU onto a run queue, it is put after all other VCPUs of equal priority to it.
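The insertion rule can be sketched with a plain list standing in for the run queue. The representation (tuples of name and priority, integer priority constants) is an assumption for this example and does not reflect the hypervisor's actual data structures.

```python
# Lower value sorts closer to the head of the queue.
UNDER, OVER = 0, 1


def runq_insert(runq, vcpu, priority):
    """Insert `vcpu` after all entries of equal priority, before any lower-priority entry."""
    position = len(runq)
    for index, (_, other_priority) in enumerate(runq):
        if other_priority > priority:
            position = index
            break
    runq.insert(position, (vcpu, priority))


runq = []
runq_insert(runq, "vcpu-a", UNDER)
runq_insert(runq, "vcpu-b", OVER)
runq_insert(runq, "vcpu-c", UNDER)  # goes behind vcpu-a, ahead of vcpu-b
print([name for name, _ in runq])
```

VCPUs of equal priority are therefore served in first-in, first-out order.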
As a VCPU runs, it consumes credits. Every so often, a system-wide accounting thread recomputes how many credits each active VM has earned and bumps the credits. Negative credits imply a priority of over. Until a VCPU consumes its allotted credits, its priority is under.
On each CPU, at every scheduling decision (when a VCPU blocks, yields, completes its time slice, or is woken), the next VCPU to run is picked off the head of the run queue. The scheduling decision is the common path of the scheduler and is therefore designed to be lightweight and efficient. No accounting takes place in this code path.
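A toy model of this accounting is sketched below. The number of credits handed out per period (`CREDITS_PER_PERIOD`) and the amounts burned are illustrative assumptions, as are the class and function names; the real scheduler tracks more state than this.

```python
from dataclasses import dataclass

# Hypothetical number of credits distributed per accounting period.
CREDITS_PER_PERIOD = 300


@dataclass
class Vm:
    name: str
    weight: int
    credits: int = 0

    @property
    def priority(self):
        # Negative credits imply a priority of over.
        return "over" if self.credits < 0 else "under"


def burn(vm, amount):
    """Consume credits as the VM's VCPU runs."""
    vm.credits -= amount


def account(vms, total=CREDITS_PER_PERIOD):
    """Once per period, give each active VM credits in proportion to its weight."""
    total_weight = sum(vm.weight for vm in vms)
    for vm in vms:
        vm.credits += total * vm.weight // total_weight


a, b = Vm("dom-a", 512), Vm("dom-b", 256)
account([a, b])   # dom-a earns twice as many credits as dom-b
burn(b, 150)      # dom-b runs for longer than its credits allow
print(b.credits, b.priority)
account([a, b])   # the next period brings dom-b back above zero
print(b.credits, b.priority)
```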
When a CPU doesn't find a VCPU of priority under on its local run queue, it will look on other CPUs for one. This load balancing guarantees each VM receives its fair share of CPU resources system-wide. Before a CPU goes idle, it will look on other CPUs to find any runnable VCPU. This guarantees that no CPU idles when there is runnable work in the system.
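The selection order described above can be sketched as follows. This is a simplified model under stated assumptions: run queues are plain lists sorted with under entries first, the names `pick_next` and `runqs` are hypothetical, and details such as VCPU pinning are ignored.

```python
UNDER, OVER = 0, 1


def pick_next(cpu, runqs):
    """Pick the next (vcpu, priority) for `cpu`, or None if the CPU should idle.

    `runqs` maps a CPU id to its run queue, a list sorted with UNDER entries first.
    """
    local = runqs[cpu]
    # 1. Prefer a local VCPU that has not yet used up its fair share.
    if local and local[0][1] == UNDER:
        return local.pop(0)
    # 2. Otherwise look on other CPUs for a VCPU of priority under.
    for other, queue in runqs.items():
        if other != cpu and queue and queue[0][1] == UNDER:
            return queue.pop(0)
    # 3. Fall back to a local VCPU of priority over.
    if local:
        return local.pop(0)
    # 4. Before going idle, take any runnable VCPU from another CPU.
    for other, queue in runqs.items():
        if other != cpu and queue:
            return queue.pop(0)
    return None


runqs = {0: [("v1", OVER)], 1: [("v2", UNDER), ("v3", OVER)]}
print(pick_next(0, runqs))  # takes the under VCPU queued on CPU 1
print(pick_next(0, runqs))  # then its own over VCPU
print(pick_next(0, runqs))  # then any remaining runnable VCPU
print(pick_next(0, runqs))  # nothing left, so the CPU idles
```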
Glossary of Terms
- ms: millisecond
- Host: The physical hardware running Xen and hosting guest VMs.
- VM: guest, virtual machine.
- VCPU: Virtual CPU (one or more per VM)
- CPU/PCPU: Physical host CPU.
- Tick: Clock tick period (10ms)
- Time-slice: The time-slice a VCPU receives before being preempted to run another (30ms)
- Period: The accounting period (30ms). Once per period, credits earned are recomputed.
- Weight: Proportional share of CPU per guest VM
- Cap: An optional upper limit on the CPU time consumable by a particular VM.