Credit is a (weighted) proportional fair share virtual CPU scheduler. It was the first Xen scheduler designed from the beginning to be fully work-conserving on SMP hosts. Each virtual machine is assigned a weight and a cap. A cap of 0 puts the VM in work-conserving mode. A non-zero cap means the vCPUs of the VM will not run above a certain amount of CPU time, even if the system is idle (non-work-conserving mode).
It is quantum based, and the timeslice is 30 ms. This (roughly) means that a vCPU can run for up to 30 ms before being preempted by another vCPU. That is nowadays a rather long interval of time, but:
- it was not that long at the time when Credit was designed and implemented,
- less frequent preemption is good for the throughput of CPU-bound workloads.
Credit tries to compensate for its long timeslice by giving I/O-intensive vCPUs a priority boost. Roughly speaking, this means that a vCPU that wakes up after having waited for I/O will likely get to run immediately.
Credit is the default scheduler of Xen. It provides satisfactory performance for a lot of workloads. If applications that need low latencies (some classes of networking applications, audio, etc.) suffer, a potential mitigation is to change the timeslice (see below).
Global Scheduler Parameters
The timeslice (also known, in other contexts, as the scheduling quantum) is how long a vCPU can run before the scheduler itself steps in and decides whether a preemption should occur. If it should, some other vCPU is put into execution.
A long timeslice is usually good for achieving high throughput of CPU-intensive workloads, as it prevents context switches from happening too frequently, which may lead to thrashing of the CPU cache(s). The best timeslice value, though, is highly workload dependent. Credit has, by default, a timeslice of 30ms, which can be considered fairly long.
In Xen 4.2, we introduced the tslice_ms scheduler parameter. This can be set either using the Xen command-line option, sched_credit_tslice_ms, or, at run time, with
# xl sched-credit -t [n]
Possible good values may be 10ms, 5ms, and 1ms, with smaller values allegedly being better suited for latency-sensitive workloads, but at the cost of increased context-switch overhead and reduced CPU cache effectiveness.
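To get a feel for the overhead side of this trade-off, the following is a minimal back-of-the-envelope sketch in Python. It assumes purely CPU-bound vCPUs that always use their whole timeslice, and it ignores wakeups and boosting; the function name is hypothetical and not part of Xen or xl.

```python
def max_timeslice_expiries_per_second(tslice_ms: float) -> float:
    """Upper bound on timeslice expiries per second on one physical CPU.

    Assumption: every vCPU runs for its full timeslice, so the scheduler
    is invoked for timeslice expiry once every tslice_ms milliseconds.
    """
    if tslice_ms <= 0:
        raise ValueError("tslice_ms must be positive")
    return 1000.0 / tslice_ms


# Compare the default with the smaller values mentioned in the text.
for tslice_ms in (30, 10, 5, 1):
    rate = max_timeslice_expiries_per_second(tslice_ms)
    print(f"tslice={tslice_ms:>2}ms -> at most {rate:.1f} expiries/s per pCPU")
```

Going from 30ms to 1ms raises this bound by a factor of 30, which is why smaller values have to be weighed against the latency benefit.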
The default value of 30ms is generally recognised as being anachronistically high. There has been an attempt to change it to something smaller but, unfortunately, because of the intrinsic characteristics of the Credit algorithm, changing the timeslice has some hard-to-predict side effects, so the change was pushed back.
Therefore, using a different (smaller?) timeslice value may be beneficial for a particular workload, but that can only be assessed by experimentation.
Context-Switch Rate Limiting
There may be cases where interrupt-intensive workloads (i.e., an interrupt wakes up a VM, which does a few microseconds of work and goes back to sleep), coupled with the boosting of vCPUs doing I/O enacted by Credit, cause thousands of scheduler invocations per second. Analysis done by Intel on the SpecVirt benchmark found that there may be up to 15,000 schedules per second.
We therefore introduced context-switch rate limiting, configured via the ratelimit_us parameter. The ratelimit is a value in microseconds. It is the minimum amount of time for which a VM is allowed to run without being preempted. The default value is 1000 (that is, 1ms). So if a VM starts running, and another VM with higher priority wakes up, then if the first VM has run for less than 1ms, it is allowed to continue to run until its 1ms is up; only after that will the higher-priority VM be allowed to run.
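The rule can be sketched as a small decision function. This is only an illustration of the behaviour described here, not Xen's actual code; both function names are hypothetical. A ratelimit of 0 is treated as "rate limiting disabled".

```python
def may_preempt_now(ran_us: int, ratelimit_us: int = 1000) -> bool:
    """Return True if a higher-priority vCPU that just woke up may
    preempt the currently running vCPU immediately.

    ran_us       -- how long the current vCPU has been running (microseconds)
    ratelimit_us -- minimum uninterrupted run time; 0 disables the limit
    """
    if ratelimit_us == 0:
        return True
    return ran_us >= ratelimit_us


def preemption_delay_us(ran_us: int, ratelimit_us: int = 1000) -> int:
    """Microseconds the waking vCPU still has to wait because of the ratelimit."""
    if ratelimit_us == 0:
        return 0
    return max(0, ratelimit_us - ran_us)


# The current vCPU has run for 300us when a higher-priority vCPU wakes up.
print(may_preempt_now(300))       # False: still inside the 1ms window
print(preemption_delay_us(300))   # 700: the waker runs after another 700us
```

With the default of 1000us, the worst-case extra wake-up latency introduced by rate limiting is therefore just under 1ms.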
One millisecond is generally not too long for network-based workloads to wait, and the effect is to have more “batching”, so the whole system is used more effectively. This caused a significant increase in SpecVirt performance in Hui’s tests.
This feature can be disabled by setting the ratelimit to 0. One could imagine, on a computation-heavy workload, setting this to something higher, like 5ms or 10ms; or if you have a particularly latency-sensitive workload, bringing it down to 500us or even 100us.
This value can be set either on the Xen command line using sched_ratelimit_us (note that there is no “credit” in this one, as it is meant to be consumed by other schedulers as well) or on the xl command line:
# xl sched-credit -r [n]
Current values of both can also be viewed using xl:
# xl sched-credit
Cpupool Pool-0: tslice=30ms ratelimit=1000us
Name                                ID Weight  Cap
Domain-0                             0    256    0
VM Scheduling Parameters
Each domain (including the host OS) is assigned a weight and a cap.
A domain with a weight of 512 will get twice as much CPU as a domain with a weight of 256 on a contended host. Legal weights range from 1 to 65535 and the default is 256.
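As a worked example of the proportionality, here is a minimal Python sketch. It assumes every listed domain is always runnable on a fully contended host, and it ignores caps and the number of vCPUs per domain; the domain names and the function name are hypothetical.

```python
def cpu_shares(weights: dict) -> dict:
    """Map each domain name to its fraction of CPU time under contention.

    Assumption: all domains are continuously runnable, so CPU time is
    divided purely in proportion to the weights.
    """
    for name, weight in weights.items():
        if not 1 <= weight <= 65535:
            raise ValueError(f"weight of {name} is outside the range 1..65535")
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


shares = cpu_shares({"dom-a": 512, "dom-b": 256})
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Only the ratio between weights matters in this model: 512 versus 256 yields the same split as 2 versus 1.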
The cap optionally fixes the maximum amount of CPU a domain will be able to consume, even if the host system has idle CPU cycles. The cap is expressed as a percentage of one physical CPU: 100 is 1 physical CPU, 50 is half a CPU, 400 is 4 CPUs, etc. The default, 0, means there is no upper cap.
NB: Many systems have features that will scale down the computing power of a CPU that is not 100% utilized. This can be in the operating system, but can also sometimes be below the operating system, in the BIOS. If you set a cap such that individual cores are running at less than 100%, this may have an impact on the performance of your workload over and above the impact of the cap. For example, if your processor runs at 2GHz, and you cap a VM at 50%, the power management system may also reduce the clock speed to 1GHz; the effect will be that your VM gets 25% of the available power (50% of 1GHz) rather than 50% (50% of 2GHz). If you are not getting the performance you expect, look at the performance and cpufreq options in your operating system and your BIOS.
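The cap arithmetic and the frequency-scaling example can be written out as a small Python sketch. The helper names are hypothetical, and the second function assumes, as the example does, that compute power scales linearly with clock speed.

```python
from typing import Optional


def cap_to_pcpus(cap_percent: int) -> Optional[float]:
    """Convert a cap value into a number of physical CPUs (None = no cap)."""
    if cap_percent < 0:
        raise ValueError("cap must not be negative")
    if cap_percent == 0:
        return None  # the default: no upper cap
    return cap_percent / 100.0


def effective_fraction(cap_percent: int, nominal_ghz: float, scaled_ghz: float) -> float:
    """Fraction of one full-speed CPU a capped VM really gets when power
    management has lowered the clock from nominal_ghz to scaled_ghz.

    Assumption: compute power is proportional to clock frequency.
    """
    return (cap_percent / 100.0) * (scaled_ghz / nominal_ghz)


print(cap_to_pcpus(400))                 # 4.0 physical CPUs
print(effective_fraction(50, 2.0, 1.0))  # 0.25, i.e. 25% instead of 50%
```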
The xm sched-credit command may be used to tune the per-VM scheduler parameters.
# xm sched-credit -d <domain>
# xm sched-credit -d <domain> -w <weight>
# xm sched-credit -d <domain> -c <cap>
Each CPU manages a local run queue of runnable VCPUs. This queue is sorted by VCPU priority. A VCPU's priority can be one of two values: over or under, representing whether this VCPU has or hasn't yet exceeded its fair share of CPU resources in the ongoing accounting period. When a VCPU is inserted onto a run queue, it is put after all other VCPUs of equal priority.
As a VCPU runs, it consumes credits. Every so often, a system-wide accounting thread recomputes how many credits each active VM has earned and bumps the credits. Negative credits imply a priority of over. Until a VCPU consumes its allotted credits, its priority is under.
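The relationship between credits and priority can be illustrated with a toy model in Python. This is a sketch under simplifying assumptions, not the Xen implementation: the credit amounts are made up, each VM is modelled as a single VCPU, and credits are handed out strictly in proportion to weight. All names are hypothetical.

```python
from dataclasses import dataclass

UNDER, OVER = "under", "over"


@dataclass
class VCpu:
    name: str
    weight: int
    credits: int = 0

    @property
    def priority(self) -> str:
        # Negative credits imply a priority of over.
        return OVER if self.credits < 0 else UNDER


def burn(vcpu: VCpu, amount: int) -> None:
    """Running consumes credits."""
    vcpu.credits -= amount


def account(vcpus: list, credits_per_period: int) -> None:
    """Periodic accounting: distribute new credits in proportion to weight."""
    total_weight = sum(v.weight for v in vcpus)
    for v in vcpus:
        v.credits += credits_per_period * v.weight // total_weight


a, b = VCpu("a", 256), VCpu("b", 256)
account([a, b], 300)      # each earns 150 credits
burn(a, 200)              # a overruns its allotment
burn(b, 100)              # b stays within its allotment
print(a.credits, a.priority)   # -50 over
print(b.credits, b.priority)   # 50 under
account([a, b], 300)      # the next accounting pass bumps both
print(a.credits, a.priority)   # 100 under
```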
On each CPU, at every scheduling decision (when a VCPU blocks, yields, completes its time slice, or is woken up), the next VCPU to run is picked off the head of the run queue. The scheduling decision is the common path of the scheduler and is therefore designed to be lightweight and efficient. No accounting takes place in this code path.
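A minimal Python sketch of such a run queue follows. It models only the two priorities named above (under and over), leaves out locking and the I/O boost, and uses hypothetical class and method names; it is meant to show the insertion and pick rules, not Xen's data structures.

```python
PRIORITY_RANK = {"under": 0, "over": 1}  # lower rank runs first


class RunQueue:
    """Per-CPU queue of runnable VCPUs, sorted by priority."""

    def __init__(self):
        self._queue = []  # list of (name, priority) tuples

    def insert(self, name: str, priority: str) -> None:
        """Insert after all queued VCPUs of equal (or better) priority."""
        rank = PRIORITY_RANK[priority]
        index = len(self._queue)
        for i, (_, queued_priority) in enumerate(self._queue):
            if PRIORITY_RANK[queued_priority] > rank:
                index = i
                break
        self._queue.insert(index, (name, priority))

    def pick_next(self):
        """Scheduling decision: take the head of the queue, no accounting."""
        return self._queue.pop(0) if self._queue else None

    def names(self):
        return [name for name, _ in self._queue]


rq = RunQueue()
rq.insert("v1", "over")
rq.insert("v2", "under")
rq.insert("v3", "under")
rq.insert("v4", "over")
print(rq.names())      # ['v2', 'v3', 'v1', 'v4']
print(rq.pick_next())  # ('v2', 'under')
```

Because insertion keeps the queue sorted, the decision itself is a constant-time pop from the head, which matches the goal of keeping the common path cheap.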
When a CPU doesn't find a VCPU of priority under on its local run queue, it will look on other CPUs for one. This load balancing guarantees each VM receives its fair share of CPU resources system-wide. Before a CPU goes idle, it will look on other CPUs to find any runnable VCPU. This guarantees that no CPU idles when there is runnable work in the system.
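The order of these lookups can be sketched in Python as follows. This is a simplified illustration with hypothetical names: run queues are plain lists of (name, priority) tuples already sorted with under before over, and the sketch ignores locking, pinning restrictions, and the order in which Xen actually visits peer CPUs.

```python
def pick_with_stealing(cpu: int, runqueues: dict):
    """Choose the next VCPU for `cpu`, looking at peer CPUs when needed.

    runqueues maps a CPU id to its sorted list of (name, priority) tuples.
    """
    local = runqueues[cpu]

    # 1. Prefer a local VCPU of priority under.
    if local and local[0][1] == "under":
        return local.pop(0)

    # 2. Otherwise look for an under VCPU at the head of a peer's queue.
    for peer, queue in runqueues.items():
        if peer != cpu and queue and queue[0][1] == "under":
            return queue.pop(0)

    # 3. Fall back to whatever is runnable locally (priority over).
    if local:
        return local.pop(0)

    # 4. About to go idle: take any runnable VCPU from a peer.
    for peer, queue in runqueues.items():
        if peer != cpu and queue:
            return queue.pop(0)

    return None  # nothing runnable anywhere: the CPU idles


rqs = {0: [("a", "over")], 1: [("b", "under"), ("c", "over")]}
print(pick_with_stealing(0, rqs))  # ('b', 'under'), stolen from CPU 1
print(pick_with_stealing(0, rqs))  # ('a', 'over'), local fallback
print(pick_with_stealing(0, rqs))  # ('c', 'over'), taken before idling
print(pick_with_stealing(0, rqs))  # None
```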
The Credit scheduler uses 30 ms time slices for CPU allocation. A VM (VCPU) receives 30 ms before being preempted to run another VM. Once every 30 ms, the priorities (credits) of all runnable VMs are recalculated.
The scheduler monitors resource usage every 10 ms. To some degree, Credit’s computation of credits resembles virtual time computation in BVT. However, BVT has a context switch allowance C for defining a different size of the basic time slice (time quantum), and an additional low-latency support (via warp) for real-time applications.
SMP load balancing
The credit scheduler automatically load balances guest VCPUs across all available physical CPUs on an SMP host. The administrator does not need to manually pin VCPUs to load balance the system. However, she can restrict which CPUs a particular VCPU may run on using the generic vcpu-pin interface.
Before a CPU goes idle, it will consider other CPUs in order to find any runnable VCPU. This approach guarantees that no CPU idles when there is runnable work in the system.