sched: New scheduler algorithm

Operating Systems / OSv - Nadav Har'El [cloudius-systems.com] - 26 November 2013 05:32 UTC

This patch replaces the algorithm which the scheduler uses to keep track of threads' runtime, and to choose which thread to run next and for how long.

The previous algorithm used the raw cumulative runtime of a thread as its runtime measure. But comparing these numbers directly was impossible: e.g., should a thread that slept for an hour now get an hour of uninterrupted CPU time? This resulted in a hodgepodge of heuristics which "modified" and "fixed" the runtime. These heuristics did work quite well in our test cases, but we were forced to add more and more unjustified heuristics and constants to fix scheduling bugs as they were discovered. The existing scheduler was especially problematic with thread migration (moving a thread from one CPU to another), as the runtime measure on one CPU was meaningless on another. This bug, if not corrected (e.g., by the patch which I sent a month ago), could cause crucial threads to acquire exceedingly high runtimes by mistake, and resulted in the tst-loadbalance test using only one CPU on a two-CPU guest.

The new scheduling algorithm follows a much more rigorous design, proposed by Avi Kivity in: https://docs.google.com/document/d/1W7KCxOxP-1Fy5EyF2lbJGE2WuKmu5v0suYqoHas1jRM/edit?usp=sharing

To make a long story short (read the document if you want all the details), the new algorithm is based on a runtime measure R which is a decaying average of the thread's running time: it is decaying in the sense that the thread's running or sleeping in recent history is given more weight than its behavior a long time ago. This measure R tells us which of the runnable threads to run next (the one with the lowest R), and with some high-school-level mathematics we can calculate for how long to run this thread until it should be preempted by the next one. R carries the same meaning on all CPUs, so CPU migration becomes trivial.
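
To make this concrete, here is a small sketch of the idea. This is my own toy model, not the patch's code: it ignores priorities and hysteresis, keeps time in floating-point seconds, and all names are illustrative. In this model R grows toward tau while a thread runs (dR/dt = 1 - R/tau), decays toward zero while it sleeps (dR/dt = -R/tau), and the preemption timeout falls out of solving those exponentials:

    // Toy model of the decaying-average runtime measure R described above.
    // Illustration only, not the patch's code: priorities and hysteresis are
    // ignored, and time is kept in floating-point seconds.
    #include <cmath>
    #include <cstdio>

    constexpr double tau = 0.2;   // decay time constant (the patch's default, 200ms)

    // R of a thread that has been sleeping for dt seconds: it decays toward 0.
    double runtime_after_sleeping(double r0, double dt) {
        return r0 * std::exp(-dt / tau);
    }

    // R of a thread that has been running for dt seconds: it grows toward tau,
    // following R(t) = tau + (r0 - tau) * e^(-t/tau).
    double runtime_after_running(double r0, double dt) {
        return tau + (r0 - tau) * std::exp(-dt / tau);
    }

    // The runnable thread with the lowest R runs; solving the equation above
    // for the moment its R reaches the runner-up's R gives the timeout after
    // which it should be preempted.
    double time_until_preemption(double r_current, double r_next) {
        return tau * std::log((tau - r_current) / (tau - r_next));
    }

    int main() {
        // A thread with R=0 runs against a runner-up with R=10ms:
        std::printf("preempt after %.1f ms\n",
                    1e3 * time_until_preemption(0.0, 0.010));   // ~10.3 ms
    }

For example, in this toy model a thread starting at R=0, with the runner-up at R=10ms, gets preempted after tau*ln(200/190), about 10.3 ms.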

The actual implementation uses a normalized version of R, called R'' (Rtt in the code), which is also explained in detail in the document. This Rtt allows updating just the running thread's runtime - not every thread's - as time passes, making the whole calculation much more tractable.
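
Roughly, the trick is that a per-CPU normalization factor c(t) = e^(t/tau) grows at exactly the rate at which a sleeping thread's R decays, so the product Rtt = R * c(t) stays constant for every thread that is not running, and only the running thread's value ever needs updating. The sketch below illustrates this in the same toy model as above; the names and the floating-point time are my assumptions, not what the patch actually does:

    // Continuation of the toy model above, for the normalized measure Rtt.
    // Illustration only; names and continuous-time doubles are illustrative.
    #include <cmath>

    constexpr double tau = 0.2;           // decay time constant, seconds

    // Per-CPU normalization factor c(t) = e^(t/tau), with t measured from an
    // arbitrary origin.  It grows at exactly the rate at which a sleeping
    // thread's R decays, so Rtt = R * c(t) is constant while a thread sleeps.
    double normalization(double t) {
        return std::exp(t / tau);
    }

    struct thread_runtime {
        double rtt = 0;                   // normalized runtime, R * c(t)
    };

    // Charge dt seconds of CPU time to the running thread.  Substituting
    // R(t) = tau + (R0 - tau) * e^(-dt/tau) into Rtt = R * c(t) collapses the
    // update to a single addition; no other thread has to be touched.
    void account_running(thread_runtime& tr, double t_start, double dt) {
        tr.rtt += tau * (normalization(t_start + dt) - normalization(t_start));
    }

    int main() {
        thread_runtime tr;
        account_running(tr, 0.0, 0.010);  // run for 10ms starting at t=0
        // tr.rtt is now tau * (e^(0.01/tau) - 1), about 0.01025
    }

Since c(t) grows exponentially, any implementation along these lines also has to renormalize periodically (divide every Rtt and c by the current c) before the numbers overflow.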

The benefits of the new scheduler code over the existing one are:

1. A more rigorous design with fewer unjustified heuristics.

2. A thread's runtime measurement correctly survives a migration to a different CPU, unlike the existing code (which sometimes botches it up, leading to threads hanging). In particular, tst-loadbalance now gives good results for the "intermittent thread" test, unlike the previous code which in 50% of the runs caused one CPU to be completely wasted (when the load-balancing thread hung).

3. The new algorithm can look at a much longer runtime history than the previous algorithm did. With the default tau=200ms, the one-cpu intermittent thread test of tst-scheduler now provides good fairness for sleep durations of 1ms-32ms. The previous algorithm was never fair in any of those tests.

4. The new algorithm is more deterministic in its use of timers (with thyst=2_ms: up to 500 timers a second), resulting in less varied performance in high-context-switch benchmarks like tst-ctxsw.

This scheduler does very well on the fairness tests tst-scheduler and fairly well on tst-loadbalance. Even better performance on that second test will require an additional patch for the idle thread to wake other CPUs' load-balancing threads.

As expected, the new scheduler is somewhat slower than the existing one (as we now do some relatively complex calculations instead of trivial integer operations), but thanks to using approximations where possible and to various other optimizations, the difference is relatively small:

On my laptop, tst-ctxsw.so, which measures "context switch" time (actually, also including the time to use the mutex and condvar which this test uses to cause context switching), gave 355 ns per switch with the old scheduler on the "colocated" test, and 382 ns with the new scheduler - meaning that the new scheduler adds 27 ns of overhead to every context switch. To see that this penalty is minor, consider that tst-ctxsw is an extreme example, doing close to 3 million context switches a second, and even there the extra 27 ns out of 382 ns slows the workload down by only about 7%.

dbc0d50 sched: New scheduler algorithm
core/sched.cc | 404 ++++++++++++++++++++++++++++++++++++++++++++----------
include/sched.hh | 97 ++++++++++++-
2 files changed, 421 insertions(+), 80 deletions(-)

Upstream: github.com

