Going back to the basics... kinda.
(Or: what you finally learn to do after a number of false starts)
But first, the obligatory disclaimer, in case you missed it. Thanks.
After I admitted the basic flaws of my previous attempts, I decided to go back to the basics and use a “bottom-up” approach to come up with a solution that suited the needs of the Haiku scheduler: fairness, responsiveness, scalability, succinctness and efficiency.
Right. So let’s keep in mind that the BeBook already hints at what the priority numbers should roughly mean: frequency of scheduling. We have 120 priority levels; the highest 20 are reserved for real-time threads.
This post is not going to address these directly. Instead, today I’ll be telling the background story of how quite a number of pieces are being woven together to fulfill those specific needs and constraints. OK, enough metablogging ;)
After letting go of the previous approach, I decided to study the subject in the traditional literature. While reading about classic scheduler designs in the Minix book, I noticed that the description of lottery-based scheduling fit very well with that notion of priorities as the approximate frequency of being chosen by the scheduler. Lottery scheduling was proposed by Carl Waldspurger in 1994, and it works roughly like this[1]:
Let’s use a supply vs. demand analogy. Think of a very popular store you know, and suppose there’s a big sale going on. Only one person can go inside the store at any time, and only when someone else gets out of it. The people outside refused to form a line, but they accepted participating in a lottery-like system, where numbers are drawn randomly and the person holding the winning ticket gets inside the store.
Suppose you could assign different priorities to people with special needs; for example, the elderly and the pregnant could receive more than one ticket to increase their chances of going inside sooner than others.
Seems fair, right? So let’s go back to computers. Suppose there exists some resource that is available in limited quantities (like CPU power, I/O bandwidth, network bandwidth, and so on). Suppose there are several processes competing for those resources, and they can have different priorities. Now, suppose that you could assign those processes “tickets” that can be used to organise how they will be served.
A ticket corresponds to a slot of time during which a process can make use of those limited resources. Like in a lottery, the ticket number is drawn at random, and the process holding the winning ticket is allowed to use the resource. Suppose it’s possible to give many tickets to a process; it will then be allowed to use those limited resources proportionally to the number of tickets it holds. So a process holding 20 tickets will be allowed inside 20 times, a process holding 10 tickets, 10 times; and so on.
Except that those tickets don’t expire, and are good for as many trips inside as the process wishes. This effectively makes it so that, during their lifetime, a process holding 20 tickets will be allowed inside twice as many times as one holding 10 tickets. As time goes by, the processes will have used the resource proportionally to the number of tickets they hold. Cool, eh? Priorities can be easily implemented by simply handing a process a number of tickets proportional to what its priority should be.
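If you’d rather see this in code than in analogies, here’s a minimal sketch of a lottery draw in C++. The `Task` struct and the function names are my own illustration, not anything from Haiku or from Waldspurger’s paper:

```cpp
// Minimal lottery-scheduling sketch: each task holds a number of
// tickets proportional to its priority, and the holder of a randomly
// drawn ticket runs next.
#include <cstdint>
#include <random>
#include <vector>

struct Task {
    const char* name;
    uint32_t tickets;  // proportional to the task's priority
};

Task* DrawWinner(std::vector<Task>& tasks, std::mt19937& rng)
{
    uint32_t total = 0;
    for (const Task& task : tasks)
        total += task.tickets;
    if (total == 0)
        return nullptr;

    // Draw a ticket number in [0, total) and find its holder.
    std::uniform_int_distribution<uint32_t> draw(0, total - 1);
    uint32_t winner = draw(rng);

    for (Task& task : tasks) {
        if (winner < task.tickets)
            return &task;  // this task holds the winning ticket
        winner -= task.tickets;
    }
    return nullptr;  // not reached when total > 0
}
```

Run this in a loop and, in the long run, each task gets picked in proportion to its share of the tickets.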
I had to stop for a moment to really appreciate how elegant this is. I mean it; I just felt like staring into the void for a good while until things clicked together. It seems that people have no problem understanding priorities when dealing with a one-shot event (like a queue at the bank, where the elderly, the pregnant etc. have higher priority to be attended to by the cashier), but priorities when dealing with cyclical events are not this straightforward. The idea of being cyclically served, with the cycles expressed as temporal frequencies, and noticing that the ratios between frequencies can be understood as differences in priority, is very simple, very powerful, and very elegant. The concept of fairness stems naturally from it.
Not to mention how plainly intuitive this is. Isn’t it great to have something palpable to relate to when discussing priorities? Isn’t it better to think in terms of physical tickets instead of cold numbers?
Well, great!
I really wished things were just this simple.
So far, I've been trying to avoid sounding too technical, because until now most of the feedback I received came from a non-technical audience. Sorry, folks, I can't avoid going a little deeper into the technical details right now. But I'll try my best not to make things sound too hairy.
Lottery scheduling sounds very nice in theory, but there are some problems. The most obvious problem is the randomness: the first several rounds of scheduling will resemble anything but the proportion we’d like to attain. We only observe the desired ratios in the long run.
A possible solution to this was proposed by Waldspurger himself in 1995, when he introduced Stride Scheduling. In a nutshell, he replaced random lotteries with strides and passes. Every process has two properties: a “pass” and a “stride”. Strides are inversely proportional to priority. Each time a process is chosen, its pass is increased by its stride. The scheduler always picks the process with the lowest pass[2].
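Again as a rough sketch (names of my own invention, and a simplification of what the thesis actually describes), the core of it looks like this:

```cpp
// Minimal stride-scheduling sketch: the scheduler deterministically
// picks the task with the lowest pass, then advances that pass by the
// task's stride (which is inversely proportional to its tickets).
#include <cstdint>
#include <vector>

constexpr uint64_t kStrideScale = 1 << 20;  // a large constant

struct Task {
    const char* name;
    uint64_t stride;  // kStrideScale / tickets
    uint64_t pass;    // grows by stride every time the task runs
};

Task MakeTask(const char* name, uint64_t tickets)
{
    return Task{name, kStrideScale / tickets, 0};
}

Task* PickNext(std::vector<Task>& tasks)
{
    Task* next = nullptr;
    for (Task& task : tasks) {
        if (next == nullptr || task.pass < next->pass)
            next = &task;
    }
    if (next != nullptr)
        next->pass += next->stride;  // charge the task for its turn
    return next;
}
```

No randomness involved: the proportions hold from the very first rounds instead of only in the long run.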
The second problem lies in the mapping between tickets and processes. Searching for a ticket by visiting every process is very inefficient. We could use hash tables, or trees, to do this indexing. I’ll tackle this in a following post.
The third problem is that one of the requirements of Haiku’s scheduler is that it should have constant (O(1)) complexity with regard to the number of processes (threads, in our case). I didn’t manage to find[3] whether Waldspurger’s work touches this issue at all, but at least one implementation used a tree to organise the threads. And there was this thing called “Hierarchical Stride Scheduling” in his thesis, and the sound of those words rang some bells here. I’ll address this one right in the next post.
The fourth and worst problem is that both lottery- and stride-based scheduling are great for processes that want to use as much of the limited resource as possible during their time share. In the case of CPU power, those would be CPU-bound threads. BeOS and Haiku are desktop operating systems, where interactive threads are prevalent. This means that there will be plenty of threads that spend most of their time waiting for user input instead of using the CPU.
Uh, quite a handful of tricky problems. Kind of discouraging, no?
But wait, all is not wasted ;) There are solutions for all those problems. I have come up with some twists that I'd like to share with you.
I'm convinced that even if stride scheduling as expressed in Waldspurger's thesis can be less than optimal for interactive tasks (i.e., it doesn't approach the static version of shortest-remaining-time scheduling, which is proven to provide the best possible average response times), there's no doubt in my mind that the idea of using strides results in perfect fairness when all tasks are equal. One can actually prove this mathematically. Not necessarily his stride + pass approach, though; simple strides will do.
For example, for modeling purposes, suppose you have an array where you store a task T as many times as T’s priority (let’s call this the temporal frequency property, or TFP). So let task A have priority 1, task B have priority 2, and task C have priority 3. One possible array where the TFP holds is A B B C C C, but so does any permutation of this array. If you multiply the occurrences of every element by a constant factor, for example 2, the array becomes A A B B B B C C C C C C. The TFP still holds for this array, as well as for any possible permutation of it. This is the key to the discussion that follows, so please take a moment to convince yourself that it’s true.
It turns out that any array for which the TFP holds can be reduced to the “trivial” array by ordering it by task (i.e., making it “non-permuted”) and “scaling” it down so that each task appears its_priority / Greatest_Common_Divisor(A.prio, B.prio, C.prio, …) times.
In other words, doing exactly what we do when reducing fractions. Eh. Sorry if that sounded scary.
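Concretely, building the trivial array from the priorities might look like this (a sketch of my own; std::gcd does the fraction-reducing part):

```cpp
// Build the "trivial" (ordered, reduced) TFP array: each task appears
// priority / gcd(all priorities) times. Illustrative sketch only.
#include <numeric>  // std::gcd
#include <utility>
#include <vector>

std::vector<char> TrivialArray(const std::vector<std::pair<char, int>>& tasks)
{
    int divisor = 0;
    for (const auto& [name, priority] : tasks)
        divisor = std::gcd(divisor, priority);  // gcd(0, x) == x

    std::vector<char> array;
    for (const auto& [name, priority] : tasks) {
        for (int i = 0; i < priority / divisor; i++)
            array.push_back(name);
    }
    return array;  // {A:1, B:2, C:3} -> A B B C C C
}
```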
Nice, so now we have an invariant :) So, for clarity, let’s keep the trivial array, A B B C C C.
Ideally, we’d like those three tasks to be run with a fixed temporal frequency, to ensure fairness and to have smooth response times. If we were to access this array sequentially, the ideal permutation of this array would be C B A C B C or some slight variation of it. However, finding such a permutation when the array is much bigger than this one (which would be the common situation in our case) is not a cheap operation.
However, we can do some math tricks to try to approach this ideal situation. In Computer Science, those kinds of tricks are called heuristic algorithms.
Notice that any strided traversal of this array, taken modulo the size of the array (which is identical to the sum of all priorities), where the stride is coprime to the size of the array (let’s call this the Good Stride Property, or GSP), will return a given task proportionally to the number of times it appears in the array, divided by the size of the array (i.e., the sum of the priorities of all tasks). (Alternatively, if we don’t scale the array, it’s enough to calculate the stride using the trivial array, and then scale the stride using the GCD.)
So in our example A will appear 1/6 of the times, B, 2/6 (1/3) of the times, and C, 3/6 (1/2) of the times. For any stride for which the GSP holds.
Trivial case: stride = 1. Traversing the array will output A, B, B, C, C, C, A, B, B, … Next case, stride = 2: not coprime, the GSP doesn’t hold. It would output A, B, C, A, B, C, …, and the proportions are gone. Next case, stride = 3: not coprime, the GSP doesn’t hold; it would output A, C, A, C, A, C, …, and B doesn’t even show!! Stride 4: A, C, B, A, C, B, … Not coprime, the GSP doesn’t hold; same situation as stride = 2. Stride 5: A, C, C, C, B, B, … OK. Stride 6: not coprime (doh).
(Well, unfortunately 3 tasks with respective priorities of 1, 2 and 3 are a terrible example, as their sum is 6, only 1 and 5 are coprime to 6, and the modular distance of both 1 and 5 from 6 is 1… But I chose them on purpose: they’re simple enough to make the example clear, and they already prove that there’s no algorithm that will return a stride that displays this even-ness property in every possible case, for the simple fact that such a stride might not even exist.)
(And notice that it would actually suffice to test only half of the possible strides smaller than sum_of_priorities (let’s call it SoP from now on), because, since the strides are taken modulo SoP, the second half just mirrors the first.)
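Here’s a little program that reproduces the walkthrough above, testing every stride against the trivial array (again just an illustration, not scheduler code):

```cpp
// Strided traversal of the trivial array A B B C C C. When the stride
// is coprime to the array size (the GSP holds), every task shows up in
// proportion to its priority; otherwise the proportions break down.
#include <cstdio>
#include <numeric>  // std::gcd
#include <vector>

int main()
{
    const std::vector<char> array = {'A', 'B', 'B', 'C', 'C', 'C'};
    const size_t size = array.size();  // == sum of priorities (SoP)

    for (size_t stride = 1; stride < size; stride++) {
        bool coprime = std::gcd(stride, size) == 1;
        printf("stride %zu (%s): ", stride,
            coprime ? "GSP holds" : "not coprime");

        size_t index = 0;
        for (size_t round = 0; round < size; round++) {
            printf("%c ", array[index]);
            index = (index + stride) % size;
        }
        printf("\n");
    }
    return 0;
}
```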
Anyway, just trust me on this one. Picking a good stride (when available) does yield a distribution of tasks in time that is more evenly spaced, but even in the worst case, the correct proportions will have to emerge, by construction, after SoP scheduling rounds.
Now, what if we add a new task? Should we abandon the current strided traversal and start anew? Well, I still haven’t decided what to do here. So far I’ve considered starting anew but with a different stride, starting from a random position. Perhaps even taking the opposite direction (subtracting strides if they were being added in the previous scheduling round, and vice-versa). This should eventually converge to a fair distribution. Simply finishing the current strided traversal until it completes SoP scheduling rounds and ignoring new tasks until then would be terrible for responsiveness when the number of tasks grows. I believe using the rule of thumb I described above as a compromise should be reasonable enough, but I’m open to suggestions.
I hope that by now you're feeling that the pieces are finally starting to fall together. And they are! I'll try to cut the math abstractions to the minimum on the next blog post. It will be more algorithm-oriented.
See you then! ;)
Notes:
- [1] I'm aware that both lottery scheduling and stride scheduling were implemented in the Linux kernel several times before as academic exercises. While there's no doubt that fairness and multimedia responsiveness improved, nobody (that I'm aware of) made an effort to reimplement these ideas in the 2.6 kernel series. Actually, I'm not being completely honest here: those ideas did make a comeback in recent versions of the 2.6 Linux kernel series. I'll get to it in due time. Stay tuned!
- [2] Linux followers: does that ring a bell? *hint*
- [3] OK, I have to admit that I haven't really looked deep into that, but I didn't find anything suggesting that he himself discussed the complexity of his implementation with regard to the number of threads in the system. But the fact is, since the most complete implementation made for the Linux kernel so far used a tree, you can already assume there's at least the possibility of such complexity being O(lg n), where n is the number of threads. But rest assured, my solution is even better. Or so I guess ;)
Don't miss the next chapter in the Scheduler saga! Tomorrow (hopefully), same bat time... Same bat channel (powered by RSS feeds!)... Right here. Stay tuned.