Welcome to the Modern Embedded Systems Programming course. My name is Miro Samek and in this fifth lesson on RTOS I'll finally address the real-time aspect in the "Real-Time Operating System" name. Specifically, in this lesson you will augment the MiROS RTOS with a preemptive, priority-based scheduler, which can be mathematically proven to meet real-time deadlines under certain conditions. As usual, let's get started by making a copy of the previous lesson 25 directory and renaming it to lesson 26. Get inside the new lesson 26 directory and double-click on the uVision project "lesson" to open it. To remind you quickly what happened so far, in the last lesson you added efficient blocking delay to the MiROS RTOS. However, the RTOS is not yet worthy of its name, because the round-robin scheduler is not really "Real-Time" yet. So, let's start today by introducing the concept of "Real-Time". Until now, in all your code, any computation that was performed was considered equally useful, and you cared only whether the computation was correct, but not whether it was performed within a given time. "Real-Time" adds the timeliness requirement to the computations. Specifically, a computation that is performed too late (or too early, for that matter) is considered less useful, or even as harmful as an outright wrong computation. Here is a graphical representation of the "real-time" usefulness of a computation as a function of time. In the so-called "hard real-time systems", a computation is useful from the triggering event up to the deadline. After the deadline, the usefulness of the computation becomes "negative infinity", which means that the computation is worse than useless (for that, its usefulness would merely be zero). A missed deadline represents a system failure. For example, deploying an air-bag too late is not just useless--it is disastrous. But there are also "soft real-time systems", where timeliness is also important, but the deadline is not as firm. 
For example, a text message is expected to be delivered somewhat timely, say within 20 seconds. But it is still useful much longer, although its usefulness diminishes with time. In the following discussion, I will focus on hard real-time systems. From the historical perspective, mainstream use of computers in hard real-time systems started in the 1960's and 70's. For example, the Apollo program used two identical real-time computers: one in the command module and another in the lunar module. Here is a picture of Margaret Hamilton, who led much of this effort, with the stack of code listings showing how much real-time software was created during the 1960's and early 70's. Already in these early days, people realized that most real-time systems operate in a periodic fashion, meaning that the triggering events and the deadlines repeat with a certain period. For example, landing a spacecraft on the lunar surface required fine corrections to the rocket thrusters that the on-board computer had to perform every few milliseconds. Similarly, industrial processes require periodic control ranging from milliseconds to tens of seconds. Electronic control of combustion engines requires periodic control synchronized with the revolutions of the engine, and so on. To better understand how the real-time concepts apply to periodic threads, let's perform a few experiments with your current version of the MiROS RTOS. Specifically, you can increase the system clock tick rate to 1000 times per second, that is, one clock tick per millisecond. Next, you can modify your blinky1 and blinky2 threads so that they actually put some computational load on your CPU, because right now they only turn the LED on or off, which takes a microsecond, after which the thread blocks and relinquishes the CPU. In order to simulate a more realistic CPU loading, you can turn the LED on and off not once, but a few thousand times in a for-loop. 
This particular for-loop will take about 1.2 milliseconds to execute, which is intentionally slightly longer than your system clock tick. After the for-loop, the blinky1 thread delays for 1 clock tick, which will block until the very next SysTick interrupt. This means that the body of the while(1)-loop will repeat with the period of 2 milliseconds. At this point, let me introduce a few symbols, commonly used in the literature to characterize a periodic real-time thread, like your Blinky1: The time of computation of Thread1 is denoted as C1 and is equal to 1.2 milliseconds. The period of Thread1 is denoted T1 and is equal to 2 milliseconds. And finally, the processor utilization of Thread1 is denoted U1 and is the ratio between the time of computation and the period. This is the percentage of the time the CPU spends executing Thread1, and is 60 percent in this case. Similarly to Blinky1, you can modify the Blinky2 thread to simulate a CPU load that runs 3 times longer than Blinky1, that is for about 3.6 milliseconds. After the for-loop, the blinky2 thread will delay for 50 milliseconds, so that its total period will be about 54 milliseconds. In the diagram, the computation time of Blinky2, C2, will be 3.6 milliseconds, and its period, T2, will be 54 milliseconds. The CPU utilization of Blinky2, U2, will be 6.6 percent. One last thing before building the code is to comment out the Wait-For-Interrupt instruction in the OS_onIdle() callback. This will prevent the CPU from stopping and will allow you to see the toggling of the Red LED from the idle thread in the logic analyzer view. The code builds correctly, so let's load it into your TivaC LaunchPad board and see how it runs using a logic analyzer. The top trace, labeled ISR, corresponds to the SysTick_Handler, and it repeats every 1 millisecond. This is your fast 1000 times per second system clock tick. 
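To give you a feel for the modification, here is a minimal, host-runnable sketch of one pass through the modified blinky1 body. The BSP function names follow earlier lessons, and the iteration count of 1500 is only an assumption that would need to be tuned on the actual board to produce the 1.2 millisecond load:

```c
#include <stdint.h>

/* host-side stubs for the LED calls; on the board these drive GPIO */
static volatile uint32_t led_toggles;
static void BSP_ledGreenOn(void)  { ++led_toggles; }
static void BSP_ledGreenOff(void) { ++led_toggles; }

/* one pass of the modified blinky1 thread body: the for-loop simulates
** a CPU load of roughly 1.2 ms (the iteration count is board-dependent) */
void blinky1_once(void) {
    uint32_t volatile i;
    for (i = 1500U; i != 0U; --i) {
        BSP_ledGreenOn();   /* toggle the LED on... */
        BSP_ledGreenOff();  /* ...and off, thousands of times */
    }
    /* OS_delay(1U);  block until the next SysTick -- omitted on the host */
}
```

In the real thread, this body sits inside the while(1)-loop, so the load plus the one-tick delay repeats with the 2 millisecond period.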
The trace immediately below, labeled T1, corresponds to Blinky1 and most of the time it runs for about 1.2 milliseconds. It repeats every 2 milliseconds. The trace below, labeled T2, corresponds to Blinky2. This thread runs for about 3 to 4 milliseconds, not counting the gaps. T2 repeats every 55 milliseconds. Finally, the bottom trace, labeled IDL, corresponds to the idle thread, and it runs only when no other thread or ISR is running. But the most interesting part of the logic analyzer view is the time when Blinky1 and Blinky2 are both unblocked and are ready to run at the same time. When you zoom in, you can see that Blinky1 starts running, but is scheduled out on the next clock tick, at which point Blinky2 is scheduled to run. Blinky2 also runs only for one clock tick, and only then is Blinky1 scheduled in again to finish the processing, after which it blocks, so Blinky2 runs for the rest of the time slice. On the subsequent clock tick, Blinky1 becomes ready to run again, but notice that this is the THIRD tick from the previous activation, which means that Blinky1 misses its TWO millisecond deadline by one tick. This happens because the current scheduler inside the MiROS RTOS is still round-robin. This scheduler was designed for timesharing systems, where the most important goal was to share the CPU FAIRLY among all threads. Therefore, the scheduler takes the CPU away from Blinky1 after it exhausts its time slice. But this is NOT what you want in a hard real-time system. For example, suppose that Blinky1 represents an important thread that controls the lunar module's descent to the Moon's surface. It HAS to run every 2 milliseconds, and missing even one of these deadlines can cause a catastrophic system failure. In that case, you don't care about the fairness of CPU assignment. You care about meeting your 2 millisecond hard real-time deadline. 
For this you need a different scheduler, which somehow KNOWS about the importance of the threads, so that it can execute a more important thread before executing a less-important thread. Today, you will implement the simplest version of such a real-time scheduler, called a priority-based, preemptive scheduler with static priorities. This means that each thread will be assigned a unique PRIORITY number when it is started and this priority will not change thereafter. The job of the scheduler will be to always run the highest-priority thread that is ready-to-run. Let's see in a timing diagram how your Blinky threads would execute under such a priority-based scheduler and how the timing would differ from the current round-robin scheduler. As long as the thread T1 remains the only one ready-to-run, because T2 is blocked inside the OS_delay() function, the execution is identical under both schedulers. But as soon as T2 becomes ready-to-run, the round-robin scheduler suspends T1 and schedules T2. Already at this point T1 loses its chance to meet the 2 millisecond deadline. In contrast, the priority-based scheduler also sees that T2 is becoming ready-to-run, but T1 has a higher PRIORITY, so the scheduler chooses T1 to run. Therefore, T1 continues and easily meets its 2 millisecond deadline. The T2 thread is scheduled only after T1 voluntarily BLOCKS in the OS_delay() function. But as soon as T1 is unblocked again, the scheduler immediately switches back to T1, because it has higher PRIORITY than T2. The situation repeats until T2 eventually blocks as well. But interestingly, notice that even though T2 is significantly delayed by constant interruptions from T1, it too eventually completes and blocks way before its deadline of 54 milliseconds. This means that the priority-based scheduler executes T1 AND T2 in such a way that they BOTH meet their hard real-time deadlines. So now, let's actually implement the priority-based scheduler in the MiROS RTOS. 
The first thing is to bump up the version number and add the thread priority to the thread control block and to the OSThread_start() function. The priority number will be a small integer in the range from 0 to the number of supported threads, which is 32 in the MiROS RTOS, so it will comfortably fit in a uint8_t type. In the RTOS implementation, you also need to bump up the version number and add the priority parameter to the OSThread_start() function. Here, you also need to re-design the use of the OS_thread[] array. As you perhaps recall from the last lesson 25, the OS_thread[] array holds pointers to all thread objects that have been started. In the round-robin scheduler, the array was filled consecutively from index 0, which is reserved for the idle thread, up to the index OS_threadNum. For the priority-based scheduler, the difference will be that the indexes into the OS_thread[] array will be the thread PRIORITIES, and that these priorities don't need to be consecutive, meaning that there could be gaps in the OS_thread[] array. For example, the picture shows the idle thread at priority 0, Blinky2 at priority 2, Blinky1 at priority 5, and ThreadN at priority N. This means that you replace the OS_threadNum index with the priority number passed as the function parameter. Also, you need to save the user-specified priority into the thread control block. However, to avoid overbooking of the OS_thread[] array, you must now make sure not only that the priority level is in range, but also that it is not already used, meaning that OS_thread at the prio index is still a zero-pointer. You can move this assertion to the top of the function and make it into a precondition that you code with the Q_REQUIRE() macro introduced in the last lesson. Precondition means that it must be satisfied by the CALLER of the function, not by the function itself. For example, it is the job of the application programmer to assign a unique priority to each thread. 
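Here is a host-compilable sketch of these changes. Q_REQUIRE is modeled with the standard assert, and the control block is reduced to a few members; the real MiROS code differs in detail:

```c
#include <assert.h>
#include <stdint.h>

#define Q_REQUIRE(test_) assert(test_)  /* stand-in for the MiROS assertion */

typedef struct {
    void *sp;          /* stack pointer */
    uint32_t timeout;  /* delay-timeout counter */
    uint8_t prio;      /* thread priority, NEW in this lesson */
} OSThread;

/* indexed by priority; entry [0] is reserved for the idle thread */
static OSThread *OS_thread[32 + 1];

void OSThread_start(OSThread *me, uint8_t prio /* handler, stack... omitted */) {
    /* precondition: the priority must be in range and still unused */
    Q_REQUIRE((prio < sizeof(OS_thread)/sizeof(OS_thread[0]))
              && (OS_thread[prio] == (OSThread *)0));
    me->prio = prio;       /* save the priority in the control block */
    OS_thread[prio] = me;  /* register the thread at its priority slot */
}
```

Starting two threads at the same priority would now trip the precondition, which is exactly the overbooking protection described above.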
Now, let's think for a minute about the use of the OS_readySet bitmask. For the priority-based scheduling, the most interesting information will be to find the highest-priority thread that is ready-to-run based on the current value of the bitmask. At this point, you need to decide about your preferred priority numbering scheme. For historical reasons, many RTOSes you might encounter, such as NUCLEUS, ThreadX, MicroC/OS, embOS, and others, use the INVERSE priority numbering scheme, where priority 0 corresponds to the highest-priority thread and higher priority numbers correspond to LOWER priority threads. Needless to say, the inverse priority numbering scheme leads to constant confusion when talking about higher and lower thread priorities. But it is of course possible to use a simple, DIRECT priority numbering scheme, where priority 0 corresponds to the idle thread and higher priority numbers correspond to threads of higher priority. The MiROS RTOS will use this simple, DIRECT priority numbering convention. The basic principle of finding the priority number of the highest-priority thread that is ready-to-run will be to count the leading zeros in the OS_readySet bitmask up to the first one-bit and to subtract that count from the total number of bits, which is 32. For example, if the Blinky2 thread is the only one ready-to-run, the OS_readySet bitmask would have all zero bits except bit number 1. In this case, the count of leading zeros is 30, which converts to the priority number of 2 by subtracting it from the total number of bits, 32. If the Blinky1 thread with priority 5 is ready-to-run, then the OS_readySet bitmask will have 27 leading zeros up to the first one-bit, which converts to the priority number of 5 by subtracting it from the total number of bits, 32. 
In the general case of a thread with priority N being ready-to-run, the count of leading zeros will be 32 minus N, so that the priority number works out again as 32 minus the count of leading zeros in the OS_readySet bitmask. The formula works even for the idle condition of the system, where all 32 bits in the OS_readySet bitmask are zero, which results in priority of zero, that is, the priority of the idle thread. Mathematically, the formula 32-CLZ(x), which finds the most-significant one-bit in the number x, is an integer approximation of the log-base-2 function. And that's why I will code it as the LOG2() macro in the miros.c implementation file. Of course, the algorithm is only as fast as the Count-Leading-Zeros operation, but it turns out that your ARM Cortex-M4 processor supports it in hardware, as the CLZ instruction. You can actually find this instruction in the datasheet of your TivaC microcontroller. To take advantage of the CLZ instruction, you can search the compiler help for CLZ. As you can see, this particular compiler supports it through the intrinsic function __clz(). With the very efficient LOG2() operation, you can now implement the priority-based scheduler. As before, when the scheduler detects the idle condition of the system, it needs to choose the idle thread. But now, you no longer use the OS_currIdx variable, so you just directly set OS_next to OS_thread of zero, which is the idle thread. Otherwise, if some threads are ready-to-run, you use the LOG2() macro with the OS_readySet argument to find the highest-priority thread that is ready-to-run. Please note that the LOG2() macro is guaranteed to produce a number between zero and 32, inclusive, so it can be used directly as an index into the OS_thread[] array without range checking the index. At this point, it is also a good idea to assert that the OS_next pointer is not zero. The final change is the re-design of the OS_tick() and OS_delay() services. 
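On a desktop compiler, you can model the LOG2() operation with the GCC/Clang builtin __builtin_clz(), which corresponds to the Keil __clz() intrinsic used in the lesson. Note that the ARM CLZ instruction returns 32 for a zero argument, whereas the builtin is undefined for zero, hence the explicit guard:

```c
#include <stdint.h>

/* 32 - CLZ(x) finds the position of the most-significant 1-bit in x,
** an integer version of log-base-2. The Keil compiler provides the
** __clz() intrinsic; __builtin_clz() is the GCC/Clang equivalent,
** which is undefined for 0, so that case is guarded explicitly. */
static inline uint32_t LOG2(uint32_t x) {
    return (x == 0U) ? 0U : (32U - (uint32_t)__builtin_clz(x));
}

/* the scheduler core then reduces to a single array lookup:
**   OS_next = OS_thread[LOG2(OS_readySet)];
** where LOG2(0) == 0 selects the idle thread */
```

For example, with only Blinky2 (priority 2) ready, OS_readySet has just bit number 1 set and LOG2() yields 2; with Blinky1 (priority 5) ready, bit number 4 is set and LOG2() yields 5.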
This is necessary, because both implementations currently assume that the OS_thread[] array is populated consecutively up to the OS_threadNum level, which is no longer the case. In OS_tick(), instead of iterating over the whole OS_thread[] array with potentially big gaps of unused priority levels, you can leverage the fast LOG2() operation. Specifically, you can introduce a bitmask of delayed threads, which is similar to the OS_readySet, except it holds the delayed threads. While you are at it, you might also remove the no longer needed variables OS_currIdx and OS_threadNum. With the OS_delayedSet bitmask in place, the OS_tick() function will iterate only over the 1-bits in the bitmask, instead of scanning all the bits. But because you will need to remove the already processed bits from the bitmask, you need to use a temporary bitmask 'workingSet'. You loop as long as the workingSet has some bits set, meaning that some threads are delayed and need processing at this clock tick. You first quickly obtain the number of the highest-order 1-bit in the workingSet using your fast LOG2() macro, and you use it to index into the OS_thread[] array. You save the obtained thread pointer in a temporary variable t. You then assert that the t pointer is not zero, meaning that the thread is in use. And you also assert that the timeout of this thread is not zero, because it must be a delayed thread. Next, you decrement the timeout counter for this thread and if it becomes zero, you make the thread ready to run by setting the corresponding priority bit in the OS_readySet bitmask. At the same time, you remove the same bit from the OS_delayedSet bitmask, because this thread is no longer delayed. Finally, you always remove the same priority bit from the workingSet, because it is now processed. To avoid the apparent code duplication, you can introduce a temporary variable 'bit', like this. 
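The tick processing just described can be sketched as follows. This is a simplified, host-runnable model: the real OS_tick() works on the OSThread objects and runs inside a critical section, and the timeout counters live in the thread control blocks rather than in a flat array, but the bit-manipulation logic is the same:

```c
#include <assert.h>
#include <stdint.h>

static uint32_t OS_readySet;    /* bit (n-1) set: thread of priority n is ready   */
static uint32_t OS_delayedSet;  /* bit (n-1) set: thread of priority n is delayed */
static uint32_t OS_timeout[33]; /* timeout counters, indexed by priority (model)  */

static inline uint32_t LOG2(uint32_t x) {
    return (x == 0U) ? 0U : (32U - (uint32_t)__builtin_clz(x));
}

void OS_tick(void) {
    uint32_t workingSet = OS_delayedSet;  /* copy: bits are consumed below */
    while (workingSet != 0U) {
        uint32_t p   = LOG2(workingSet);  /* highest delayed priority */
        uint32_t bit = (1U << (p - 1U));
        assert(OS_timeout[p] != 0U);      /* it must really be delayed */
        if (--OS_timeout[p] == 0U) {
            OS_readySet   |= bit;   /* the thread becomes ready-to-run... */
            OS_delayedSet &= ~bit;  /* ...and is no longer delayed */
        }
        workingSet &= ~bit;  /* this bit has now been processed */
    }
}
```

The loop touches exactly as many bits as there are delayed threads, instead of scanning all 32 priority levels on every clock tick.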
In the OS_delay() function, you need to replace the OS_currIdx variable with the priority number of the current thread. In addition to removing the priority-bit from the ready set, the function now also needs to add the same bit to the OS_delayedSet, because this thread is now becoming delayed. And finally, to avoid code duplication, you can introduce a temporary variable 'bit', just like in OS_tick() before. The priority-based scheduler implementation is ready now, so let's try to build the code. The project fails to compile, because the signature of the OSThread_start() function has changed. With the priority-based scheduler, each thread you start needs a unique priority to run at. To remain consistent with the previously used diagrams, let's assign priority 5 to Blinky1 and priority 2 to Blinky2. As you can see, the priorities don't need to be consecutive. What really matters is that they are unique and that the priority of Blinky1 is higher than the priority of Blinky2. There is still one more compilation error. You now need to add explicit priority zero when starting the idle thread. The code builds cleanly, so let's see how it works. First, I'm sure you are curious about the code generated by the LOG2() macro, so let's set a breakpoint inside the OS_sched() function, where the macro is used. As you can see, the disassembly starts with loading addresses of the OS_thread array and the OS_readySet bitmask and then the whole calculation of the highest-priority thread ready to run is accomplished with just two instructions. The CLZ instruction counts the number of leading zeros in the bitmask and stores it in r0 in just one CPU cycle. Similarly, the RSB instruction--reverse subtract--converts it to the priority-number in r0, also in just one CPU cycle. The resulting priority in this case turns out to be 5, which you should recognize as the priority of Blinky1. Indeed, the OS_next variable is set to the address of the Blinky1 thread. 
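For completeness, the OS_delay() change described above can be modeled on the host in the same style. OS_curr, OS_sched(), and the interrupt locking of the real code are omitted, and the flat timeout array only stands in for the per-thread counters:

```c
#include <stdint.h>

static uint32_t OS_readySet;    /* bit (n-1) set: thread of priority n is ready   */
static uint32_t OS_delayedSet;  /* bit (n-1) set: thread of priority n is delayed */
static uint32_t OS_timeout[33]; /* per-priority timeout counters (host model)     */

/* model of OS_delay() called from the thread with priority curr_prio */
void OS_delay_model(uint32_t ticks, uint8_t curr_prio) {
    uint32_t bit = (1U << (curr_prio - 1U));
    OS_timeout[curr_prio] = ticks;  /* arm the timeout counter */
    OS_readySet   &= ~bit;  /* the current thread is no longer ready... */
    OS_delayedSet |=  bit;  /* ...it is now delayed */
    /* OS_sched();  in the real code: switch away with interrupts locked */
}
```

The single 'bit' variable serves both the ready-set removal and the delayed-set addition, which is the code-duplication fix mentioned above.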
When you finally let the code run free, you can watch some activity of the LEDs on your board. But to really see how your new priority-based scheduler works, you need to use a logic analyzer. As you can see, the Blinky1 thread now runs completely undisturbed even when Blinky2 becomes ready to run. Blinky1 always meets its deadline and so does Blinky2, even though Blinky2 is preempted by Blinky1 several times. Finally, the idle thread runs as well, but only when none of the other threads or interrupts are active. It's nice to notice that this analyzer trace matches exactly the timeline of the priority-based scheduler you've designed earlier and subsequently implemented in the MiROS RTOS. At this point, I would like you to notice the absolutely critical importance of BLOCKING for the priority-based scheduler. A high-priority thread runs for as long as it wants and no thread of lower priority can run until the high-priority thread BLOCKS and voluntarily relinquishes the CPU. Without the blocking, NO lower-priority thread will EVER run. For example, Blinky2 and the Idle thread run only when Blinky1 blocks, and similarly, the Idle thread runs only when BOTH Blinky1 and Blinky2 are blocked. This is very different from the round-robin scheduler, which executed all your blinky threads in succession even when they didn't block at all. So now you know why I had to implement efficient thread blocking in the last lesson 25 before introducing real-time priority-based scheduling in this lesson. With the code working as expected, let's now focus on the most important decision you face when working with the priority-based scheduler, and that is, on how to assign priorities to your threads. With just two threads, like your Blinky1 and Blinky2, you have only two possibilities: priority of Blinky1 higher than Blinky2 and priority of Blinky1 lower than Blinky2. As you just saw, the first option meets both real-time deadlines. 
On the other hand, it is easy to see that the second possibility of assigning lower priority to Blinky1 than Blinky2 leads to Blinky1 missing its deadline as soon as Blinky2 becomes ready to run. So, the rule that emerges here is to assign higher priorities to threads with shorter periods, which means also shorter deadlines. It turns out that this rule was already discovered in the 1970s. Specifically, in 1973 C. L. Liu and James W. Layland published a seminal paper titled "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment". I provide a web link to this article in the video description. But basically, Liu and Layland described in this article that the simple static priority-based scheduler that you've just implemented can be mathematically proven to meet all hard real-time deadlines for all threads under certain conditions. The method was later generalized and called "Rate-Monotonic Analysis" (RMA) or "Rate-Monotonic Scheduling" (RMS), and has been very well explained in another article, "Rate Monotonic Analysis for Real-Time Systems", to which I also provide a web link in the video description. I bring it up here, because no discussion of a priority-based scheduler can be complete without mentioning the RMA slash RMS method. The term rate monotonic (RM) derives from a method of assigning priorities to a set of threads as a monotonic function of the rate of a (periodic) thread. A function is called "monotonic" in mathematics when it preserves (or reverses) the order between two ordered sets. Specifically, RMA refers to priority assignment that maps increasing order of thread rates to increasing order of thread priorities. As such, it is just a fancy name for the simple rule you have discovered yourself. But there is of course more to RMA than this, and the full discussion would be really out of scope for this short lesson. 
For today, let me just summarize the most important guidelines of the RMA/RMS method: First, always assign thread priorities monotonically, meaning that threads with higher rates must run at higher priorities than threads with lower rates. Second, you need to know the CPU utilization of each thread, which you calculate as the ratio of the measured execution time Cn to the period Tn. And third, you need to calculate the total CPU utilization as the sum of all individual CPU utilization factors. If this total utilization is below the theoretical bound, all threads in the set are guaranteed to meet their deadlines. This was already proven mathematically by Liu and Layland back in the 1973 paper. The utilization bound U(n) depends on the number of threads n. For a large number of threads, U(n) approaches the natural logarithm of 2, which is just under 0.7. So, in practice, if you stay below 70% of CPU utilization, your set of threads will be schedulable, meaning that they will all meet their deadlines. For example, the total CPU utilization for your Blinky1 and Blinky2 threads is as follows: 1.2ms/2ms for Blinky1 plus 3.6ms/54ms for Blinky2. The total CPU utilization is thus about 0.67, which is below the theoretical bound. Of course, the basic RMA assumes periodic threads that execute in constant time. But the method can be extended to aperiodic threads with variable execution times. In that case, you need to consider the worst case, that is, the shortest time between thread activations and the longest execution time. Also, in practice typically only a few highest-priority threads have hard real-time deadlines, while the other threads have only soft real-time requirements. In that case, you use RMA for the hard real-time threads and prioritize all soft real-time threads lower. 
The beauty of preemptive priority-based scheduling is that a high-priority thread can always immediately preempt all lower-priority threads, so a high-priority thread is insensitive to changes in the execution times or periods of lower-priority threads. In other words, the preemptive, priority-based scheduler decouples threads in the time domain. For all these reasons, the preemptive, priority-based scheduler became the norm and is supported in most Real-Time Operating Systems to this day. This concludes this lesson on real-time computing and priority-based scheduling. In the next lesson, you'll advance by another decade from the 1970's to the 1980's, when commercial RTOSes went mainstream, and when inter-thread synchronization and communication mechanisms were added. If you like this channel, please subscribe to stay tuned. You can also visit state-machine.com/quickstart for the class notes and project file downloads.