Welcome to the Modern Embedded Systems Programming course. My name is Miro Samek and in this fourth lesson on Real-Time Operating Systems (RTOS) I'll show you how to replace the horribly inefficient polling for events with efficient BLOCKING of threads. Specifically, in this lesson you will add a blocking delay function to the MiROS RTOS and you'll learn about some far-reaching implications of thread blocking on the RTOS design.

As usual, let's get started by making a copy of the previous lesson 24 directory and renaming it to lesson 25. Get inside the new lesson 25 directory and double-click on the uVision project "lesson" to open it.

To remind you quickly what happened so far: in the last lesson you brought the MiROS RTOS to the stage where it can run fully autonomously, performing simple round-robin scheduling of your threads. However, the threads still use primitive polling inside the BSP_delay() function. The brain-dead polling inside BSP_delay() is a horribly inefficient waste of CPU cycles that could be put to better use somewhere else.

But how could you eliminate this waste? Well, you now have a new tool in your toolbox, which is the context switch. With this tool, you could approach the delay() function completely differently. Instead of starting a brain-dead polling loop, at the beginning of the delay you could switch the context away from the delayed thread and then switch the context back to it at the end of the delay. In between these two context switches the thread will be very efficiently BLOCKED, consuming no CPU cycles at all.

From the perspective of the thread, the blocking delay will be no different whatsoever from the polling delay. Either way, the thread will simply call a delay() function, which will not return until the delay elapses. But from the perspective of the system as a whole, this makes all the difference, because now the CPU cycles between the two context switches will be available to other threads that actually have something to do.

From the description so far, it should be clear that such a blocking delay implementation must become part of the RTOS, because you need the scheduler and the context switch. As part of the RTOS, then, the blocking delay operation will be called the OS_delay() function.

At this point I'd like to call your attention to an interesting and important trend, which is the migration of functionality, such as delay() in this case, from the application to the system-level software, where it can be handled much more efficiently. You will encounter this trend repeatedly as you learn about more advanced software techniques that I'll cover in future lessons.

But before jumping into the implementation of the OS_delay() function, you need to realize one important property of a blocked thread. To see this better, you need to use a timing diagram with all the context switches that are happening, but which I have omitted for simplicity so far. As you can see, a thread running a polling BSP_delay() function participates in the round-robin scheduling, meaning that sometimes the thread is scheduled in and sometimes it is scheduled out. In contrast, a thread that is in the Blocked state should never be scheduled in. It simply is NOT READY to run. This means that your RTOS threads have a new property, which tells the scheduler whether a thread is READY or NOT READY to run. Another way to visualize this property graphically is through a state diagram.
I will introduce state diagrams more formally in the future lessons about state machines, but today I just want to show you a diagram depicting the life cycle of a thread. When a thread is first created, meaning that you have allocated the OSThread object and the stack for it, it becomes Dormant, shown here as a rounded-corner rectangle. In this state, it cannot do anything until you call the OSThread_start() function on it. After that call, a thread looks exactly as though it was preempted by an interrupt, so it transitions to the stage of its life cycle shown as Preempted. Finally, when the thread is scheduled for running, it transitions to the Running state. After a while, the thread gets scheduled out and another thread gets scheduled in, at which point the original thread becomes Preempted again. Please note that in a single-CPU system exactly one thread can be in the Running state at a time, which is shown here as a circle with the number 1 in it.

But now, a Running thread can call the OS_delay() function, at which time it becomes Blocked, meaning NOT ready to run. Now, let's think for a minute about the transition out of the Blocked state, which must happen at the end of the delay. This needs to be managed centrally for all threads by the RTOS, obviously, because a Blocked thread cannot do anything, and in particular, it cannot un-block itself. This central RTOS service will need to be activated periodically, typically from the system clock tick interrupt, and therefore will be called OS_tick(). Upon every activation, OS_tick() needs to update the delays of the individual threads and un-block the threads whose delays have elapsed.

An interesting observation now is that if a thread is NOT ready in the Blocked state, it must be ready in the Preempted and Running states. This can be shown graphically by means of the Ready "super"-state encompassing those two states. And indeed, while in the Ready super-state, which means either in Preempted or Running, the thread is ready and willing to run, except that sometimes it needs to be Preempted to share the CPU with the other threads that are also ready and willing to run.

The Not-Ready-To-Run property of the new Blocked state in the thread life cycle also has another interesting consequence. And that is that a system of blocking threads is incomplete and necessarily requires one special thread that is always ready to run and cannot block. To see this, consider a system with two threads, like your Blinky1 and Blinky2. As long as both of them are in the Ready super-state, the scheduler can run one or the other. When one of the threads goes to the Blocked state, the scheduler can still run the other thread. But what if both threads become Blocked at some point? The CPU still must be running exactly one thread at a time, but none of the existing ones are Ready. The solution is to add another, special thread, which the scheduler can run when no other threads are Ready. This situation in the system is called the "idle" condition and therefore the special thread is called the "idle" thread. This "idle" thread is not allowed to block, so its life cycle does not have the Blocked state.

So, as you can see, thread blocking has surprisingly many consequences, all of which you need to carefully consider in your implementation. Going back to the code, then, let's start with creating the idle thread.
At first, the idle thread will be created exactly like any other thread, and only later you will add protections against it ever going into the Blocked state. So let's just go to main.c and copy the boilerplate code for one of your blinky threads. After pasting it into miros.c, rename blinky1 to idleThread. The thread routine for the idleThread will still have the endless "superloop" structure, but it will simply invoke a callback function OS_onIdle(), where the application-level code will be able to perform some processing.

Just like any other thread, the idle thread needs to be started. The most convenient place for this is OS_init(). Now, the interesting part is how to supply the stack for the idle thread. One option is to pre-allocate it in the RTOS, but at this level you don't know how much stack would be used by the OS_onIdle() callback. Therefore, it is better to simply defer the idle stack allocation to the application, just as it is done in the OSThread_start() function. So now, you need to update the OS_init() signature in the miros.h header file, where you also need to add the OS_onIdle() callback prototype. When you try to build the project now, you get errors, because the RTOS API has changed. To fix the problems, you need to add the stack for the idle thread to the OS_init() call in main.c, and you also need to define the OS_onIdle() callback in bsp.c. At first, let's just make the idle callback do nothing, so that the project links correctly.

Moving on to implementing the centralized thread delay management, you need to augment the OSThread object by adding a private timeout counter to each thread. These timeout counters will work as follows: The OS_delay() function will start by loading the timeout counter with the number of clock ticks. Subsequently, every clock tick will call the OS_tick() function, which will decrement all non-zero timeout counters. When any of these down-counters reaches zero, the corresponding thread will be un-blocked and scheduled. As you can see, having a separate timeout counter in each thread will allow the delays to run in parallel and independently from each other. Of course, OS_tick() will run all the time, but all timeout counters that are already zero will be simply left alone.

The next new attribute of a thread is a boolean flag, which will be 'true' when the thread is Ready and 'false' when it is NOT ready to run. Putting such a flag in each thread object would be a logical thing to do, but it turns out not to be optimal for the efficiency of the code. Instead, it is more efficient to group the ready-to-run flags together in a 32-bit bitmask OS_readySet. This OS_readySet bitmask will work together with the OS_thread[] array as follows: Each individual bit in the OS_readySet bitmask corresponds to a thread in the OS_thread[] array, with the idle thread skipped, because it is always ready to run. The idle thread is always at index zero, because it has been started from OS_init(), which ensures that it is the very first thread in the OS_thread[] array. Therefore, skipping index zero means that the bits in OS_readySet are simply shifted by one with respect to the indexes into the OS_thread[] array. For example, a thread at index 1 corresponds to bit 0, a thread at index 2 to bit 1, a thread at index n to bit n-1, and finally the last thread at index 32 corresponds to the last bit, number 31, in the OS_readySet bitmask.
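Here is a minimal sketch of how these additions might look in the code. The names follow the conventions used so far in this course (OSThread, OS_thread[], OS_readySet, OSThread_start()), but the details such as OS_threadNum, the OSThreadHandler typedef, and the idle-stack size in main.c are assumptions for illustration, not the finished MiROS code:

    /* miros.h -- additions for this lesson (sketch) */
    #include <stdint.h>

    typedef struct {
        void *sp;          /* private stack pointer of the thread */
        uint32_t timeout;  /* private timeout down-counter; 0 means "not delayed" */
    } OSThread;

    typedef void (*OSThreadHandler)(void);  /* thread-routine pointer (assumed name) */

    void OS_init(void *stkSto, uint32_t stkSize); /* now takes the idle-thread stack */
    void OS_onIdle(void);                         /* callback defined in the BSP */
    void OSThread_start(OSThread *me, OSThreadHandler threadHandler,
                        void *stkSto, uint32_t stkSize); /* from the previous lessons */

    /* miros.c -- additions (sketch) */
    OSThread *OS_thread[32 + 1]; /* started threads; index 0 is the idle thread */
    uint32_t  OS_readySet;       /* bitmask: bit (n-1) set <=> OS_thread[n] is ready to run */
    uint8_t   OS_threadNum;      /* number of started threads (assumed from lesson 24) */
    uint8_t   OS_currIdx;        /* index of the currently running thread */

    static OSThread idleThread;

    static void main_idleThread(void) {
        while (1) {          /* the usual endless "superloop" of a thread routine */
            OS_onIdle();     /* let the application perform some idle-time processing */
        }
    }

    void OS_init(void *stkSto, uint32_t stkSize) {
        /* ... PendSV priority setup from the previous lesson ... */

        /* start the idle thread first, so that it lands at index 0 in OS_thread[] */
        OSThread_start(&idleThread, &main_idleThread, stkSto, stkSize);
    }

    /* main.c -- supply the idle-thread stack (the size 40 is just an example) */
    uint32_t stack_idleThread[40];
    /* inside main(): */
        OS_init(stack_idleThread, sizeof(stack_idleThread));

    /* bsp.c -- the idle callback; empty for now, just so that the project links */
    void OS_onIdle(void) {
    }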
The various values of the OS_readySet bitmask correspond to the following situations: bits 0, 1, and n-1 set mean that threads 1, 2, and n are ready to run. Only bit 1 set means that only thread 2 is ready to run. And finally, all bits at zero represent the idle condition of the system. This situation can be checked in the code very efficiently by simply comparing OS_readySet to zero in a single instruction. The other benefits of grouping the ready-to-run flags together in one bitmask will become clear in the next lesson, where you will implement priority-based scheduling. In that case the bit numbers in OS_readySet will represent thread priorities. But for now, you are still working with round-robin scheduling, where there is no notion of thread priority.

Going back to the code, let's now modify the scheduler so that it will avoid scheduling threads that are NOT ready to run. First, you quickly check for the idle condition by comparing OS_readySet to zero, in which case you set OS_currIdx to the index of the idle thread, that is, to zero. As you will see later, this will be the most frequent path through the code. Otherwise, if OS_readySet is not zero, you have some ready-to-run threads. But now you cannot simply choose the next thread index, because that thread might not be ready to run. Instead, you need to keep going in a round-robin fashion until you find a non-idle thread that IS ready to run. As shown in this diagram, to check whether thread n is ready to run, you need to synthesize a bitmask that has only bit n-1 set and bitwise-AND it with OS_readySet. If the result is zero, you need to keep going, because it means that the bit is not set, so the corresponding thread is NOT ready to run. Also, you need to be careful to exclude the idle thread from the round-robin scheduling by skipping index 0. So, here is the final algorithm: you advance OS_currIdx in round-robin fashion, but you skip the idle thread by wrapping around to 1 instead of zero. You keep going as long as OS_readySet indicates that the thread is NOT ready to run.

Now, you are finally in a position to tackle the most interesting new service of your RTOS, which is the blocking OS_delay() function. The signature of this function will be identical to that of the polling BSP_delay(). Inside OS_delay(), you need to disable interrupts and store the tick count in the current thread's timeout counter. Next, you need to make the thread NOT ready to run by clearing the corresponding bit in the OS_readySet bitmask. And finally, you need to call the scheduler to immediately switch the context away from this thread, so that it can become Blocked. As you remember from the previous lessons, OS_sched() is designed to be called with interrupts disabled, and the context switch happens right after the interrupts are re-enabled.

The last touch in the OS_delay() implementation is to forbid calling it from the idle thread, because the idle thread should never go to the Blocked state. You can accomplish this by adding an assertion that the current thread is NOT the idle thread at index 0. This specific assertion type is called a precondition, because it expresses a requirement that must be met before calling this service. To make this intent clear, the qassert.h header file provides a specific macro named Q_REQUIRE(), which works exactly like Q_ASSERT(), except that its name conveys the intent much more precisely.
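Putting the last few paragraphs together, here is a minimal sketch of the modified scheduler and the new OS_delay(). The OS_curr and OS_next pointers, the PendSV trigger, and the qassert.h macros are assumed to be available from the previous lessons, and __disable_irq()/__enable_irq() are the standard CMSIS intrinsics:

    /* miros.c -- round-robin scheduler that skips threads that are NOT ready (sketch) */
    /* OS_curr and OS_next are the current/next thread pointers from the previous lessons */
    void OS_sched(void) {
        if (OS_readySet == 0U) {      /* idle condition? (the most frequent case) */
            OS_currIdx = 0U;          /* index of the idle thread */
        }
        else {
            do {                      /* round-robin over the started threads... */
                ++OS_currIdx;
                if (OS_currIdx == OS_threadNum) {
                    OS_currIdx = 1U;  /* ...wrapping around to 1 to skip the idle thread */
                }
            } while ((OS_readySet & (1U << (OS_currIdx - 1U))) == 0U); /* NOT ready? keep going */
        }
        OS_next = OS_thread[OS_currIdx];

        if (OS_next != OS_curr) {     /* context switch needed? */
            *(uint32_t volatile *)0xE000ED04 = (1U << 28); /* pend PendSV, as in the previous lessons */
        }
    }

    /* miros.c -- the blocking delay */
    void OS_delay(uint32_t ticks) {
        __disable_irq();

        /* precondition: never call OS_delay() from the idle thread at index 0 */
        Q_REQUIRE(OS_curr != OS_thread[0]);

        OS_curr->timeout = ticks;                  /* arm this thread's down-counter */
        OS_readySet &= ~(1U << (OS_currIdx - 1U)); /* mark the thread NOT ready (Blocked) */
        OS_sched();                                /* switch the context away from this thread */
        __enable_irq();                            /* the switch takes effect here, via PendSV */
    }

Note that OS_delay() clears the thread's ready bit before calling OS_sched(), so the scheduler is guaranteed to switch away from the calling thread, either to another ready thread or to the idle thread.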
To quickly summarize the status so far, you have implemented the idle thread and you have added the Blocked state to the thread's lifecycle. You have also just implemented the transition to the Blocked state. This transition is unique in that it is the only one in the thread's lifecycle that is taken voluntarily by the Running thread itself. All other transitions are forced on the thread, not necessarily with its cooperation or consent. This includes the last transition you still need to implement, which is the OS_tick() service that takes a thread out of the Blocked state.

The implementation of OS_tick() needs to loop over all threads, except the idle thread obviously, and decrement all non-zero timeout counters. The threads corresponding to counters that have just reached zero need to be un-blocked. The code of OS_tick() will therefore consist of a for-loop over all threads, starting with index 1 to skip the idle thread. Inside the loop you first need to check the timeout counter of the corresponding thread, and if it is not zero, you need to decrement it. If the counter becomes zero, you need to make the thread ready to run by setting the corresponding bit in the OS_readySet bitmask. Please note that this implementation assumes that OS_tick() will be called from a single Interrupt Service Routine, such as SysTick_Handler(). In that case, you don't need to disable interrupts inside OS_tick(), because an ISR cannot be preempted by a thread, so the timeout counters cannot be changed unexpectedly, which could only happen in OS_delay(). Also, it is not necessary to call OS_sched() from OS_tick(), because the scheduler is called at the end of SysTick_Handler() anyway, to schedule any un-blocked threads.

The implementation of the blocking delay is almost ready now, but you still need to make some final adjustments. First, any started thread, except the idle thread, needs to be explicitly made ready to run by setting the corresponding bit in the OS_readySet bitmask. This corresponds to the transition into the Ready super-state in the thread's life cycle. Second, you should increment the MiROS RTOS version number to lesson 25, both in miros.c and in miros.h. Third, in miros.h, you need to add the prototypes for the newly added services: OS_delay() and OS_tick(). And finally, you need to actually use the new OS_delay() and OS_tick() instead of the previous polling BSP_delay() and BSP_tickCtr().

The code builds cleanly, so now I understand that you are quite keen to see how it works. Let's then open the debugger and at first run the code freely. What do you know, the code appears to be blinking all LEDs exactly as before. But now, it really works quite differently. When you break into the code, you find it in the OS_onIdle() callback. Indeed, when you set a breakpoint in the scheduler, you can see that the OS_readySet bitmask is zero, so the scheduler picks the idle thread. All this means that most of the time all threads are blocked, so the system spends most of its time in the idle condition. Only occasionally, a thread gets unblocked and scheduled to run.

All right, as you can see the idle thread has become quite important, because the CPU spends most of its time there. So, let's study it a bit more. Specifically, let's put some code into the OS_onIdle() callback. I'll use the technique from the previous lesson to toggle a pin up and down from the OS_onIdle() callback, so that this activity can be observed with a logic analyzer.
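Before we get to that, here is a minimal sketch of the OS_tick() loop and the final adjustments just described. The placement of the ready-bit line assumes that, at that point inside OSThread_start(), OS_threadNum still holds the index of the thread being started, and the SysTick_Handler() structure is assumed to follow the previous lessons:

    /* miros.c -- the clock-tick processing (sketch) */
    void OS_tick(void) {
        uint8_t n;
        for (n = 1U; n < OS_threadNum; ++n) {  /* loop over all threads, skipping the idle thread */
            if (OS_thread[n]->timeout != 0U) { /* is this thread delayed? */
                --OS_thread[n]->timeout;
                if (OS_thread[n]->timeout == 0U) {
                    OS_readySet |= (1U << (n - 1U)); /* delay elapsed: un-block the thread */
                }
            }
        }
    }

    /* miros.c -- addition inside OSThread_start(), right after the thread
       is registered in OS_thread[] (placement is an assumption): */
        if (OS_threadNum > 0U) {                        /* any thread except the idle thread? */
            OS_readySet |= (1U << (OS_threadNum - 1U)); /* make it ready to run */
        }

    /* bsp.c -- the system clock tick ISR now calls the new OS_tick() service */
    void SysTick_Handler(void) {
        OS_tick();       /* update all timeouts and un-block the expired threads */

        __disable_irq();
        OS_sched();      /* the scheduler runs at the end of the ISR, as before */
        __enable_irq();
    }

In main.c, the blinky threads then simply call OS_delay() with the desired number of clock ticks wherever they previously called the polling BSP_delay().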
As I'm running out of pins, I will reuse the RED LED from the Blinky3 thread. To avoid any conflicts around the RED LED, I will simply comment out the starting of the Blinky3 thread in main.c, so that this thread will stay in the Dormant state. When I build and load the code to the board now, the Blue and Green LEDs blink as before, but the Red LED only glows at a low intensity.

Now, let's connect logic analyzer signals D1 through D4 to pins PF1 through PF4, respectively. As you can see in the logic analyzer view, the signal D1, corresponding to the RED LED, keeps toggling rapidly, which means that the OS_onIdle() callback is constantly called. The signals for the Green and Blue LEDs change very slowly in comparison. The signal D4, which is switched in the SysTick ISR, is used here as the trigger. Most of the time the SysTick ISR takes about 3 microseconds, but occasionally it gets significantly longer.

To investigate this behavior, let's look through the accumulated traces. Indeed, here is an instance where SysTick gets longer. Interestingly, at the same time the Green LED changes from low to high and the idle thread activity gets delayed by over 5 microseconds. The explanation is quite simple. Most of the time, SysTick simply updates the timeout counters, but does not switch the context away from the idle thread. You can see this directly from the idle thread activity before and after the SysTick. Only occasionally, a thread other than idle gets unblocked and scheduled. After the SysTick, the unblocked thread runs and does its thing, such as toggling an LED in this case. After this, the thread calls OS_delay() and voluntarily relinquishes the CPU, which switches back to the idle thread.

Now, regarding the actual timing, the longest SysTick takes about 4.5 microseconds, which is three times longer than before blocking was introduced. This increased overhead is caused by the more complex OS_tick() and OS_sched() implementations. On the other hand, the context switch time remains about the same, below 2 microseconds. The thread blocking in OS_delay(), which involves marking the thread as NOT ready and scheduling another thread, takes about 3.8 microseconds. Finally, the longest time spent outside the idle thread is only 10 microseconds. In most other cases this time is even shorter.

With all this, I hope to have convinced you that indeed the CPU spends almost all of its time in the idle thread, except for a few microseconds here and there to execute all the other threads. But the idle thread still basically wastes all the CPU cycles it gets. So the question is: what can you do about it? Well, the best thing you can do is to put the CPU and its peripherals into a low-power sleep mode. Indeed, you will never achieve a truly low-power design with the CPU and peripherals running at full speed. So, for battery-operated applications you have to use sleep modes. The full discussion of the subject of low-power design goes far beyond the scope of this lesson, but the most important takeaway for you today is that the idle thread, and specifically the OS_onIdle() callback, is the ideal, central place in the code to put the CPU and peripherals into a low-power sleep mode. To this end, the ARM Cortex-M CPU provides a special instruction called WFI (Wait For Interrupt), which stops the CPU clock until an interrupt occurs. You can put it in your OS_onIdle() callback using the __WFI() CMSIS function and see what happens.
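A minimal sketch of the sleepy idle callback in bsp.c might look as follows; __WFI() is provided by the CMSIS header that the BSP already includes, and the pin toggling around it is the same test-pin technique as before:

    /* bsp.c -- idle callback that puts the CPU to sleep until the next interrupt */
    void OS_onIdle(void) {
        /* toggle the test pin here, as before, to visualize the idle activity */

        __WFI(); /* CMSIS intrinsic: stop the CPU clock and Wait For Interrupt */
    }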
When you load this code to the board, you can see that the Green and Blue LEDs blink exactly as before, but the Red LED seems not to be active at all. The situation becomes clearer in the logic analyzer, where you can see that the Red LED actually toggles, but very infrequently. In fact, it toggles exactly once after each interrupt. This is because now the CPU gets stopped inside each call to OS_onIdle(), and is only woken up by the next SysTick interrupt.

This concludes this lesson about efficient thread blocking and its profound implications. At this point, your MiROS RTOS implements a round-robin time-sharing scheduler with blocking, which corresponds to the state of the art of computer systems in the early 1960s. Here is an introduction to the time-sharing system at MIT from the year 1963. In the next lesson you'll bring the MiROS RTOS into the 1970s by implementing preemptive, priority-based scheduling.

If you like this channel, please subscribe to stay tuned. You can also visit state-machine.com/quickstart for the class notes and project file downloads.