Welcome to the Modern Embedded Systems Programming course. My name is Miro Samek and in this fourth lesson on Real-Time Operating Systems (RTOS) I'll show you how to replace the horribly inefficient polling for events with efficient BLOCKING of threads. Specifically, in this lesson you will add a blocking delay function to the MiROS RTOS and you'll learn about some far-reaching implications of thread blocking on the RTOS design.

As usual, let's get started by making a copy of the previous lesson 24 directory and renaming it to lesson 25. Get inside the new lesson 25 directory and double-click on the uVision project "lesson" to open it.

To remind you quickly what happened so far: in the last lesson you brought the MiROS RTOS to the stage where it can run fully autonomously, performing simple round-robin scheduling of your threads. However, the threads still use primitive polling inside the BSP_delay() function. The brain-dead polling inside BSP_delay() is a horribly inefficient waste of CPU cycles that could be put to better use somewhere else.

But how could you eliminate this waste? Well, you now have a new tool in your toolbox, which is the context switch. With this tool, you could approach the delay() function completely differently. Instead of starting a brain-dead polling loop, at the beginning of the delay you could switch the context away from the delayed thread and then switch the context back to it at the end of the delay. In between these two context switches the thread will be very efficiently BLOCKED, consuming no CPU cycles at all.

From the perspective of the thread, the blocking delay will be no different whatsoever from the polling delay. Either way, the thread will simply call a delay() function, which will not return until the delay elapses. But from the perspective of the system as a whole, this makes all the difference, because now the CPU cycles between the two context switches will be available to other threads that actually have something to do.

From the description so far, it should be clear that such a blocking delay implementation must become part of the RTOS, because you need the scheduler and the context switch. As part of the RTOS, then, the blocking delay operation will be called the OS_delay() function.

At this point I'd like to call your attention to an interesting and important trend, which is the migration of functionality, such as delay() in this case, from the application to the system-level software, where it can be handled much more efficiently. You will encounter this trend repeatedly as you learn about more advanced software techniques that I'll cover in future lessons.

But before jumping into the implementation of the OS_delay() function, you need to realize one important property of a blocked thread. To see this better, you need to use a timing diagram with all the context switches that are happening, but which I have omitted for simplicity so far. As you can see, a thread running a polling BSP_delay() function participates in the round-robin scheduling, meaning that sometimes the thread is scheduled in and sometimes it is scheduled out. In contrast, a thread that is in the Blocked state should never be scheduled in. It simply is NOT READY to run. This means that your RTOS threads have a new property, which tells the scheduler whether a thread is READY or NOT READY to run. Another way to visualize this property graphically is through a state diagram.
I will introduce state diagrams more formally in the future lessons about state machines, but today I just want to show you a diagram depicting the life cycle of a thread. When a thread is first created, meaning that you have allocated the OSThread object and the stack for it, it becomes Dormant, shown here as a rounded-corner rectangle. In this state, it cannot do anything until you call the OSThread_start() function on it. After that call, a thread looks exactly as though it was preempted by an interrupt, so it transitions to the stage of its life cycle shown as Preempted. Finally, when the thread is scheduled for running, it transitions to the Running state. After a while, the thread gets scheduled out and another thread gets scheduled in, at which point the original thread becomes Preempted again. Please note that in a single-CPU system exactly one thread can be in the Running state at a time, which is shown here as a circle with the number 1 in it.

But now, a Running thread can call the OS_delay() function, at which time it becomes Blocked, meaning NOT ready to run. Now, let's think for a minute about the transition out of the Blocked state, which must happen at the end of the delay. This needs to be managed centrally for all threads by the RTOS, obviously, because a Blocked thread cannot do anything, and in particular, it cannot un-block itself. This central RTOS service will need to be activated periodically, typically from the system clock tick interrupt, and therefore will be called OS_tick(). Upon every activation, OS_tick() needs to update the delays of the individual threads and un-block the threads whose delays have elapsed.

An interesting observation now is that if a thread is NOT ready in the Blocked state, it must be ready in the Preempted and Running states. This can be shown graphically by means of the Ready "super"-state encompassing those two states. And indeed, while in the Ready super-state, which means either in Preempted or Running, the thread is ready and willing to run, except that sometimes it needs to be Preempted to share the CPU with the other threads that are also ready and willing to run.

The Not-Ready-To-Run property of the new Blocked state in the thread life cycle also has another interesting consequence. And that is that a system of blocking threads is incomplete and necessarily requires one special thread that is always ready to run and cannot block. To see this, consider a system with two threads, like your Blinky1 and Blinky2. As long as both of them are in the Ready super-state, the scheduler can run one or the other. When one of the threads goes to the Blocked state, the scheduler can still run the other thread. But what if both threads become Blocked at some point? The CPU still must be running exactly one thread at a time, but none of the existing ones are Ready. The solution is to add another, special thread, which the scheduler can run when no other threads are Ready. This situation in the system is called the "idle" condition and therefore the special thread is called the "idle" thread. This "idle" thread is not allowed to block, so its life cycle does not have the Blocked state.

So, as you can see, thread blocking has surprisingly many consequences, all of which you need to carefully consider in your implementation. Going back to the code, then, let's start with creating the idle thread.
At first, the idle thread will be created exactly like any other thread, and only later you will add protections against it ever going into the Blocked state. So let's just go to main.c and copy the boilerplate code for one of your blinky threads. After pasting it into miros.c, rename blinky1 to idleThread. The thread routine for the idleThread will still have the endless "superloop" structure, but it will simply invoke a callback function OS_onIdle(), where the application-level code will be able to perform some processing.

Just like any other thread, the idle thread needs to be started. The most convenient place for this is OS_init(). Now, the interesting part is how to supply the stack for the idle thread. One option is to pre-allocate it in the RTOS, but at this level you don't know how much stack would be used by the OS_onIdle() callback. Therefore, it is better to simply defer the idle stack allocation to the application, just as it is done in the OSThread_start() function. So now, you need to update the OS_init() signature in the miros.h header file, where you also need to add the OS_onIdle() callback prototype. When you try to build the project now, you get errors, because the RTOS API has changed. To fix the problems, you need to add the stack for the idle thread to the OS_init() call in main.c, and you also need to define the OS_onIdle() callback in bsp.c. At first, let's just make the idle callback do nothing, so that the project links correctly.

Moving on to implementing the centralized thread delay management, you need to augment the OSThread object by adding a private timeout counter to each thread. These timeout counters will work as follows: The OS_delay() function will start by loading the timeout counter with the number of clock ticks. Subsequently, every clock tick will call the OS_tick() function, which will decrement all non-zero timeout counters. When any of these down-counters reaches zero, the corresponding thread will be un-blocked and scheduled. As you can see, having a separate timeout counter in each thread will allow the delays to run in parallel and independently from each other. Of course, OS_tick() will run all the time, but all timeout counters that are already zero will be simply left alone.

The next new attribute of a thread is a boolean flag, which will be 'true' when the thread is Ready and 'false' when it is NOT ready to run. Putting such a flag in each thread object would be a logical thing to do, but it turns out not to be optimal for the efficiency of the code. Instead, it is more efficient to group the ready-to-run flags together in a 32-bit bitmask OS_readySet. This OS_readySet bitmask will work together with the OS_thread[] array as follows: Each individual bit in the OS_readySet bitmask corresponds to a thread in the OS_thread[] array, with the idle thread skipped, because it is always ready to run. The idle thread is always at index zero, because it has been started from OS_init(), which ensures that it is the very first thread in the OS_thread[] array. Therefore, skipping index zero means that the bits in OS_readySet are simply shifted by one with respect to the indexes into the OS_thread[] array. For example, a thread at index 1 corresponds to bit 0, a thread at index 2 to bit 1, a thread at index n to bit n-1, and finally the last thread at index 32 corresponds to the last bit, number 31, in the OS_readySet bitmask.
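Here is a minimal sketch of how these additions might look in the code. The names follow the conventions used so far in this course (OSThread, OS_thread[], OS_readySet, OSThread_start()), but the details such as OS_threadNum, the OSThreadHandler typedef, and the idle-stack size in main.c are assumptions for illustration, not the finished MiROS code:

    /* miros.h -- additions for this lesson (sketch) */
    #include <stdint.h>

    typedef struct {
        void *sp;          /* private stack pointer of the thread */
        uint32_t timeout;  /* private timeout down-counter; 0 means "not delayed" */
    } OSThread;

    typedef void (*OSThreadHandler)(void);  /* thread-routine pointer (assumed name) */

    void OS_init(void *stkSto, uint32_t stkSize); /* now takes the idle-thread stack */
    void OS_onIdle(void);                         /* callback defined in the BSP */
    void OSThread_start(OSThread *me, OSThreadHandler threadHandler,
                        void *stkSto, uint32_t stkSize); /* from the previous lessons */

    /* miros.c -- additions (sketch) */
    OSThread *OS_thread[32 + 1]; /* started threads; index 0 is the idle thread */
    uint32_t  OS_readySet;       /* bitmask: bit (n-1) set <=> OS_thread[n] is ready to run */
    uint8_t   OS_threadNum;      /* number of started threads (assumed from lesson 24) */
    uint8_t   OS_currIdx;        /* index of the currently running thread */

    static OSThread idleThread;

    static void main_idleThread(void) {
        while (1) {          /* the usual endless "superloop" of a thread routine */
            OS_onIdle();     /* let the application perform some idle-time processing */
        }
    }

    void OS_init(void *stkSto, uint32_t stkSize) {
        /* ... PendSV priority setup from the previous lesson ... */

        /* start the idle thread first, so that it lands at index 0 in OS_thread[] */
        OSThread_start(&idleThread, &main_idleThread, stkSto, stkSize);
    }

    /* main.c -- supply the idle-thread stack (the size 40 is just an example) */
    uint32_t stack_idleThread[40];
    /* inside main(): */
        OS_init(stack_idleThread, sizeof(stack_idleThread));

    /* bsp.c -- the idle callback; empty for now, just so that the project links */
    void OS_onIdle(void) {
    }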
The various values of the OS_readySet bitmask correspond to the following situations: bits 0, 1, and n-1 set mean that threads 1, 2, and n are ready to run. Only bit 1 set means that only thread 2 is ready to run. And finally, all bits at zero represent the idle condition of the system. This situation can be checked in the code very efficiently by simply comparing OS_readySet to zero in a single instruction. The other benefits of grouping the ready-to-run flags together in one bitmask will become clear in the next lesson, where you will implement priority-based scheduling. In that case the bit numbers in OS_readySet will represent thread priorities. But for now, you are still working with round-robin scheduling, where there is no notion of thread priority.

Going back to the code, let's now modify the scheduler so that it will avoid scheduling threads that are NOT ready to run. First, you quickly check for the idle condition by comparing OS_readySet to zero, in which case you set OS_currIdx to the index of the idle thread, that is, to zero. As you will see later, this will be the most frequent path through the code. Otherwise, if OS_readySet is not zero, you have some ready-to-run threads. But now you cannot simply choose the next thread index, because that thread might not be ready to run. Instead, you need to keep going in a round-robin fashion until you find a non-idle thread that IS ready to run. As shown in this diagram, to check whether thread n is ready to run, you need to synthesize a bitmask that has only bit n-1 set and bitwise-AND it with OS_readySet. If the result is zero, you need to keep going, because it means that the bit is not set, so the corresponding thread is NOT ready to run. Also, you need to be careful to exclude the idle thread from the round-robin scheduling by skipping index 0. So, here is the final algorithm: you advance OS_currIdx in round-robin fashion, but you skip the idle thread by wrapping around to 1 instead of zero. You keep going as long as OS_readySet indicates that the thread is NOT ready to run.

Now, you are finally in a position to tackle the most interesting new service of your RTOS, which is the blocking OS_delay() function. The signature of this function will be identical to that of the polling BSP_delay(). Inside OS_delay(), you need to disable interrupts and store the tick count in the current thread's timeout counter. Next, you need to make the thread NOT ready to run by clearing the corresponding bit in the OS_readySet bitmask. And finally, you need to call the scheduler to immediately switch the context away from this thread, so that it can become Blocked. As you remember from the previous lessons, OS_sched() is designed to be called with interrupts disabled, and the context switch happens right after the interrupts are re-enabled.

The last touch in the OS_delay() implementation is to forbid calling it from the idle thread, because the idle thread should never go to the Blocked state. You can accomplish this by adding an assertion that the current thread is NOT the idle thread at index 0. This specific assertion type is called a precondition, because it expresses a requirement that must be met before calling this service. To make this intent clear, the qassert.h header file provides a specific macro named Q_REQUIRE(), which works exactly like Q_ASSERT(), except that its name conveys the intent much more precisely.
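Putting the last few paragraphs together, here is a minimal sketch of the modified scheduler and the new OS_delay(). The OS_curr and OS_next pointers, the PendSV trigger, and the qassert.h macros are assumed to be available from the previous lessons, and __disable_irq()/__enable_irq() are the standard CMSIS intrinsics:

    /* miros.c -- round-robin scheduler that skips threads that are NOT ready (sketch) */
    /* OS_curr and OS_next are the current/next thread pointers from the previous lessons */
    void OS_sched(void) {
        if (OS_readySet == 0U) {      /* idle condition? (the most frequent case) */
            OS_currIdx = 0U;          /* index of the idle thread */
        }
        else {
            do {                      /* round-robin over the started threads... */
                ++OS_currIdx;
                if (OS_currIdx == OS_threadNum) {
                    OS_currIdx = 1U;  /* ...wrapping around to 1 to skip the idle thread */
                }
            } while ((OS_readySet & (1U << (OS_currIdx - 1U))) == 0U); /* NOT ready? keep going */
        }
        OS_next = OS_thread[OS_currIdx];

        if (OS_next != OS_curr) {     /* context switch needed? */
            *(uint32_t volatile *)0xE000ED04 = (1U << 28); /* pend PendSV, as in the previous lessons */
        }
    }

    /* miros.c -- the blocking delay */
    void OS_delay(uint32_t ticks) {
        __disable_irq();

        /* precondition: never call OS_delay() from the idle thread at index 0 */
        Q_REQUIRE(OS_curr != OS_thread[0]);

        OS_curr->timeout = ticks;                  /* arm this thread's down-counter */
        OS_readySet &= ~(1U << (OS_currIdx - 1U)); /* mark the thread NOT ready (Blocked) */
        OS_sched();                                /* switch the context away from this thread */
        __enable_irq();                            /* the switch takes effect here, via PendSV */
    }

Note that OS_delay() clears the thread's ready bit before calling OS_sched(), so the scheduler is guaranteed to switch away from the calling thread, either to another ready thread or to the idle thread.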
To quickly summarize the status so far, you have implemented the idle thread and you have added the Blocked state to the thread's lifecycle. You have also just implemented the transition to the Blocked state. This transition is unique in that it is the only one in the thread's lifecycle that is taken voluntarily by the Running thread itself. All other transitions are forced on the thread, not necessarily with its cooperation or consent. This includes the last transition you still need to implement, which is the OS_tick() service that takes a thread out of the Blocked state.

The implementation of OS_tick() needs to loop over all threads, except the idle thread obviously, and decrement all non-zero timeout counters. The threads corresponding to counters that have just reached zero need to be un-blocked. The code of OS_tick() will therefore consist of a for-loop over all threads, starting with index 1 to skip the idle thread. Inside the loop you first need to check the timeout counter of the corresponding thread, and if it is not zero, you need to decrement it. If the counter becomes zero, you need to make the thread ready to run by setting the corresponding bit in the OS_readySet bitmask. Please note that this implementation assumes that OS_tick() will be called from a single Interrupt Service Routine, such as SysTick_Handler(). In that case, you don't need to disable interrupts inside OS_tick(), because an ISR cannot be preempted by a thread, so the timeout counters cannot be changed unexpectedly, which could only happen in OS_delay(). Also, it is not necessary to call OS_sched() from OS_tick(), because the scheduler is called at the end of SysTick_Handler() anyway, to schedule any un-blocked threads.

The implementation of the blocking delay is almost ready now, but you still need to make some final adjustments. First, any started thread, except the idle thread, needs to be explicitly made ready to run by setting the corresponding bit in the OS_readySet bitmask. This corresponds to the transition into the Ready super-state in the thread's life cycle. Second, you should increment the MiROS RTOS version number to lesson 25, both in miros.c and in miros.h. Third, in miros.h, you need to add the prototypes for the newly added services: OS_delay() and OS_tick(). And finally, you need to actually use the new OS_delay() and OS_tick() instead of the previous polling BSP_delay() and BSP_tickCtr().

The code builds cleanly, so now I understand that you are quite keen to see how it works. Let's then open the debugger and at first run the code freely. What do you know, the code appears to be blinking all LEDs exactly as before. But now, it really works quite differently. When you break into the code, you find it in the OS_onIdle() callback. Indeed, when you set a breakpoint in the scheduler, you can see that the OS_readySet bitmask is zero, so the scheduler picks the idle thread. All this means that most of the time all threads are blocked, so the system spends most of its time in the idle condition. Only occasionally, a thread gets unblocked and scheduled to run.

All right, as you can see the idle thread has become quite important, because the CPU spends most of its time there. So, let's study it a bit more. Specifically, let's put some code into the OS_onIdle() callback. I'll use the technique from the previous lesson to toggle a pin up and down from the OS_onIdle() callback, so that this activity can be observed with a logic analyzer.
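Before we get to that, here is a minimal sketch of the OS_tick() loop and the final adjustments just described. The placement of the ready-bit line assumes that, at that point inside OSThread_start(), OS_threadNum still holds the index of the thread being started, and the SysTick_Handler() structure is assumed to follow the previous lessons:

    /* miros.c -- the clock-tick processing (sketch) */
    void OS_tick(void) {
        uint8_t n;
        for (n = 1U; n < OS_threadNum; ++n) {  /* loop over all threads, skipping the idle thread */
            if (OS_thread[n]->timeout != 0U) { /* is this thread delayed? */
                --OS_thread[n]->timeout;
                if (OS_thread[n]->timeout == 0U) {
                    OS_readySet |= (1U << (n - 1U)); /* delay elapsed: un-block the thread */
                }
            }
        }
    }

    /* miros.c -- addition inside OSThread_start(), right after the thread
       is registered in OS_thread[] (placement is an assumption): */
        if (OS_threadNum > 0U) {                        /* any thread except the idle thread? */
            OS_readySet |= (1U << (OS_threadNum - 1U)); /* make it ready to run */
        }

    /* bsp.c -- the system clock tick ISR now calls the new OS_tick() service */
    void SysTick_Handler(void) {
        OS_tick();       /* update all timeouts and un-block the expired threads */

        __disable_irq();
        OS_sched();      /* the scheduler runs at the end of the ISR, as before */
        __enable_irq();
    }

In main.c, the blinky threads then simply call OS_delay() with the desired number of clock ticks wherever they previously called the polling BSP_delay().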
As I'm running out of pins, I will reuse the RED LED from the Blinky3 thread. To avoid any conflicts around the RED LED, I will simply comment out the starting of the Blinky3 thread in main.c, so that this thread will stay in the Dormant state. When I build and load the code to the board now, the Blue and Green LEDs blink as before, but the Red LED only glows at a low intensity.

Now, let's connect logic analyzer signals D1 through D4 to pins PF1 through PF4, respectively. As you can see in the logic analyzer view, the signal D1, corresponding to the RED LED, keeps toggling rapidly, which means that the OS_onIdle() callback is constantly called. The signals for the Green and Blue LEDs change very slowly in comparison. The signal D4, which is switched in the SysTick ISR, is used here as the trigger. Most of the time the SysTick ISR takes about 3 microseconds, but occasionally it gets significantly longer.

To investigate this behavior, let's look through the accumulated traces. Indeed, here is an instance where SysTick gets longer. Interestingly, at the same time the Green LED changes from low to high and the idle thread activity gets delayed by over 5 microseconds. The explanation is quite simple. Most of the time, SysTick simply updates the timeout counters, but does not switch the context away from the idle thread. You can see this directly from the idle thread activity before and after the SysTick. Only occasionally, a thread other than idle gets unblocked and scheduled. After the SysTick, the unblocked thread runs and does its thing, such as toggling an LED in this case. After this, the thread calls OS_delay() and voluntarily relinquishes the CPU, which switches back to the idle thread.

Now, regarding the actual timing, the longest SysTick takes about 4.5 microseconds, which is three times longer than before blocking was introduced. This increased overhead is caused by the more complex OS_tick() and OS_sched() implementations. On the other hand, the context switch time remains about the same, below 2 microseconds. The thread blocking in OS_delay(), which involves marking the thread as NOT ready and scheduling another thread, takes about 3.8 microseconds. Finally, the longest time spent outside the idle thread is only 10 microseconds. In most other cases this time is even shorter.

With all this, I hope to have convinced you that indeed the CPU spends almost all of its time in the idle thread, except for a few microseconds here and there to execute all the other threads. But the idle thread still basically wastes all the CPU cycles it gets. So the question is: what can you do about it? Well, the best thing you can do is to put the CPU and its peripherals into a low-power sleep mode. Indeed, you will never achieve a truly low-power design with the CPU and peripherals running at full speed. So, for battery-operated applications you have to use sleep modes. The full discussion of the subject of low-power design goes far beyond the scope of this lesson, but the most important takeaway for you today is that the idle thread, and specifically the OS_onIdle() callback, is the ideal, central place in the code to put the CPU and peripherals into a low-power sleep mode. To this end, the ARM Cortex-M CPU provides a special instruction called WFI (Wait For Interrupt), which stops the CPU clock until an interrupt occurs. You can put it in your OS_onIdle() callback using the __WFI() CMSIS function and see what happens.
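A minimal sketch of the sleepy idle callback in bsp.c might look as follows; __WFI() is provided by the CMSIS header that the BSP already includes, and the pin toggling around it is the same test-pin technique as before:

    /* bsp.c -- idle callback that puts the CPU to sleep until the next interrupt */
    void OS_onIdle(void) {
        /* toggle the test pin here, as before, to visualize the idle activity */

        __WFI(); /* CMSIS intrinsic: stop the CPU clock and Wait For Interrupt */
    }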
When you load this code to the board, you can see that the Green and Blue LEDs blink exactly as before, but the Red LED seems not to be active at all. The situation becomes clearer in the logic analyzer, where you can see that the Red LED actually toggles, but very infrequently. In fact, it toggles exactly once after each interrupt. This is because now the CPU gets stopped inside each call to OS_onIdle(), and is only woken up by the next SysTick interrupt.

This concludes this lesson about efficient thread blocking and its profound implications. At this point, your MiROS RTOS implements a round-robin time-sharing scheduler with blocking, which corresponds to the state of the art of computer systems in the early 1960s. Here is an introduction to the time-sharing system at MIT from the year 1963. In the next lesson you'll bring the MiROS RTOS into the 1970s by implementing preemptive, priority-based scheduling.

If you like this channel, please subscribe to stay tuned. You can also visit state-machine.com/quickstart for the class notes and project file downloads.