Welcome to the Modern Embedded Systems Programming course. My name is Miro Samek and in this fifth lesson on RTOS I'll finally address the real-time aspect in the "Real-Time Operating System" name. Specifically, in this lesson you will augment the MiROS RTOS with a preemptive, priority-based scheduler, which can be mathematically proven to meet real-time deadlines under certain conditions. As usual, let's get started by making a copy of the previous lesson 25 directory and renaming it to lesson 26. Get inside the new lesson 26 directory and double-click on the uVision project "lesson" to open it. To remind you quickly what happened so far, in the last lesson you added efficient blocking delay to the MiROS RTOS. However, the RTOS is not yet worthy of its name, because the round-robin scheduler is not really "Real-Time" yet. So, let's start today by introducing the concept of "Real-Time". Until now, in all your code, any computation that was performed was considered equally useful, and you cared only whether the computation was correct, but not whether it was performed within a given time. "Real-Time" adds the timeliness requirement to the computations. Specifically, a computation that is performed too late (or too early, for that matter) is considered less useful, or even as harmful as an outright wrong computation. Here is a graphical representation of the "real-time" usefulness of a computation as a function of time. In the so-called "hard real-time systems", a computation is useful from the triggering event up to the deadline. After the deadline, the usefulness of the computation becomes "negative infinity", which means that the computation is worse than useless (for that, its usefulness would merely be zero). A missed deadline represents a system failure. For example, deploying an air-bag too late is not just useless--it is disastrous. But there are also "soft real-time systems", where timeliness is also important, but the deadline is not as firm. 
For example, a text message is expected to be delivered somewhat timely, say within 20 seconds. But it is still useful much longer, although its usefulness diminishes with time. In the following discussion, I will focus on hard real-time systems. From the historical perspective, mainstream use of computers in hard real-time systems started in the 1960's and 70's. For example, the Apollo program used two identical real-time computers: one in the command module and another in the lunar module. Here is a picture of Margaret Hamilton, who led much of this effort, with the stack of code listings showing how much real-time software was created during the 1960's and early 70's. Already in these early days, people realized that most real-time systems operate in a periodic fashion, meaning that the triggering events and the deadlines repeat with a certain period. For example, landing a spacecraft on the lunar surface required fine corrections to the rocket thrusters that the on-board computer had to perform every few milliseconds. Similarly, industrial processes require periodic control ranging from milliseconds to tens of seconds. Electronic control of combustion engines requires periodic control synchronized with the revolutions of the engine, and so on. To better understand how the real-time concepts apply to periodic threads, let's perform a few experiments with your current version of the MiROS RTOS. Specifically, you can increase the system clock tick rate to 1000 times per second, that is, one clock tick per millisecond. Next, you can modify your blinky1 and blinky2 threads so that they actually put some computational load on your CPU, because right now they only turn the LED on or off, which takes a microsecond, after which the thread blocks and relinquishes the CPU. In order to simulate a more realistic CPU loading, you can turn the LED on and off not once, but a few thousand times in a for-loop. 
This particular for-loop will take about 1.2 milliseconds to execute, which is intentionally slightly longer than your system clock tick. After the for-loop, the blinky1 thread delays for 1 clock tick, which will block until the very next SysTick interrupt. This means that the body of the while(1)-loop will repeat with the period of 2 milliseconds. At this point, let me introduce a few symbols, commonly used in the literature to characterize a periodic real-time thread, like your Blinky1: The time of computation of Thread1 is denoted as C1 and is equal to 1.2 milliseconds. The period of Thread1 is denoted T1 and is equal to 2 milliseconds. And finally, the processor utilization of Thread1 is denoted U1 and is the ratio between the time of computation and the period. This is the percentage of the time the CPU spends executing Thread1, and is 60 percent in this case. Similarly to Blinky1, you can modify the Blinky2 thread to simulate a CPU load that runs 3 times longer than Blinky1, that is for about 3.6 milliseconds. After the for-loop, the blinky2 thread will delay for 50 milliseconds, so that its total period will be about 54 milliseconds. In the diagram, the computation time of Blinky2, C2, will be 3.6 milliseconds, and its period, T2, will be 54 milliseconds. The CPU utilization of Blinky2, U2, will be 6.6 percent. One last thing before building the code is to comment out the Wait-For-Interrupt instruction in the OS_onIdle() callback. This will prevent the CPU from stopping and will allow you to see the toggling of the Red LED from the idle thread in the logic analyzer view. The code builds correctly, so let's load it into your TivaC LaunchPad board and see how it runs using a logic analyzer. The top trace, labeled ISR, corresponds to the SysTick_Handler, and it repeats every 1 millisecond. This is your fast 1000 times per second system clock tick. 
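To give you a feel for the modification, here is a minimal, host-runnable sketch of one pass through the modified blinky1 body. The BSP function names follow earlier lessons, and the iteration count of 1500 is only an assumption that would need to be tuned on the actual board to produce the 1.2 millisecond load:

```c
#include <stdint.h>

/* host-side stubs for the LED calls; on the board these drive GPIO */
static volatile uint32_t led_toggles;
static void BSP_ledGreenOn(void)  { ++led_toggles; }
static void BSP_ledGreenOff(void) { ++led_toggles; }

/* one pass of the modified blinky1 thread body: the for-loop simulates
** a CPU load of roughly 1.2 ms (the iteration count is board-dependent) */
void blinky1_once(void) {
    uint32_t volatile i;
    for (i = 1500U; i != 0U; --i) {
        BSP_ledGreenOn();   /* toggle the LED on... */
        BSP_ledGreenOff();  /* ...and off, thousands of times */
    }
    /* OS_delay(1U);  block until the next SysTick -- omitted on the host */
}
```

In the real thread, this body sits inside the while(1)-loop, so the load plus the one-tick delay repeats with the 2 millisecond period.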
The trace immediately below, labeled T1, corresponds to Blinky1 and most of the time it runs for about 1.2 milliseconds. It repeats every 2 milliseconds. The trace below, labeled T2, corresponds to Blinky2. This thread runs for about 3 to 4 milliseconds, not counting the gaps. T2 repeats every 55 milliseconds. Finally, the bottom trace, labeled IDL, corresponds to the idle thread, and it runs only when no other thread or ISR is running. But the most interesting part of the logic analyzer view is the time when Blinky1 and Blinky2 are both unblocked and are ready to run at the same time. When you zoom in, you can see that Blinky1 starts running, but is scheduled out on the next clock tick, at which point Blinky2 is scheduled to run. Blinky2 also runs only for one clock tick, and only then is Blinky1 scheduled in again to finish the processing, after which it blocks, so Blinky2 runs for the rest of the time slice. On the subsequent clock tick, Blinky1 becomes ready to run again, but notice that this is the THIRD tick from the previous activation, which means that Blinky1 misses its TWO millisecond deadline by one tick. This happens because the current scheduler inside the MiROS RTOS is still round-robin. This scheduler was designed for timesharing systems, where the most important goal was to share the CPU FAIRLY among all threads. Therefore, the scheduler takes the CPU away from Blinky1 after it exhausts its time slice. But this is NOT what you want in a hard real-time system. For example, suppose that Blinky1 represents an important thread that controls the lunar module's descent to the Moon's surface. It HAS to run every 2 milliseconds, and missing even one of these deadlines can cause a catastrophic system failure. In that case, you don't care about the fairness of CPU assignment. You care about meeting your 2 millisecond hard real-time deadline. 
For this you need a different scheduler, which somehow KNOWS about the importance of the threads, so that it can execute a more important thread before executing a less-important thread. Today, you will implement the simplest version of such a real-time scheduler, called a priority-based, preemptive scheduler with static priorities. This means that each thread will be assigned a unique PRIORITY number when it is started and this priority will not change thereafter. The job of the scheduler will be to always run the highest-priority thread that is ready-to-run. Let's see in a timing diagram how your Blinky threads would execute under such a priority-based scheduler and how the timing would differ from the current round-robin scheduler. As long as the thread T1 remains the only one ready-to-run, because T2 is blocked inside the OS_delay() function, the execution is identical under both schedulers. But as soon as T2 becomes ready-to-run, the round-robin scheduler suspends T1 and schedules T2. Already at this point T1 loses its chance to meet the 2 millisecond deadline. In contrast, the priority-based scheduler also sees that T2 is becoming ready-to-run, but T1 has a higher PRIORITY, so the scheduler chooses T1 to run. Therefore, T1 continues and easily meets its 2 millisecond deadline. The T2 thread is scheduled only after T1 voluntarily BLOCKS in the OS_delay() function. But as soon as T1 is unblocked again, the scheduler immediately switches back to T1, because it has higher PRIORITY than T2. The situation repeats until T2 eventually blocks as well. But interestingly, notice that even though T2 is significantly delayed by constant interruptions from T1, it too eventually completes and blocks way before its deadline of 54 milliseconds. This means that the priority-based scheduler executes T1 AND T2 in such a way that they BOTH meet their hard real-time deadlines. So now, let's actually implement the priority-based scheduler in the MiROS RTOS. 
The first thing is to bump up the version number and add the thread priority to the thread control block and to the OSThread_start() function. The priority number will be a small integer in the range from 0 to the number of supported threads, which is 32 in the MiROS RTOS, so it will comfortably fit in a uint8_t type. In the RTOS implementation, you also need to bump up the version number and add the priority parameter to the OSThread_start() function. Here, you also need to re-design the use of the OS_thread[] array. As you perhaps recall from the last lesson 25, the OS_thread[] array holds pointers to all thread objects that have been started. In the round-robin scheduler, the array was filled consecutively from index 0, which is reserved for the idle thread, up to the index OS_threadNum. For the priority-based scheduler, the difference will be that the indexes into the OS_thread[] array will be the thread PRIORITIES, and that these priorities don't need to be consecutive, meaning that there could be gaps in the OS_thread[] array. For example, the picture shows the idle thread at priority 0, Blinky2 at priority 2, Blinky1 at priority 5, and ThreadN at priority N. This means that you replace the OS_threadNum index with the priority number passed as the function parameter. Also, you need to save the user-specified priority into the thread control block. However, to avoid overbooking of the OS_thread[] array, you must now make sure not only that the priority level is in range, but also that it is not already used, meaning that OS_thread at the prio index is still a zero-pointer. You can move this assertion to the top of the function and make it into a precondition that you code with the Q_REQUIRE() macro introduced in the last lesson. Precondition means that it must be satisfied by the CALLER of the function, not by the function itself. For example, it is the job of the application programmer to assign a unique priority to each thread. 
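Here is a host-compilable sketch of these changes. Q_REQUIRE is modeled with the standard assert, and the control block is reduced to a few members; the real MiROS code differs in detail:

```c
#include <assert.h>
#include <stdint.h>

#define Q_REQUIRE(test_) assert(test_)  /* stand-in for the MiROS assertion */

typedef struct {
    void *sp;          /* stack pointer */
    uint32_t timeout;  /* delay-timeout counter */
    uint8_t prio;      /* thread priority, NEW in this lesson */
} OSThread;

/* indexed by priority; entry [0] is reserved for the idle thread */
static OSThread *OS_thread[32 + 1];

void OSThread_start(OSThread *me, uint8_t prio /* handler, stack... omitted */) {
    /* precondition: the priority must be in range and still unused */
    Q_REQUIRE((prio < sizeof(OS_thread)/sizeof(OS_thread[0]))
              && (OS_thread[prio] == (OSThread *)0));
    me->prio = prio;       /* save the priority in the control block */
    OS_thread[prio] = me;  /* register the thread at its priority slot */
}
```

Starting two threads at the same priority would now trip the precondition, which is exactly the overbooking protection described above.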
Now, let's think for a minute about the use of the OS_readySet bitmask. For the priority-based scheduling, the most interesting information will be to find the highest-priority thread that is ready-to-run based on the current value of the bitmask. At this point, you need to decide about your preferred priority numbering scheme. For historical reasons, many RTOSes you might encounter, such as NUCLEUS, ThreadX, MicroC/OS, embOS, and others, use the INVERSE priority numbering scheme, where priority 0 corresponds to the highest-priority thread and higher priority numbers correspond to LOWER priority threads. Needless to say, the inverse priority numbering scheme leads to constant confusion when talking about higher and lower thread priorities. But it is of course possible to use a simple, DIRECT priority numbering scheme, where priority 0 corresponds to the idle thread and higher priority numbers correspond to threads of higher priority. The MiROS RTOS will use this simple, DIRECT priority numbering convention. The basic principle of finding the priority number of the highest-priority thread that is ready-to-run will be to count the leading zeros in the OS_readySet bitmask up to the first one-bit and to subtract that count from the total number of bits, which is 32. For example, if the Blinky2 thread is the only one ready-to-run, the OS_readySet bitmask would have all zero bits except bit number 1. In this case, the count of leading zeros is 30, which converts to the priority number of 2 by subtracting it from the total number of bits, 32. If the Blinky1 thread with priority 5 is ready-to-run, then the OS_readySet bitmask will have 27 leading zeros up to the first one-bit, which converts to the priority number of 5 by subtracting it from the total number of bits, 32. 
In the general case of a thread with priority N being ready-to-run, the count of leading zeros will be 32 minus N, so that the priority number works out again as 32 minus the count of leading zeros in the OS_readySet bitmask. The formula works even for the idle condition of the system, where all 32 bits in the OS_readySet bitmask are zero, which results in priority of zero, that is, the priority of the idle thread. Mathematically, the formula 32-CLZ(x), which finds the most-significant one-bit in the number x, is an integer approximation of the log-base-2 function. And that's why I will code it as the LOG2() macro in the miros.c implementation file. Of course, the algorithm is only as fast as the Count-Leading-Zeros operation, but it turns out that your ARM Cortex-M4 processor supports it in hardware, as the CLZ instruction. You can actually find this instruction in the datasheet of your TivaC microcontroller. To take advantage of the CLZ instruction, you can search the compiler help for CLZ. As you can see, this particular compiler supports it through the intrinsic function __clz(). With the very efficient LOG2() operation, you can now implement the priority-based scheduler. As before, when the scheduler detects the idle condition of the system, it needs to choose the idle thread. But now, you no longer use the OS_currIdx variable, so you just directly set OS_next to OS_thread of zero, which is the idle thread. Otherwise, if some threads are ready-to-run, you use the LOG2() macro with the OS_readySet argument to find the highest-priority thread that is ready-to-run. Please note that the LOG2() macro is guaranteed to produce a number between zero and 32, inclusive, so it can be used directly as an index into the OS_thread[] array without range checking the index. At this point, it is also a good idea to assert that the OS_next pointer is not zero. The final change is the re-design of the OS_tick() and OS_delay() services. 
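On a desktop compiler, you can model the LOG2() operation with the GCC/Clang builtin __builtin_clz(), which corresponds to the Keil __clz() intrinsic used in the lesson. Note that the ARM CLZ instruction returns 32 for a zero argument, whereas the builtin is undefined for zero, hence the explicit guard:

```c
#include <stdint.h>

/* 32 - CLZ(x) finds the position of the most-significant 1-bit in x,
** an integer version of log-base-2. The Keil compiler provides the
** __clz() intrinsic; __builtin_clz() is the GCC/Clang equivalent,
** which is undefined for 0, so that case is guarded explicitly. */
static inline uint32_t LOG2(uint32_t x) {
    return (x == 0U) ? 0U : (32U - (uint32_t)__builtin_clz(x));
}

/* the scheduler core then reduces to a single array lookup:
**   OS_next = OS_thread[LOG2(OS_readySet)];
** where LOG2(0) == 0 selects the idle thread */
```

For example, with only Blinky2 (priority 2) ready, OS_readySet has just bit number 1 set and LOG2() yields 2; with Blinky1 (priority 5) ready, bit number 4 is set and LOG2() yields 5.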
This is necessary, because both implementations currently assume that the OS_thread[] array is populated consecutively up to the OS_threadNum level, which is no longer the case. In OS_tick(), instead of iterating over the whole OS_thread[] array with potentially big gaps of unused priority levels, you can leverage the fast LOG2() operation. Specifically, you can introduce a bitmask of delayed threads, which is similar to the OS_readySet, except it holds the delayed threads. While you are at it, you might also remove the no longer needed variables OS_currIdx and OS_threadNum. With the OS_delayedSet bitmask in place, the OS_tick() function will iterate only over the 1-bits in the bitmask, instead of scanning all the bits. But because you will need to remove the already processed bits from the bitmask, you need to use a temporary bitmask 'workingSet'. You loop as long as the workingSet has some bits set, meaning that some threads are delayed and need processing at this clock tick. You first quickly obtain the number of the highest-order 1-bit in the workingSet using your fast LOG2() macro, and you use it to index into the OS_thread[] array. You save the obtained thread pointer in a temporary variable t. You then assert that the t pointer is not zero, meaning that the thread is in use. And you also assert that the timeout of this thread is not zero, because it must be a delayed thread. Next, you decrement the timeout counter for this thread and if it becomes zero, you make the thread ready to run by setting the corresponding priority bit in the OS_readySet bitmask. At the same time, you remove the same bit from the OS_delayedSet bitmask, because this thread is no longer delayed. Finally, you always remove the same priority bit from the workingSet, because it is now processed. To avoid the apparent code duplication, you can introduce a temporary variable 'bit', like this. 
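The tick processing just described can be sketched as follows. This is a simplified, host-runnable model: the real OS_tick() works on the OSThread objects and runs inside a critical section, and the timeout counters live in the thread control blocks rather than in a flat array, but the bit-manipulation logic is the same:

```c
#include <assert.h>
#include <stdint.h>

static uint32_t OS_readySet;    /* bit (n-1) set: thread of priority n is ready   */
static uint32_t OS_delayedSet;  /* bit (n-1) set: thread of priority n is delayed */
static uint32_t OS_timeout[33]; /* timeout counters, indexed by priority (model)  */

static inline uint32_t LOG2(uint32_t x) {
    return (x == 0U) ? 0U : (32U - (uint32_t)__builtin_clz(x));
}

void OS_tick(void) {
    uint32_t workingSet = OS_delayedSet;  /* copy: bits are consumed below */
    while (workingSet != 0U) {
        uint32_t p   = LOG2(workingSet);  /* highest delayed priority */
        uint32_t bit = (1U << (p - 1U));
        assert(OS_timeout[p] != 0U);      /* it must really be delayed */
        if (--OS_timeout[p] == 0U) {
            OS_readySet   |= bit;   /* the thread becomes ready-to-run... */
            OS_delayedSet &= ~bit;  /* ...and is no longer delayed */
        }
        workingSet &= ~bit;  /* this bit has now been processed */
    }
}
```

The loop touches exactly as many bits as there are delayed threads, instead of scanning all 32 priority levels on every clock tick.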
In the OS_delay() function, you need to replace the OS_currIdx variable with the priority number of the current thread. In addition to removing the priority-bit from the ready set, the function now also needs to add the same bit to the OS_delayedSet, because this thread is now becoming delayed. And finally, to avoid code duplication, you can introduce a temporary variable 'bit', just like in OS_tick() before. The priority-based scheduler implementation is ready now, so let's try to build the code. The project fails to compile, because the signature of the OSThread_start() function has changed. With the priority-based scheduler, each thread you start needs a unique priority to run at. To remain consistent with the previously used diagrams, let's assign priority 5 to Blinky1 and priority 2 to Blinky2. As you can see, the priorities don't need to be consecutive. What really matters is that they are unique and that the priority of Blinky1 is higher than the priority of Blinky2. There is still one more compilation error. You now need to add explicit priority zero when starting the idle thread. The code builds cleanly, so let's see how it works. First, I'm sure you are curious about the code generated by the LOG2() macro, so let's set a breakpoint inside the OS_sched() function, where the macro is used. As you can see, the disassembly starts with loading addresses of the OS_thread array and the OS_readySet bitmask and then the whole calculation of the highest-priority thread ready to run is accomplished with just two instructions. The CLZ instruction counts the number of leading zeros in the bitmask and stores it in r0 in just one CPU cycle. Similarly, the RSB instruction--reverse subtract--converts it to the priority-number in r0, also in just one CPU cycle. The resulting priority in this case turns out to be 5, which you should recognize as the priority of Blinky1. Indeed, the OS_next variable is set to the address of the Blinky1 thread. 
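For completeness, the OS_delay() change described above can be modeled on the host in the same style. OS_curr, OS_sched(), and the interrupt locking of the real code are omitted, and the flat timeout array only stands in for the per-thread counters:

```c
#include <stdint.h>

static uint32_t OS_readySet;    /* bit (n-1) set: thread of priority n is ready   */
static uint32_t OS_delayedSet;  /* bit (n-1) set: thread of priority n is delayed */
static uint32_t OS_timeout[33]; /* per-priority timeout counters (host model)     */

/* model of OS_delay() called from the thread with priority curr_prio */
void OS_delay_model(uint32_t ticks, uint8_t curr_prio) {
    uint32_t bit = (1U << (curr_prio - 1U));
    OS_timeout[curr_prio] = ticks;  /* arm the timeout counter */
    OS_readySet   &= ~bit;  /* the current thread is no longer ready... */
    OS_delayedSet |=  bit;  /* ...it is now delayed */
    /* OS_sched();  in the real code: switch away with interrupts locked */
}
```

The single 'bit' variable serves both the ready-set removal and the delayed-set addition, which is the code-duplication fix mentioned above.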
When you finally let the code run free, you can watch some activity of the LEDs on your board. But to really see how your new priority-based scheduler works, you need to use a logic analyzer. As you can see, the Blinky1 thread now runs completely undisturbed even when Blinky2 becomes ready to run. Blinky1 always meets its deadline and so does Blinky2, even though Blinky2 is preempted by Blinky1 several times. Finally, the idle thread runs as well, but only when none of the other threads or interrupts are active. It's nice to notice that this analyzer trace matches exactly the timeline of the priority-based scheduler you've designed earlier and subsequently implemented in the MiROS RTOS. At this point, I would like you to notice the absolutely critical importance of BLOCKING for the priority-based scheduler. A high-priority thread runs for as long as it wants and no thread of lower priority can run until the high-priority thread BLOCKS and voluntarily relinquishes the CPU. Without the blocking, NO lower-priority thread will EVER run. For example, Blinky2 and the Idle thread run only when Blinky1 blocks, and similarly, the Idle thread runs only when BOTH Blinky1 and Blinky2 are blocked. This is very different from the round-robin scheduler, which executed all your blinky threads in succession even when they didn't block at all. So now you know why I had to implement efficient thread blocking in the last lesson 25 before introducing real-time priority-based scheduling in this lesson. With the code working as expected, let's now focus on the most important decision you face when working with the priority-based scheduler, and that is, on how to assign priorities to your threads. With just two threads, like your Blinky1 and Blinky2, you have only two possibilities: priority of Blinky1 higher than Blinky2 and priority of Blinky1 lower than Blinky2. As you just saw, the first option meets both real-time deadlines. 
On the other hand, it is easy to see that the second possibility of assigning lower priority to Blinky1 than Blinky2 leads to Blinky1 missing its deadline as soon as Blinky2 becomes ready to run. So, the rule that emerges here is to assign higher priorities to threads with shorter periods, which means also shorter deadlines. It turns out that this rule was already discovered in the 1970s. Specifically, in 1973 C. L. Liu and James W. Layland published a seminal paper titled "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment". I provide a web link to this article in the video description. But basically, Liu and Layland described in this article that the simple static priority-based scheduler that you've just implemented can be mathematically proven to meet all hard real-time deadlines for all threads under certain conditions. The method was later generalized and called "Rate-Monotonic Analysis" (RMA) or "Rate-Monotonic Scheduling" (RMS), and has been very well explained in another article, "Rate Monotonic Analysis for Real-Time Systems", to which I also provide a web link in the video description. I bring it up here, because no discussion of a priority-based scheduler can be complete without mentioning the RMA slash RMS method. The term rate monotonic (RM) derives from a method of assigning priorities to a set of threads as a monotonic function of the rate of a (periodic) thread. A function is called "monotonic" in mathematics when it preserves (or reverses) the order between two ordered sets. Specifically, RMA refers to priority assignment that maps increasing order of thread rates to increasing order of thread priorities. As such, it is just a fancy name for the simple rule you have discovered yourself. But there is of course more to RMA than this, and the full discussion would be really out of scope for this short lesson. 
For today, let me just summarize the most important guidelines of the RMA/RMS method: First, always assign thread priorities monotonically, meaning that threads with higher rates must run at higher priorities than threads with lower rates. Second, you need to know the CPU utilization of each thread, which you calculate as the ratio of the measured execution time Cn to the period Tn. And third, you need to calculate the total CPU utilization as the sum of all individual CPU utilization factors. If this total utilization is below the theoretical bound, all threads in the set are guaranteed to meet their deadlines. This was already proven mathematically by Liu and Layland back in the 1973 paper. The utilization bound U(n) depends on the number of threads n. For a large number of threads, U(n) approaches the natural logarithm of 2, which is just under 0.7. So, in practice, if you stay below 70% of CPU utilization, your set of threads will be schedulable, meaning that they will all meet their deadlines. For example, the total CPU utilization for your Blinky1 and Blinky2 threads is as follows: 1.2ms/2ms for Blinky1 plus 3.6ms/54ms for Blinky2. The total CPU utilization is thus about 0.67, which is below the theoretical bound. Of course, the basic RMA assumes periodic threads that execute in constant time. But the method can be extended to aperiodic threads with variable execution times. In that case, you need to consider the worst case, that is, the shortest time between thread activations and the longest execution time. Also, in practice typically only a few highest-priority threads have hard real-time deadlines, while the other threads have only soft real-time requirements. In that case, you use RMA for the hard real-time threads and prioritize all soft real-time threads lower. 
The beauty of preemptive priority-based scheduling is that a high-priority thread can always immediately preempt all lower-priority threads, so a high-priority thread is insensitive to changes in the execution times or periods of lower-priority threads. In other words, the preemptive, priority-based scheduler decouples threads in the time domain. For all these reasons, the preemptive, priority-based scheduler became the norm and is supported in most Real-Time Operating Systems to this day. This concludes this lesson on real-time computing and priority-based scheduling. In the next lesson, you'll advance by another decade from the 1970's to the 1980's, when commercial RTOSes went mainstream, and when inter-thread synchronization and communication mechanisms were added. If you like this channel, please subscribe to stay tuned. You can also visit state-machine.com/quickstart for the class notes and project file downloads.