Learn how a developer used Tracealyzer to solve an issue with a randomly occurring reset.
With time-to-market pressures constantly on the rise, advanced visualization support is a necessity nowadays. For those who may be unfamiliar with Percepio, the company set out to accelerate embedded software development through world-leading RTOS tracing tools. Tracealyzer gives Makers, engineers and developers alike a new level of insight into the run-time world, allowing for improved designs, faster troubleshooting and higher performance. What has made it such a popular choice among the community is that it works with a wide range of operating systems, including FreeRTOS and Linux, among several others.
When developing advanced multi-threaded software systems, a traditional debugger is often insufficient for understanding the behavior of the integrated system, especially regarding timing issues. Tracealyzer is able to visualize the run-time behavior through more than 20 innovative views that complement the debugger perspective. These views are interconnected in intuitive ways, which makes the visualization system powerful and easy to navigate. Beyond that, it seamlessly integrates with Atmel Studio 6.2, providing deep insight into the run-time behavior of embedded software through advanced trace visualization.
Over the next couple of months, we will be sharing step-by-step tutorials from the Percepio team, collected directly from actual user experiences with Tracealyzer. Last week, we explored how to use the tools to solve an unexpected timing issue; today, we’re delving deeper into the mysterious random watchdog reset.
In this scenario, a developer had an issue with a randomly occurring reset. By placing a breakpoint in the reset exception handler, he discovered that the watchdog timer had expired. The watchdog timer was supposed to be reset by a high priority task that executed periodically; however, that was clearly not happening.
The ability to insert custom User Events comes in handy here. They are similar to a classic “printf()” call; in this case, events were added when the watchdog timer was reset and when it expired. User events also support data arguments, which were used to log the timer value (just before resetting it) to see the watchdog “margin,” i.e. the remaining time. The result can be seen below, in the yellow text labels.
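The article does not show the instrumentation itself, but a minimal sketch of this kind of user-event logging could look like the following. It assumes the xTraceRegisterString()/vTracePrintF() style of Percepio's recorder API (the exact calls differ between recorder versions), and the watchdog HAL functions read_watchdog_counter() and wdt_restart() are illustrative placeholders.

```c
#include <stdint.h>
#include "trcRecorder.h"

/* Hypothetical watchdog HAL calls, named here only for illustration. */
extern uint32_t read_watchdog_counter(void);
extern void wdt_restart(void);

/* User-event channel; shows up as yellow labels in Tracealyzer. */
static traceString wdtChannel;

void watchdog_trace_init(void)
{
    wdtChannel = xTraceRegisterString("Watchdog");
}

void watchdog_kick(void)
{
    /* Log the remaining time ("margin") just before resetting the timer. */
    uint32_t margin = read_watchdog_counter();
    vTracePrintF(wdtChannel, "Watchdog reset, margin: %d", margin);
    wdt_restart();
}

void watchdog_expired_isr(void)
{
    vTracePrint(wdtChannel, "Watchdog reset!");
}
```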
From this, we can see that the SamplerTask is running, but that it does not clear the watchdog timer in its last execution, so the watchdog expires and resets the system shortly afterwards (“Watchdog reset!”). So why didn’t SamplerTask reset the watchdog timer, you ask? Let’s enable kernel service call events to see what the task was doing.
The last event of SamplerTask is a call to xQueueSend, a kernel function that puts a message in a message queue. Note that the label is red, meaning that the xQueueSend call blocked the task, which caused a context-switch to ServerTask before the watchdog timer had been reset. This, of course, caused it to expire and reset the system.
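The article does not include SamplerTask’s source, but its loop plausibly has the shape sketched below; the queue name, period, block time, payload type and read_sensor() helper are all assumptions. The key point is the ordering: the potentially blocking xQueueSend() call sits before the watchdog kick, so a full queue delays the kick past its deadline.

```c
#include <stdint.h>
#include "FreeRTOS.h"
#include "task.h"
#include "queue.h"

typedef struct { uint32_t value; } Sample_t;    /* placeholder payload type */

extern QueueHandle_t xControlQueue;             /* the "ControlQueue" seen in the trace */
extern Sample_t read_sensor(void);              /* hypothetical driver call */
extern void watchdog_kick(void);                /* clears the watchdog (see sketch above) */

void SamplerTask(void *pvParameters)
{
    TickType_t lastWake = xTaskGetTickCount();
    (void)pvParameters;

    for (;;)
    {
        vTaskDelayUntil(&lastWake, pdMS_TO_TICKS(10));   /* periodic execution */

        Sample_t sample = read_sensor();

        /* If ControlQueue is full, this call blocks (the red label in the
         * trace), and the watchdog kick below never happens in time. */
        xQueueSend(xControlQueue, &sample, pdMS_TO_TICKS(100));

        watchdog_kick();
    }
}
```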
Why was xQueueSend blocking the task? By double-clicking on this event label, we open the Object History View, showing all operations on this particular queue, “ControlQueue” — as illustrated below.
The rightmost column shows a visualization of the buffered messages. There, we can see that the message queue already contains five messages and is probably full, hence the blocking. But ControlTask is supposed to read the queue and make room, so why didn’t that work as expected?
To investigate this a bit further, it would be interesting to see how the watchdog margin varies over time. This information is already in the user event log, and by employing the User Event Signal Plot, we can plot the watchdog margin over time. Aside from that, by adding a CPU Load Graph on the same timeline, we can observe how the task execution affects the watchdog margin — as shown below.
In the CPU Load Graph, we can note that ServerTask is executing a lot in the second half of the trace, and this seems to impact the watchdog margin. ServerTask (bright green) has higher priority than ControlTask (dark green), so when it executes a lot toward the end of the trace, ControlTask gets less CPU time. This is an inherent effect of fixed-priority scheduling, which is used by most RTOSes. Most likely, this explains the full message queue: ControlTask cannot read messages quickly enough when the higher priority ServerTask is using most of the CPU time. This is an example of a priority inversion problem, as the SamplerTask is effectively blocked by an unrelated task of lower priority. A solution could be to change the scheduling priorities so that ControlTask gets higher priority than ServerTask. Let’s try that and visualize how it would look.
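The fix itself is not shown in the article; in FreeRTOS, one straightforward way to realize it is to swap the priorities passed at task creation, as in this sketch. The task names come from the trace, but the stack sizes and concrete priority values are assumptions — only the relative order matters.

```c
#include "FreeRTOS.h"
#include "task.h"

extern void SamplerTask(void *pvParameters);
extern void ControlTask(void *pvParameters);
extern void ServerTask(void *pvParameters);

/* Assumed priority values; the point is ControlTask > ServerTask. */
#define PRIO_SAMPLER   (tskIDLE_PRIORITY + 4)   /* highest: it must kick the watchdog */
#define PRIO_CONTROL   (tskIDLE_PRIORITY + 3)   /* raised above ServerTask */
#define PRIO_SERVER    (tskIDLE_PRIORITY + 2)   /* lowered below ControlTask */

void create_tasks(void)
{
    xTaskCreate(SamplerTask, "SamplerTask", configMINIMAL_STACK_SIZE * 2, NULL, PRIO_SAMPLER, NULL);
    xTaskCreate(ControlTask, "ControlTask", configMINIMAL_STACK_SIZE * 2, NULL, PRIO_CONTROL, NULL);
    xTaskCreate(ServerTask,  "ServerTask",  configMINIMAL_STACK_SIZE * 2, NULL, PRIO_SERVER,  NULL);
}
```

The same change could also be applied at run time with vTaskPrioritySet(), but setting the priorities correctly at creation keeps the design easier to reason about.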
The screenshot above reveals the result of switching the task scheduling priorities between ServerTask and ControlTask. The system now shows a much more stable behavior. The CPU load of SamplerTask (highlighted in red) is quite steady around 20%, indicating a stable periodic behavior, while the watchdog margin is a perfect “line” — always at 10 ms. It does not expire anymore… problem solved! (Note that the task colors have changed due to the change in relative priority levels.)
Tracealyzer provides over 20 interactive views of the run-time world, connected in clever ways. It is available for several leading real-time operating systems as well as for Linux, and more information can be found on Percepio’s website.