“It’s not a feature, it’s a bug”


Embedded systems no longer need to be a ‘black box’ that leaves engineers guessing what may be happening, Percepio AB CEO Dr. Johan Kraft explains his latest guest blog post.


Anyone involved with software development will have most likely heard (and perhaps even said) the phrase “it’s not a bug, it’s a feature” at some point, and while its origins remain a mystery, its sentiment is clear — it’s a bug that we haven’t seen before.

connected_views

Intermittent ‘features’ in an embedded system can originate in either the software or hardware domain, often only evident when certain conditions collide in both. In the hardware domains, the timings involved may be parts of a nano second and where the logic is accessible, such as an address line or data bus — there exist instruments that can operate at high sample rates, allowing engineers to visualize and verify such ‘glitches.’ In the software domain, this becomes much more challenging.

Sequential Processing

While parallel processing is being rapidly adopted across all applications, single-processor systems remain common in embedded systems, thanks partly to the continued increases in the performance of microcontroller cores. Embedded MCUs are now capable of executing a range of increasingly sophisticated Real-Time Operating Systems (RTOS), often including the ability to run various communication protocols for both wired and wireless interfaces.

Whether in a single- or multi-processing system, combining these tasks with the embedded system’s main application, written by the engineering team, can make embedded software builds large, complex and difficult to fault-find, particularly when visibility into the code’s execution is limited. It can also lead to the dreaded intermittent fault which, if part of the system’s operation is ‘hidden’, can make solving them even more challenging.

A typical example may be an unexplained delay in a scheduled task. Of course, an RTOS is intended to guarantee specific tasks happen at specific times but this can be dependent on the task’s priority and what else may be happening at any time. In one real-world example, where a sensor needed to be sampled every 5ms, it was found that occasionally the delay between samples reached 6.5ms, with no simple explanation as to the cause. In another example, a customer reported that their system exhibited random resets; the suspected cause was that the watchdog was expiring before it was serviced, but how could this be checked? In yet another example, a system running a TCP/IP stack showed slower response times to network requests after minor changes in the code, for no obvious reason.

These are typical examples of how embedded systems running complex software can behave in unforeseen ways, leaving engineering teams speculating on the causes and attempting to solve the problems with only empirical results from which to assess their efforts. In the case of intermittent faults or system performance fluctuations, this is clearly an inefficient and unreliable development method.

Trace Tools

The use of logging software embedded in a build in order to record certain actions isn’t new, of course, and it can offer a significantly improved level of visibility into a system. However, while the data generated by such trace software is undoubtedly valuable, exploiting that value isn’t always simple.

Analyzing trace data and visually rendering it in various ways is the key function of Percepio’s Tracealyzer tools. It offers visualization at many levels, ranging from an event list to high-level dependency graphs and advanced statistics.

Over 20 different graphical views are provided, showing different aspects of the software’s execution that are unavailable with debuggers alone, and as such it complements existing software debug tools in a way that is becoming essential in today’s complex embedded systems. It supports an increasing range of target operating systems.

Figure 1(a): It appears that the ControlTask may be disabling interrupts.

Figure 1(a): It appears that the ControlTask may be disabling interrupts.

The main view in Tracealyzer, as shown in Figure 1(a) and 1(b), is a vertical timeline visualizing the execution of tasks/threads and interrupts. Other logged events, such as system calls, are displayed as annotations in this timeline, using horizontal colour-coded text labels. Several other timeline views are provided using horizontal orientation and all horizontal views can be combined on a common horizontal timeline. While much important data is created by the operating system’s kernel, developers can also extend the tracing with User Events, which allow any event or data in a user’s application to be logged. They are logged similar to calling the classic ‘printf’ C library function but are much faster as the actual formatting is handled in the host-side application, and can therefore also be used in time-critical code such as interrupt handlers. And, of course, they can also be correlated with other kernel-based events.

Figure 1(b): By changing the way ControlTask protects a critical section, SamplerTask is able to run as intended.

Figure 1(b): By changing the way ControlTask protects a critical section, SamplerTask is able to run as intended.

Tracealyzer understands the general meaning of many kernel calls, for instance locking a Mutex or writing to a message queue. This allows Tracealyzer to perform deep analysis to connect related events and visualize dependencies, e.g., which tasks communicate (see the communication flow graph, shown in Figure 3). This allows developers to quickly understand what’s really going on inside their system.

Insights

Returning to the first example, where a scheduled task was being inexplicably delayed intermittently, Tracealyzer was used to graphically show the task in question, time-correlated with other tasks. By invoking an exploded view of the task of interest, it was found that a lower priority task was incorrectly blocking the primary task from executing. It was discovered that the second task was disabling interrupts to protect a critical section unrelated to the primary task, which blocked the operating system scheduling. After changing the second task to using a Mutex instead, the primary task was able to meet its timing requirements. Figure 1(a) shows the SamplerTask being delayed by the (lower priority) ControlTask before the bug fix; Figure 1(b) confirms that SamplerTask is now occurring every 5ms as intended.

In the second example, User Events were used to not only record when the Watchdog was reset or when it expired, but also to log the remaining Watchdog timer value, thereby showing the time left in the Watchdog timer when it is reset. By inspecting the logged system calls it was found that the task in question did not only reset the Watchdog timer; it also posted a message to another task using a (fixed-size) message queue. The Watchdog resets seemed to occur while the Watchdog task was blocked by this message posting. Once realised, the question then became ‘why’. By visually exploring the operations on this message queue using the Kernel Object History view, it became clear that the message queue sometimes becomes full, as suspected. By correlating a view of the CPU load against how the Watchdog timer margin varied over time, as shown in Figure 2, it was found that Fixed Priority Scheduling was allowing a medium-priority task (ServerTask) to use so much CPU time that the message queue wasn’t always being read. Instead, it became full, leading to a Watchdog reset. The solution was in this case to modify the task priorities.

Figure 2: The CPU Load graph, correlated to the Watchdog Timer User Event, gives valuable insights.

Figure 2: The CPU Load graph, correlated to the Watchdog Timer User Event, gives valuable insights.

In the last example, where a software modification caused increased response time to network requests, using the Communications Flow view (Figure 3) it was found that one particular task — Logger — was receiving frequent but single messages with diagnostics data to be written to a device file system, each causing a context switch. By modifying the task priorities, the messages were instead buffered until the network request had finished and thereafter handled in a batch. This way, the number of context-switches during the handling of network requests was drastically reduced, thereby improving overall system responsiveness.

Figure 3: The Communication Flow reveals 5 tasks sending messages to Logger.

Figure 3: The Communication Flow reveals 5 tasks sending messages to Logger.

Conclusion

The complexity of embedded software is increasing rapidly, creating demand for improved development tools. While runtime data can be recorded in various ways, understanding its meaning isn’t a simple process, but through the use of innovative data visualization tools such as Tracealyzer it can be.

Many companies have already benefited from the many ways of using the tool to really discover what’s going on in the runtime system. Some Tracealyzer users even include it in production code, allowing them to gather invaluable data about real systems running in the field.

Embedded systems need no longer be a ‘black box,’ leaving engineers to suppose what may be happening; powerful visualization tools now turn that black box into an open box.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s