Full Stack Tracing Part 1 What is it and when is it useful?
Full stack tracing is a tool that should be part of every software engineer’s toolkit. It’s the best way to investigate and solve certain classes of hard problems in optimization and debugging. Because of the power and capability it gives the developer, we’ll be writing a series of blogs about it: when to use it, how to get it set up, how to create traces, and how to interpret results. Our goal is to get you capable enough to use full stack tracing to solve your tough problems too.
When do you need full stack tracing? Since it gives you visibility across a running system, it’s a great solution anytime you need to do things like:
- Gain visibility into the software running below the level of your application and API calls
- Measure timing and synchronization of events that happen across multiple applications
- Optimize application startup time, locating startup script inefficiencies, and trimming down system boot time
- Verify communication between a client and server process, or any other process pairs like publisher and subscriber, or application and watchdog
- Detect background processes that interfere with an application
- Pinpoint places in the system that consume excessive CPU cycles
- Find problematic interactions between multiple device drivers or independently running software stacks
- Correctly understand the behavior of a complex system before you make bug fixes or alterations
Battling slow boot time
To put all of this in context, let’s walk through an example where we used full stack tracing to solve a show-stopping problem for a client. In this case, it was an automotive customer creating an infotainment system with a rather complex software stack. The main application was created in Qt and ran on a Nvidia Tegra X1, which is an extremely powerful processor for an embedded system. The GPU rivals that of laptops with a quad core CPU. Because the software required integration from many partners, developers unfortunately realized very late during integration that despite all of the power, the infotainment system was taking over five seconds to show the application’s first screen. This was completely unacceptable, and we were asked to diagnose why it was taking so long.
That’s a question that the QML profiler can’t answer. While you can make many educated guesses to figure out exactly what’s happening, it is this type of problem where full stack tracing is invaluable. It can tell you what Qt, C++, kernel, driver, or hardware interactions are happening underneath the QML so you can see where the time is being spent.
By turning on full stack tracing for the kernel, syscalls, and C++, adding tracepoints within Qt, and rerunning the application, we were able to find that those long bindings were occurring because many libraries were being loaded for image plug-in formats. Since the application only used PNG files, we were able to eliminate a huge number of these unnecessary loads for other file plug-ins. That included the SVG plugin, which was especially heavy. Although the application was a Qt Quick app that didn’t use Qt Widgets, the SVG plugin was pulling in the relatively big Qt Widgets library, which was unneeded. (This is because the plugin was compiled with QtWidgets support enabled.) But while eliminating the plug-ins made a dramatic improvement, it still wasn’t fast enough, and it was taking far longer to boot than it should.
Again, full stack tracing was able to show us that the disk I/O was a huge culprit and many applications were hammering the disk during startup. However, most of those applications accessing the disk were background tasks that weren’t critical to the startup procedure. Furthermore, most of the disk I/O was coming through a single call within syslog: the processes were emitting reams of debug output, going through syslog to write the data out to the flash drive. The CPU and GPU had more than enough horsepower to manage this, but the on-board flash was really cheap…and pretty slow.
While it might have been better to remove unnecessary logging from those background tasks, that change would have required a lot of coordination and testing because many distinct software suppliers were involved. That would have introduced a good deal of risk to a project that was already perilously close to the production deadline. Instead, we settled for bumping up the main application’s priority. The increased priority let the GUI beat out those background tasks during boot up. The background tasks were able to harmlessly finish their work slightly later. This change was now able to bring the splash screen up within one second, meeting the customer requirements to everyone’s great relief.
While full stack tracing was critical to the technical understanding of the problem, it was also necessary for the project management. With the high pressure of the impending deadline, engineers had to convincingly explain that the corrective actions being taken had a high degree of probable success. Full stack tracing indicated precisely what the problems were and exactly where in the software stack they were, avoiding any finger-pointing between suppliers.
Until next time
Hopefully you’ll agree that full stack tracing could be a valuable skill to add to your repertoire. Stay tuned – our next blog will show you how to get it set up.
If you’d like to learn more about Debugging and Profiling, you might be interested in one of our training classes. Visit this page to find out some of the areas and correlating tools with which we are used to working, and with which we can help you.