
Full Stack Tracing Part 1: What is it and when is it useful?

Full stack tracing is a tool that should be part of every software engineer’s toolkit. It’s the best way to investigate and solve certain classes of hard optimization and debugging problems. Because of the power it gives the developer, we’ll be writing a series of blog posts about it: when to use it, how to set it up, how to create traces, and how to interpret the results. Our goal is to make you capable of using full stack tracing to solve your own tough problems.

First, what is it? Full stack tracing is tracing across the full software stack, from the operating system up to the application. By collecting profiling information (timing, process, caller, API, and other data) from the kernel, drivers, software frameworks, application, and JavaScript environments, you can see exactly how the individual components of a system interact. That opens up areas of investigation that are impossible to reach with standard application profilers, kernel debug messages, or even strategically inserted printf() calls. Think of full stack tracing as a developer’s MRI machine: it lets you look inside a running system, without disturbing it, to determine what is happening. (And unlike other low-level traces that we’ve written about before, full stack tracing provides a simpler way to view activity up and down the entire software stack.)

Use cases

When do you need full stack tracing? Since it gives you visibility across a running system, it’s a great solution anytime you need to do things like:

  • Gain visibility into the software running below the level of your application and API calls
  • Measure timing and synchronization of events that happen across multiple applications
  • Optimize application startup time, locate startup script inefficiencies, and trim down system boot time
  • Verify communication between a client and server process, or any other process pairs like publisher and subscriber, or application and watchdog
  • Detect background processes that interfere with an application
  • Pinpoint places in the system that consume excessive CPU cycles
  • Find problematic interactions between multiple device drivers or independently running software stacks
  • Correctly understand the behavior of a complex system before you make bug fixes or alterations

Battling slow boot time

To put all of this in context, let’s walk through an example where we used full stack tracing to solve a show-stopping problem for a client. In this case, an automotive customer was creating an infotainment system with a rather complex software stack. The main application was created in Qt and ran on an Nvidia Tegra X1, an extremely powerful processor for an embedded system: it sports a quad-core CPU, and its GPU rivals that of a laptop. Because the software required integration from many partners, the developers unfortunately realized very late during integration that, despite all of that power, the infotainment system was taking over five seconds to show the application’s first screen. This was completely unacceptable, and we were asked to diagnose why it was taking so long.

Since much of the application was written in QML, we first took a look at the code. While it certainly wasn’t crafted for speed – and it probably used far too much JavaScript – this didn’t seem like the primary problem. We needed to determine whether the boot issues were due to overall sluggishness, with too many pieces of software running, or whether we could pinpoint the problems to one (or several) specific areas. Firing up the QML profiler highlighted something we didn’t expect: early in startup, a trivial binding evaluation – a string bound to an image file – took 180 milliseconds. What was happening underneath that caused the binding to take so long?

QML profiler trace of binding issue

QML profiler showing long binding in action (similar issue, different app)


That’s a question the QML profiler can’t answer. While you can make many educated guesses about what’s happening, it is exactly this type of problem where full stack tracing is invaluable. It can tell you what Qt, C++, kernel, driver, or hardware interactions are happening underneath the QML, so you can see where the time is being spent.
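On Linux, a capture of this kind is typically set up with LTTng. Below is a minimal, hedged sketch of a kernel-side session: the session name is made up, and it assumes lttng-tools and the LTTng kernel modules are installed, so it guards on availability and privileges and otherwise just reports that it is skipping.

```shell
# Sketch of an LTTng kernel/syscall capture. Requires root and the LTTng
# kernel modules; when those aren't available we skip the live capture.
if command -v lttng >/dev/null 2>&1 && [ "$(id -u)" = "0" ]; then
  lttng create fullstack-demo                      # hypothetical session name
  lttng enable-event --kernel --syscall --all      # all syscall entry/exit events
  lttng enable-event --kernel sched_switch,sched_wakeup   # scheduler activity
  lttng start
  sleep 1                                          # the application under test runs here
  lttng stop
  lttng destroy
  echo "trace written below ~/lttng-traces/"
else
  echo "skipping live capture: needs lttng-tools and root"
fi
```

The resulting trace can then be opened in a viewer such as Trace Compass alongside the userspace events.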

By turning on full stack tracing for the kernel, syscalls, and C++, adding tracepoints within Qt, and rerunning the application, we found that those long bindings occurred because many libraries were being loaded for image format plug-ins. Since the application only used PNG files, we could eliminate a huge number of these unnecessary loads for other formats. That included the SVG plug-in, which was especially heavy: although the application was a Qt Quick app that didn’t use Qt Widgets, the SVG plug-in was pulling in the relatively big Qt Widgets library, which was unneeded. (This was because the plug-in had been compiled with Qt Widgets support enabled.) But while eliminating the plug-ins made a dramatic improvement, the system still wasn’t fast enough – it was taking far longer to boot than it should.
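The fix itself amounts to shipping only the image-format plug-ins the application actually uses. As a sketch – the directory below is simulated with a temp dir, whereas a real Qt deployment keeps these under the plugins/imageformats/ directory of the Qt prefix – pruning everything but the PNG plug-in looks like this:

```shell
# Simulate a Qt imageformats plug-in directory (file names mirror the usual
# libq<format>.so naming; a real deployment would point at the Qt prefix).
PLUGDIR=$(mktemp -d)
touch "$PLUGDIR/libqpng.so" "$PLUGDIR/libqsvg.so" \
      "$PLUGDIR/libqjpeg.so" "$PLUGDIR/libqgif.so"

# Keep only the PNG plug-in; the others would otherwise be dlopen'ed and
# probed at startup even though the app never loads those formats.
find "$PLUGDIR" -name 'libq*.so' ! -name 'libqpng.so' -delete

ls "$PLUGDIR"    # only libqpng.so remains
```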

Again, full stack tracing showed us that disk I/O was a huge culprit: many applications were hammering the disk during startup. However, most of the applications accessing the disk were background tasks that weren’t critical to the startup procedure. Furthermore, most of the disk I/O was coming through a single call within syslog: the processes were emitting reams of debug output, going through syslog to write the data out to the flash drive. The CPU and GPU had more than enough horsepower to manage this, but the on-board flash was really cheap…and pretty slow.

While it might have been better to remove the unnecessary logging from those background tasks, that change would have required a lot of coordination and testing, because many distinct software suppliers were involved. That would have introduced a good deal of risk to a project already perilously close to its production deadline. Instead, we settled for bumping up the main application’s priority. The increased priority let the GUI beat out those background tasks during boot-up; the background tasks could then harmlessly finish their work slightly later. This change brought the splash screen up within one second, meeting the customer’s requirement to everyone’s great relief.
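A priority bump like this can be sketched with ordinary POSIX niceness. Raising priority (a negative nice value) needs root or CAP_SYS_NICE, so the runnable part below only demotes a child process – the unprivileged mirror image of the same idea – and the GUI path is hypothetical:

```shell
# With root, the GUI could be started ahead of the background pack like this
# (path made up for illustration):
#   nice -n -5 /opt/infotainment/gui-app
#
# Unprivileged processes can only *lower* their own priority. `nice` with no
# arguments prints the niceness it runs at, so this demonstrates the demotion:
nice -n 10 nice    # prints 10 when the parent runs at the default niceness of 0
```

In practice, on a systemd-based image the same effect is usually achieved declaratively with a `Nice=` setting in the relevant service units.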

While full stack tracing was critical to the technical understanding of the problem, it was also necessary for project management. With the high pressure of the impending deadline, the engineers had to convincingly explain that the corrective actions being taken had a high probability of success. Full stack tracing indicated precisely what the problems were and exactly where in the software stack they lay, avoiding any finger-pointing between suppliers.

Full stack trace of QML binding issue

QML binding shown with full stack tracing, highlighting SDA activity (similar issue, different app)

Until next time

Hopefully you’ll agree that full stack tracing could be a valuable skill to add to your repertoire. Stay tuned – our next blog will show you how to get it set up.


If you’d like to learn more about debugging and profiling, you might be interested in one of our training classes. Visit this page to find out which areas and tools we regularly work with, and where we can help you.



About KDAB

If you like this article and want to read similar material, consider subscribing via our RSS feed.

Subscribe to KDAB TV for similar informative short video content.

KDAB provides market-leading software consulting and development services and training in Qt, C++ and 3D/OpenGL. Contact us.

7 thoughts on “Full Stack Tracing Part 1”

  1. Milian, thank you for the great post and the many past presentations and blog posts!

    Related to this post: Rico Mariani has written Holistic approaches to performance analysis [1], where he describes how to zoom out and investigate performance issues by looking at the whole system. There’s a lot to be gained from that kind of approach, and I’m glad to see the tooling improve for Qt/Linux, in no small part thanks to KDAB.


    1. Milian Wolff

      Yes, ETW is a godsend, and I would really like to have something similar on Linux. The power of that technology is also nicely exemplified by Bruce Dawson’s blog posts, e.g.

      And while I agree that the Linux landscape is improving, there is still a long road ahead until we have something as powerful as UIforETW and WPA.

      Thanks for your comment!

      1. It might be of interest to you that you can load LTTng traces into WPA with the current WPA Preview, which is part of the Windows ADK, when you compile the Linux plugins for WPA with the WPA SDK. The plan is to bring the power of WPA to any tracing solution, opening WPA up to any trace file format, including your own. These are exciting times. Great work on the LTTng stuff. Compared to ETW, Linux still seems to have a long way to go to capture the relevant events. Normally I need
        a) Context Switch Traces with Switch out call stack and ready call stack
        b) Profiling Sampling data with 1 kHz with Call Stacks
        c) Disk IO duration per file or directory (no stacks)
        d) Sampled Memory consumption from all processes (e.g. every 0.5s)
        e) Process start/end duration and command line

        I am still reading into LTTng, but as far as I understand it can deliver a)–e) with the exception of b), for which perf is still needed on top. Is this correct?
        Do you have an LTTng tracer call with these data providers enabled at hand? It would be interesting to see where the gaps are compared to ETW.

        1. Milian Wolff

          Hey, that’s really great news! Any chance to port WPA to Linux, such that we can run it there too? Otherwise it’s going to be relatively useless to me I’m afraid 😉 I won’t be rebooting into Windows just to look at some performance data.

          Regarding your question:
          You are right in your assessment – sampling requires the perf subsystem, but there’s an experimental LTTng data source which integrates that. Though personally I’d rather see a)–e) implemented in terms of eBPF or perf directly. That should be doable, but we will need a new kind of kernel tracepoint or similar to allow memory consumption sampling.

          Regarding the list of useful lttng traces, have a look at part 2 of this series, i.e.:

          1. WPA now runs on .NET Core 3.1. No, it won’t be ported to Linux natively, but you should be able to get it up and running in Wine to some extent. If something is missing, you can give feedback and hope that some UI glitches will be solved. I have Brendan Gregg’s book (BPF Performance Tools), but to me it just looks like another way to skin the cat.

            What is the technical advantage of eBPF over LTTng, and is it really separate? Could LTTng also integrate eBPF to get everything into one place?
            So LTTng gives me context switch traces, and with stack traces (lttng add-context --kernel --type callstack-kernel --type callstack-user) I get full stack traces in Trace Compass like in WPA? I have tried that, but it did not work out. Are there specific tricks necessary to get scheduler events with stack traces?

  2. Milian Wolff

    Ah cool, I have to try and see if one can get WPA to work under wine. Thanks for the suggestion!

    Regarding eBPF vs. LTTng: the big difference is that eBPF is upstream in the kernel, whereas LTTng is a separate module and thus comes with an additional setup burden. Furthermore, at least in principle, eBPF has full access to the existing perf subsystem and can already do stack sampling (albeit only with frame pointers, which is not good enough for most user space application setups). That said, you are completely right in saying that LTTng could be rebased on top of eBPF to some degree.

    Regarding call stacks for context switches – no, with LTTng you won’t get any call stacks associated with an event by default. I thought one could add that somehow nowadays too, but I fail to find it right now – maybe it’s hidden somewhere in the lttng mailing list archives.

    If you “just” want to investigate off-CPU time, I can recommend hotspot’s off-cpu profiling capabilities based on perf instead:
