
Full Stack Tracing Part 1: What is it and when is it useful?

Full stack tracing is a tool that should be part of every software engineer’s toolkit. It’s the best way to investigate and solve certain classes of hard optimization and debugging problems. Because of the power it gives the developer, we’ll be writing a series of blog posts about it: when to use it, how to set it up, how to create traces, and how to interpret the results. Our goal is to make you capable of using full stack tracing to solve your own tough problems.

First, what is it? Full stack tracing is tracing across the full software stack, from the operating system up to the application. By collecting profiling information (timing, process, caller, API, and other data) from the kernel, drivers, software frameworks, application, and JavaScript environments, you can see exactly how the individual components of a system interact. That opens up areas of investigation that are impossible to reach with standard application profilers, kernel debug messages, or even strategically inserted printf() calls. Think of full stack tracing as a developer’s MRI machine: it lets you look inside a running system, without disturbing it, to determine what is happening. (And unlike other low-level traces that we’ve written about before, full stack tracing provides a simpler way to view activity up and down the entire software stack.)

Use cases

When do you need full stack tracing? Since it gives you visibility across a running system, it’s a great solution anytime you need to do things like:

  • Gain visibility into the software running below the level of your application and API calls
  • Measure timing and synchronization of events that happen across multiple applications
  • Optimize application startup time, locate startup script inefficiencies, and trim down system boot time
  • Verify communication between a client and server process, or any other process pairs like publisher and subscriber, or application and watchdog
  • Detect background processes that interfere with an application
  • Pinpoint places in the system that consume excessive CPU cycles
  • Find problematic interactions between multiple device drivers or independently running software stacks
  • Correctly understand the behavior of a complex system before you make bug fixes or alterations

Battling slow boot time

To put all of this in context, let’s walk through an example where we used full stack tracing to solve a show-stopping problem for a client. In this case, an automotive customer was creating an infotainment system with a rather complex software stack. The main application was created in Qt and ran on an Nvidia Tegra X1, an extremely powerful processor for an embedded system: it sports a quad-core CPU, and its GPU rivals that of a laptop. Because the software required integration from many partners, the developers unfortunately realized very late during integration that, despite all of that power, the infotainment system was taking over five seconds to show the application’s first screen. This was completely unacceptable, and we were asked to diagnose why it was taking so long.

Since much of the application was written in QML, we first took a look at the code. While it certainly wasn’t crafted for speed – and it probably used far too much JavaScript – this didn’t seem like the primary problem. We needed to determine whether the boot issues were due to overall sluggishness, with too many pieces of software running, or whether we could pinpoint the problems to one (or several) specific areas. Firing up the QML profiler highlighted something we didn’t expect: early in startup, a trivial binding evaluation – a string bound to an image file – took 180 milliseconds. What was happening underneath that caused the binding to take so long?

QML profiler trace of binding issue

QML profiler showing long binding in action (similar issue, different app)


That’s a question the QML profiler can’t answer. While you can make many educated guesses about what’s happening, it is exactly this type of problem where full stack tracing is invaluable. It can tell you what Qt, C++, kernel, driver, or hardware interactions are happening underneath the QML, so you can see where the time is being spent.
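On Linux, a capture of this kind is typically set up with LTTng. Below is a minimal, hedged sketch of a kernel-side session: the session name is made up, and it assumes lttng-tools and the LTTng kernel modules are installed, so it guards on availability and privileges and otherwise just reports that it is skipping.

```shell
# Sketch of an LTTng kernel/syscall capture. Requires root and the LTTng
# kernel modules; when those aren't available we skip the live capture.
if command -v lttng >/dev/null 2>&1 && [ "$(id -u)" = "0" ]; then
  lttng create fullstack-demo                      # hypothetical session name
  lttng enable-event --kernel --syscall --all      # all syscall entry/exit events
  lttng enable-event --kernel sched_switch,sched_wakeup   # scheduler activity
  lttng start
  sleep 1                                          # the application under test runs here
  lttng stop
  lttng destroy
  echo "trace written below ~/lttng-traces/"
else
  echo "skipping live capture: needs lttng-tools and root"
fi
```

The resulting trace can then be opened in a viewer such as Trace Compass alongside the userspace events.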

By turning on full stack tracing for the kernel, syscalls, and C++, adding tracepoints within Qt, and rerunning the application, we found that those long bindings occurred because many libraries were being loaded for image format plug-ins. Since the application only used PNG files, we could eliminate a huge number of these unnecessary loads for other formats. That included the SVG plug-in, which was especially heavy: although the application was a Qt Quick app that didn’t use Qt Widgets, the SVG plug-in was pulling in the relatively big Qt Widgets library, which was unneeded. (This was because the plug-in had been compiled with Qt Widgets support enabled.) But while eliminating the plug-ins made a dramatic improvement, the system still wasn’t fast enough – it was taking far longer to boot than it should.
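The fix itself amounts to shipping only the image-format plug-ins the application actually uses. As a sketch – the directory below is simulated with a temp dir, whereas a real Qt deployment keeps these under the plugins/imageformats/ directory of the Qt prefix – pruning everything but the PNG plug-in looks like this:

```shell
# Simulate a Qt imageformats plug-in directory (file names mirror the usual
# libq<format>.so naming; a real deployment would point at the Qt prefix).
PLUGDIR=$(mktemp -d)
touch "$PLUGDIR/libqpng.so" "$PLUGDIR/libqsvg.so" \
      "$PLUGDIR/libqjpeg.so" "$PLUGDIR/libqgif.so"

# Keep only the PNG plug-in; the others would otherwise be dlopen'ed and
# probed at startup even though the app never loads those formats.
find "$PLUGDIR" -name 'libq*.so' ! -name 'libqpng.so' -delete

ls "$PLUGDIR"    # only libqpng.so remains
```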

Again, full stack tracing showed us that disk I/O was a huge culprit: many applications were hammering the disk during startup. However, most of the applications accessing the disk were background tasks that weren’t critical to the startup procedure. Furthermore, most of the disk I/O was coming through a single call within syslog: the processes were emitting reams of debug output, going through syslog to write the data out to the flash drive. The CPU and GPU had more than enough horsepower to manage this, but the on-board flash was really cheap…and pretty slow.

While it might have been better to remove the unnecessary logging from those background tasks, that change would have required a lot of coordination and testing, because many distinct software suppliers were involved. That would have introduced a good deal of risk to a project already perilously close to its production deadline. Instead, we settled for bumping up the main application’s priority. The increased priority let the GUI beat out those background tasks during boot-up; the background tasks could then harmlessly finish their work slightly later. This change brought the splash screen up within one second, meeting the customer’s requirement to everyone’s great relief.
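A priority bump like this can be sketched with ordinary POSIX niceness. Raising priority (a negative nice value) needs root or CAP_SYS_NICE, so the runnable part below only demotes a child process – the unprivileged mirror image of the same idea – and the GUI path is hypothetical:

```shell
# With root, the GUI could be started ahead of the background pack like this
# (path made up for illustration):
#   nice -n -5 /opt/infotainment/gui-app
#
# Unprivileged processes can only *lower* their own priority. `nice` with no
# arguments prints the niceness it runs at, so this demonstrates the demotion:
nice -n 10 nice    # prints 10 when the parent runs at the default niceness of 0
```

In practice, on a systemd-based image the same effect is usually achieved declaratively with a `Nice=` setting in the relevant service units.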

While full stack tracing was critical to the technical understanding of the problem, it was also necessary for project management. With the high pressure of the impending deadline, the engineers had to convincingly explain that the corrective actions being taken had a high probability of success. Full stack tracing indicated precisely what the problems were and exactly where in the software stack they lay, avoiding any finger-pointing between suppliers.

Full stack trace of QML binding issue

QML binding shown with full stack tracing, highlighting SDA activity (similar issue, different app)

Until next time

Hopefully you’ll agree that full stack tracing could be a valuable skill to add to your repertoire. Stay tuned – our next blog will show you how to get it set up.


If you’d like to learn more about debugging and profiling, you might be interested in one of our training classes. Visit this page to find out which areas and tools we regularly work with, and where we can help you.



About KDAB

If you like this article and want to read similar material, consider subscribing via our RSS feed.

Subscribe to KDAB TV for similar informative short video content.

KDAB provides market-leading software consulting and development services and training in Qt, C++ and 3D/OpenGL. Contact us.

7 thoughts on “Full Stack Tracing Part 1”

  1. Milian, thank you for the great post and the many past presentations and blog posts!

    Related to this post: Rico Mariani has written Holistic approaches to performance analysis [1], where he describes how to zoom out and investigate performance issues by looking at the whole system. There’s a lot to be gained from that kind of approach, and I’m glad to see the tooling improve for Qt/Linux, in no small part thanks to KDAB.


    1. Milian Wolff

      Yes, ETW is a godsend, and I would really like to have something similar on Linux. The power of that technology is also nicely exemplified by Bruce Dawson’s blog posts, e.g.

      And while I agree that the Linux landscape is improving, there is still a long road ahead until we have something as powerful as UIforETW and WPA.

      Thanks for your comment!

      1. It might be of interest to you that you can load LTTng traces into WPA with the current WPA Preview, which is part of the Windows ADK, when you compile the Linux plugins for WPA with the WPA SDK. The plan is to bring the power of WPA to any tracing solution, opening WPA up to any trace file format, including your own. These are exciting times. Great work on the LTTng stuff. Compared to ETW, Linux still seems to have a long way to go to capture the relevant events. Normally I need
        a) Context Switch Traces with Switch out call stack and ready call stack
        b) Profiling Sampling data with 1 kHz with Call Stacks
        c) Disk IO duration per file or directory (no stacks)
        d) Sampled Memory consumption from all processes (e.g. every 0.5s)
        e) Process start/end duration and command line

        I am still reading into LTTng, but as far as I understand it can deliver a)–e) with the exception of b), for which perf is still needed on top. Is this correct?
        Do you have an LTTng tracer call with these data providers enabled at hand? It would be interesting to see where the gaps are compared to ETW.

        1. Milian Wolff

          Hey, that’s really great news! Any chance to port WPA to Linux, such that we can run it there too? Otherwise it’s going to be relatively useless to me I’m afraid 😉 I won’t be rebooting into Windows just to look at some performance data.

          Regarding your question:
          You are right in your assessment – sampling requires the perf subsystem, but there’s an experimental LTTng data source which integrates that. Though personally I’d rather see a)–e) implemented in terms of eBPF or perf directly. That should be doable, but we will need a new kind of kernel tracepoint or similar to allow memory consumption sampling.

          Regarding the list of useful lttng traces, have a look at part 2 of this series, i.e.:

          1. WPA now runs on .NET Core 3.1. No, it won’t be ported to Linux natively, but you should be able to get it up and running in Wine to some extent. If something is missing, you can give feedback and hope that some UI glitches will be solved. I have Brendan Gregg’s book (BPF Performance Tools), but to me it just looks like another way to skin the cat.

            What is the technical advantage of eBPF over LTTng, and is it really separate? Could LTTng also integrate eBPF to get everything into one place?
            So LTTng gives me context switch traces, and with stack traces (lttng add-context --kernel --type callstack-kernel --type callstack-user) I get full stack traces in Trace Compass like in WPA? I have tried that, but it did not work out. Are there specific tricks necessary to get scheduler events with stack traces?

  2. Milian Wolff

    Ah cool, I have to try and see if one can get WPA to work under wine. Thanks for the suggestion!

    Regarding eBPF vs. LTTng: the big difference is that eBPF is upstream in the kernel, whereas LTTng is a separate module and thus comes with an additional setup burden. Furthermore, at least in principle, eBPF has full access to the existing perf subsystem and can already do stack sampling (albeit only with frame pointers, which is not good enough for most user space application setups). That said, you are completely right in saying that LTTng could be rebased on top of eBPF to some degree.

    Regarding call stacks for context switches – no, with LTTng you won’t get any call stacks associated with an event by default. I thought one could add that somehow nowadays too, but I fail to find it right now – maybe it’s hidden somewhere in the lttng mailing list archives.

    If you “just” want to investigate off-CPU time, I can recommend hotspot’s off-cpu profiling capabilities based on perf instead:
