
Update to Linux perf report: Improved handling of inlined frames

Linux perf is an immensely useful and powerful tool suite for profiling C/C++ applications. I have used it extensively and successfully on various customer projects, both for desktop applications and for automotive or industrial projects targeting low-end embedded Linux systems running on ARM hardware.

The biggest problem with perf really is its usability, or lack thereof. Over the years, I have tried my best to educate people on how to use the tool suite, both in the form of conference talks and during KDAB's Linux profiling trainings and workshops, and, most recently, by creating hotspot.

Additionally, I have started contributing upstream to the Linux kernel repository to alleviate some of these problems. A short time ago, an important patch series of mine was merged; it will be part of Linux 4.15 and drastically improves Linux perf usability when dealing with common C++ workloads.

The patch series completely reworked the representation and handling of inlined frames in perf report.

Up until recently, inlined frames were not shown at all in the report. This made interpreting reports from C++ applications very hard, since the compiler inlines a lot of template code.

Take the following contrived example code snippet. It is intentionally written with bad performance characteristics, but it gives a good conceptual representation of the usability issues of perf report with common C++ code:

#include <vector>
#include <complex>
#include <random>
#include <iostream>
#include <cmath>
#include <algorithm>

using namespace std;

namespace {
vector<double> generateVector(size_t entries)
{
    uniform_real_distribution<double> uniform(-1E5, 1E5);
    default_random_engine engine;
    vector<double> vec;
    for (size_t i = 0; i < entries; ++i) {
        vec.push_back(norm(complex<double>(uniform(engine), uniform(engine))));
    }
    return vec;
}
}

int main()
{
    auto vec = generateVector(5000000);
    std::sort(vec.begin(), vec.end());
    std::cout << "median is:" << vec[vec.size() / 2] << '\n';
    return 0;
}

Now compile it with debug symbols and compiler optimizations enabled. Then use perf record for sampling-based CPU profiling and enable stack trace collection:

$ g++ -O2 -g inline.cpp -o inline
$ perf record --call-graph dwarf ./inline
median is:6.36103e+09
[ perf record: Woken up 24 times to write data ]
[ perf record: Captured and wrote 5.819 MB perf.data (721 samples) ]

Then inspect the data with perf report. First, let’s have a look at an excerpt from the top-down aggregated call graph costs:

$ perf report
...
-   97.36%    11.37%  inline   inline            [.] main
   - 85.99% main
      + 56.17% std::__introsort_loop<__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, long, __gnu_cxx::__ops::_Iter_less_iter>
      + 15.53% hypot
        12.21% std::generate_canonical<double, 53ul, std::linear_congruential_engine<unsigned long, 16807ul, 0ul, 2147483647ul> >
      + 1.53% std::vector<double, std::allocator<double> >::_M_realloc_insert<double>
   + 10.54% _start
   + 0.83% 0x7ffcb296f33f
...

The data tells us that main has a self cost of 11.37%. Inclusively, the cost attributed to main is 97.36%, most of which comes from calling std::__introsort_loop, hypot and std::generate_canonical. Do you spot an issue with that? None of the three functions is called directly from within main; rather, their callers have all been inlined into main. Additionally, from a code point of view, main should not have a high self cost: it only calls other functions that actually do the work. As such, it is confusing to novices where the large 11.37% self cost comes from. Again, the answer is inlining.
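To make this concrete, here is a rough, hand-written sketch (my own illustration, not actual compiler output) of what main effectively looks like to the profiler once generateVector and the small library helpers have been inlined into it:

#include <vector>
#include <complex>
#include <random>
#include <iostream>
#include <algorithm>

int main()
{
    // After inlining, the body of generateVector lives directly inside main.
    // Samples taken in the fully inlined pieces (the distribution adaptor, the
    // complex construction, the loop overhead) are attributed to main's self
    // cost, while non-inlined callees such as hypot, generate_canonical and
    // the sort loop still show up as separate frames below main.
    std::uniform_real_distribution<double> uniform(-1E5, 1E5);
    std::default_random_engine engine;
    std::vector<double> vec;
    for (size_t i = 0; i < 5000000; ++i) {
        vec.push_back(std::norm(std::complex<double>(uniform(engine), uniform(engine))));
    }
    std::sort(vec.begin(), vec.end());
    std::cout << "median is:" << vec[vec.size() / 2] << '\n';
    return 0;
}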

Previous work by others added the --inline switch to perf report, which improved this situation somewhat:

$ perf report --inline
-   97.36%    11.37%  inline   inline            [.] main
   + 85.99% main
      main (inline)
      generateVector (inline)
      std::uniform_real_distribution<double>::operator()<std::linear_congruential_engine<unsigned long, 16807ul, 0ul, 2147483647ul> > (inline)
      std::uniform_real_distribution<double>::operator()<std::linear_congruential_engine<unsigned long, 16807ul, 0ul, 2147483647ul> > (inline)
      std::__detail::_Adaptor<std::linear_congruential_engine<unsigned long, 16807ul, 0ul, 2147483647ul>, double>::operator() (inline)

With that, we now see some inlined frames. But these frames are handled completely differently from non-inlined frames. For example, there is no way to get aggregated cost numbers for inlined frames. Additionally, the excerpt above uncovers yet more subtle usability issues: due to the aggregation scheme, only a single inlined call stack gets printed. In reality, multiple functions were inlined into main and all of them should be shown. Where are std::__introsort_loop, hypot, and the others?

What my patch series does

My just-landed patch series resolves these shortcomings. What is more, it adds caches for the inline data on top, which drastically improves the performance of perf report when looking up inlined frames and source file and line pairs. This allows us to include inlined frames by default in reports, i.e. you no longer need to pass --inline explicitly. The result becomes:

$ perf report
-   97.36%    11.37%  inline   inline            [.] main
   - 85.99% main
      - 56.17% std::sort<__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > > > (inlined)
           std::__sort<__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__ops::_Iter_less_iter> (inlined)
         + std::__introsort_loop<__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, long, __gnu_cxx::__ops::_Iter_less_iter>
      - 29.82% generateVector (inlined)
         + 16.09% std::norm<double> (inlined)
         + 12.21% std::uniform_real_distribution<double>::operator()<std::linear_congruential_engine<unsigned long, 16807ul, 0ul, 2147483647ul> > (inlined)
         + 1.53% std::vector<double, std::allocator<double> >::push_back (inlined)

Note how multiple branches of inlined frames in main are shown. Additionally, we now aggregate costs for inlined frames, too.

The -g srcline capability was also contributed by me. It tells perf report to show the source file and line number in the call stacks below each top-level report entry:

$ perf report -s sym -g srcline
-   97.36%    11.37%  [.] main
   - 56.17% main inline.cpp:26
        std::sort<__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > > > stl_algo.h:4836 (inlined)
      + std::__sort<__gnu_cxx::__normal_iterator<double*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__ops::_Iter_less_iter> stl_algo.h:1968 (inlined)
   - 29.82% main inline.cpp:25
      - generateVector inline.cpp:17 (inlined)
         - 16.09% std::norm<double> complex:664 (inlined)
              std::_Norm_helper<true>::_S_do_it<double> complex:654 (inlined)
              std::abs<double> complex:597 (inlined)
            - std::__complex_abs complex:589 (inlined)
               + 15.53% hypot

Paired with the big patch series that improves the handling of inlined frames, this perf report invocation nicely shows us which lines of code contribute the most to the statistical CPU cycle costs.

Overall, I hope you enjoy these improvements as much as I do. If you still have trouble interpreting the reports from Linux perf, consider using the hotspot GUI. And if you need help with Linux perf in general, or with profiling on embedded Linux, contact KDAB for trainings or workshops on these topics!

PS: Let's have a quick look at the call stack to hypot, which also showed up in the initial report excerpt that did not include inlined frames. It allows us to figure out where this call comes from: contrary to my naive initial assumption, std::norm of a complex number a + bi is not implemented as the trivial a^2 + b^2 computation. Rather, for reasons of numerical stability and error handling, it squares the result of std::abs, which internally calls cabs, which finally calls hypot. Note that enabling -ffast-math changes this, and the fast, naive formula is used instead. Overall, this means that std::norm of a complex number is usually slightly slower than calling std::abs directly, contrary to my gut feeling and the existing documentation.
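If you are curious, here is a minimal, self-contained sketch (my own experiment, not part of the patch series) that compares std::norm against the naive re*re + im*im formula. The exact call chain and the timings depend on your standard library and compiler flags, so treat the output purely as illustrative:

#include <chrono>
#include <complex>
#include <iostream>
#include <random>
#include <vector>

int main()
{
    std::uniform_real_distribution<double> uniform(-1E5, 1E5);
    std::default_random_engine engine;
    std::vector<std::complex<double>> values(1000000);
    for (auto &v : values)
        v = {uniform(engine), uniform(engine)};

    // Small helper that times a summation loop and prints the result,
    // so the compiler cannot optimize the work away.
    auto time = [](const char *label, auto &&fn) {
        const auto start = std::chrono::steady_clock::now();
        const double sum = fn();
        const auto end = std::chrono::steady_clock::now();
        std::cout << label << ": sum=" << sum << ", took "
                  << std::chrono::duration<double, std::milli>(end - start).count()
                  << " ms\n";
    };

    time("std::norm", [&] {
        double sum = 0;
        for (const auto &v : values)
            sum += std::norm(v); // may go through abs/cabs/hypot, as seen in the report above
        return sum;
    });
    time("naive re*re + im*im", [&] {
        double sum = 0;
        for (const auto &v : values)
            sum += v.real() * v.real() + v.imag() * v.imag();
        return sum;
    });
    return 0;
}

Compiling this once with -O2 and once with -O2 -ffast-math should let you observe the behavior change described above.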
