Four Habit-Forming Tips to Faster C++

21 July 2016

1) Make use of the (named-) return-value optimisation

According to Lawrence Crowl, (named-) return-value optimisation ((N)RVO) is one of the most important optimisations in modern C++. Okay—what is it?

Let’s start with plain return-value optimization (RVO). Normally, when a C++ method returns an unnamed object, the compiler creates a temporary object, which is then copy-constructed into the target object.

MyData myFunction() {
    return MyData(); // Create and return unnamed obj
}

MyData abc = myFunction();

With RVO, the C++ standard allows the compiler to skip the creation of the temporary, treating both object instances—the one inside the function and the one assigned to the variable outside the function—as the same. This usually goes under the name of copy elision. But what is elided here is the temporary and the copy.

So, not only do you save the copy constructor call, you also save the destructor call, as well as some stack memory. Clearly, elimination of extra calls and temporaries saves time and space, but crucially, RVO is an enabler for pass-by-value designs. Imagine MyData was a large million-by-million matrix. There mere chance that some target compiler could fail to implement this optimisation would make every good programmer shy away from return-by-value and resort to out parameters instead (more on those further down).

As an aside: don't C++ Move Semantics solve this? The answer is: no. If you move instead of copy, you still have the temporary and its destructor call in the executable code. And if your matrix is not heap-allocated, but statically sized, such as a std::array<std::array<double, 1000>, 1000>>, moving is the same as copying. With RVO, you mustn't be afraid of returning by value. You must unlearn what you have learned and embrace return-by-value.

Named Return Value Optimization is similar but it allows the compiler to eliminate not just rvalues (temporaries), but lvalues (local variables), too, under certain conditions.

What all compilers these days (and for some time now) reliably implement is NRVO in the case where there is a single variable that is passed to every return, and declared at function scope as the first variable:

MyData myFunction() {
    MyData result;           // Declare return val in ONE place

    if (doing_something) {
        return result;       // Return same val everywhere
    }
    // Doing something else
    return result;           // Return same val everywhere
}

MyData abc = myFunction();

Sadly, many compilers, including GCC, fail to apply NRVO when you deviate even slightly from the basic pattern:

MyData myFunction() {
    if (doing_something)
        return MyData();     // RVO expected

    MyData result;

    // ...
    return result;           // NRVO expected
}

MyData abc = myFunction();

At least GCC fails to use NRVO for the second return statement in that function. The fix in this case is easy (go back to the first version), but it's not always that easy. It is an altogether sad state of affairs for a language that is said to have the most advanced optimisers available to it for compilers to fail to implement this very basic optimisation.

So, for the time being, get your fingers accustomed to typing the classical NRVO pattern: it enables the compiler to generate code that does what you want in the most efficient way enabled by the C++ standard.

If diving into assembly code to check whether a particular patterns makes your compiler drop NRVO isn't your thing, Thomas Brown provides a very comprehensive list of compilers tested for their NRVO support and I’ve extended Brown’s work with some additional results.

If you start using the NVRO pattern but aren’t getting the results you expect, your compiler may not automatically perform NRVO transformations. You may need to check your compiler optimization settings and explicitly enable them.

2) Return parameters by value whenever possible

This is pretty simple: don’t use “out-parameters”. The result for the caller is certainly kinder: we just return our value instead of having the caller allocate a variable and pass in a reference. Even if your function returns multiple results, nearly all of the time you’re much better off creating a small result struct that the function passes back (via (N)RVO!):

That is, instead of this:

void convertToFraction(double val, int &amp;numerator, int &amp;denominator) {
    numerator = /*calculation */ ;
    denominator = /*calculation */ ;
}

int numerator, denominator;
convertToFraction(val, numerator, denominator); // or was it &quot;denominator, nominator&quot;?
use(numerator);
use(denominator);

You should prefer this:

struct fractional_parts {
    int numerator;
    int denominator;
};

fractional_parts convertToFraction(double val) {
    int numerator = /*calculation */ ;
    int denominator = /*calculation */ ;
    return {numerator, denominator}; // C++11 braced initialisation -&gt; RVO
}

auto parts = convertToFraction(val);
use(parts.numerator);
use(parts.denominator);

This may seem surprising, even counter-intuitive, for programmers that cut their teeth on older x86 architectures. You’re just passing around a pointer instead of a big chunk of data, right? Quite simply, “out” parameter pointers force a modern compiler to avoid certain optimisations when calling non-inlined functions. Because the compiler can’t always determine if the function call may change an underlying value (due to aliasing), it can’t beneficially keep the value in a CPU register or reorder instructions around it. Besides, compilers have gotten pretty smart—they don’t actually do expensive value passing unless they need to (see the next tip). With 64-bit and even 32-bit CPUs, small structs can be packed into registers or automatically allocated on the stack as needed by the compiler. Returning results by value allows the compiler to understand that there isn’t any modification or aliasing happening to your parameters, and you and your callers get to write simpler code.

3) Cache member-variables and reference-parameters

This rule is straightforward: take a copy of the member-variables or reference-parameters you are going to use within your function at the top of the function, instead of using them directly throughout the method. There are two good reasons for this.

The first is the same as the tip above—because pointer references (even member-variables in methods, as they're accessed through the implicit this pointer) put a stick in the wheels of the compiler’s optimisation. The compiler can’t guarantee that things don’t change outside its view, so it takes a very conservative (and in most cases wasteful) approach and throws away any state information it may have gleaned about those variables each time they’re used anew. And that’s valuable information that can help the compiler eliminate instructions and references to memory.

Another important reason is correctness. As an example provided by Lawrence Crowl in his CppCon 2014 talk “The Implementation of Value Types”, instead of this complex number multiplication:

template &lt;class T&gt;
complex&lt;T&gt; &amp;complex&lt;T&gt;::operator*=(const complex&lt;T&gt; &amp;a) {
   T r = real * a.real – imag * a.imag;
   imag = real * a.imag + imag * a.real;
   real = r;
   return *this;
}

You should prefer this version:

template &lt;class T&gt;
complex&lt;T&gt; &amp;complex&lt;T&gt;::operator*=(const complex&lt;T&gt; &amp;a) {
   T a_real = a.real, a_imag = a.imag;
   T t_real =   real, t_imag =   imag; // t == this
   real = t_real * a_real – t_imag * a_imag;
   imag = t_real * a_imag + t_imag * a_real;
   return *this;
}

This second, non-aliased version will still work properly if you use value *= value to square a number; the first one won’t give you the right value because it doesn’t protect against aliased variables.

To summarise succinctly: read from (and write to!) each non-local variable exactly once in every function.

4) Organize your member variables intelligently

Is it better to organize member variables for readability or for the compiler? Ideally, you pick a scheme that works for both.

And now is a perfect time for a short refresher about CPU caches. Of course data coming from memory is very slow compared to data coming from a cache. An important fact to remember is that data is loaded into the cache in (typically) 64-byte blocks called cache lines. The cache line—that is, your requested data and the 64 bytes surrounding it—is loaded on your first request for memory absent in the cache. Because every cache miss silently penalises your program, you want a well-considered strategy for ensuring you reduce cache misses whenever possible. Even if the first memory access is outside the cache, trying to structure your accesses so that a second, third, or forth access is within the cache will have a significant impact on speed. With that in mind, consider these tips for your member-variable declarations:

Move the most-frequently-used member-variables first
Move the least-frequently-used member-variables last
If variables are often used together, group them near each other
Try to reference variables in your functions in the order they’re declared
Keep an eye out on alignment requirements of member-variables, lest you waste space on padding

Nearly all C++ compilers organize member variables in memory in the order in which they are declared. And grouping your member variables using the above guidelines can help reduce cache misses that drastically impact performance. Although compilers can be smart about creating code that works with caching strategies in a way that’s hard for humans to track, the C++ rules on class layout make it hard for compilers to really shine. Your goal here is to help the compiler by stacking the deck on cache-line loads that will preferentially load the variables in the order you’ll need them.

This can be a tough one if you’re not sure how frequently things are used. While it’s not always easy for complicated classes to know what member variables may be touched more often, generally following this rule of thumb as well as you can will help. Certainly for the simpler classes (string, dates/times, points, complex, quaternions, etc) you’ll probably be accessing most member variables most of the time, but you can still declare and access your member variables in a consistent way that will help guarantee that you’re minimizing your cache misses.

Conclusion

The bottomline is that it still takes some amount of hand-holding to get a compiler to generate the best code. Good coding-habits are by no means the end-all, but are certainly a great place to start.

Tags:

c++

18 Comments

Interesting article. I am a bit surprised by rule 3 "Cache member-variables and reference-parameters". What is the point of passing a parameter by const reference if a copy is made anyway ? Also, what is variable t in the second version of the example ? It feels like "t." should have been "this->"

What is the point of passing a parameter by const reference if a copy is made anyway ?

In this case, passing in a by value would have been preferable, yes. Simple examples are usually too simple to be free of trivialities like that. In the general case, you might want to pass by cref to centralise the copy in a non-inline function, instead of cluttering client code with copy ctor calls, or you are not accessing all the variables of the passed object, only some.

Also, it might not suffice to take a copy of the parameter. If you copy a std::vector inside the function, you're likely doing something wrong. But if you access the first element of it repeatedly, you should cache it in a temporary.

Also, what is variable t in the second version of the example ? It feels like “t.” should have been “this->”

You are correct. Will be fixed shortly.

I'm looking at the instructions generated by both versions of the function/operator, which I have named func1() and func2(). I am far from being an expert in assembly but both versions seem to generate the same number of instructions and memory references. https://godbolt.org/g/rX8o0R

It would be very interesting to have some benchmark results showing the improvements gained by caching non-local reference, if you have any available.

Even though probably not what you were looking for, I find https://codereview.qt-project.org/106093 particularly interesting, because the function is not entirely trivial and the savings are quite sizable (¼KiB for a single variable that's written to exactly twice is impressive in my book - ymmv). I find it also helps me understand what's going on, since I don't need to understand the effects of any of the functions called to track the value of newStatus.

Very interesting article. I have one question, Marc, regarding the 4 tips for cashing class members. I agree with tips 3 and 4, i.e. grouping related members together. But I don't see why MRU members should go first, and LRU ones at the end. Essentially, it doesn't matter if your mostly used members are in one cache line or the next, right? I think that with tips 3 and 4, the result is the same, no matter of the ordering.

The idea is twofold, and doesn't have anything to do with separate cachelines. In fact, if your object spans two cachelines, you have more than one spot for the most-often (not most-recently) used member: every cacheline width a new one. Details:

You have a this pointer, and that points the the first-declared sub-object. If you need to offset into the object, you have another addition to perform.
Since a cacheline is wider than the memory word size, cachelines are filled in a particular order, usually starting with the word at the lowest address. That means that the first eight bytes of a cacheline (assuming a 64-bit memory bus) arrive in the cache first. If the data is located there, the CPU can continue work immediately. If the data the CPU waits for is in the second eight-byte-block, the latency is higher, in the third, even higher, etc.

Now, there's something called CWF (Critical Word First), which makes the memory-subsystem deliver the word the CPU is blocked on first, even though it's not the one at the lowest address. But AFAIK, CMIIW, that optimisation cannot arbitrarily reorder the words, so if the CPU first accesses data in the second word, then in the last word, it will profit from CWF for the first access but still take the latency hit on the second. Also, again, feel free to CMIIW, the hw prefetcher always loads cachelines in the standard order.

Return parameters by value whenever possible

There is an important exception to the rule - functions like getline which would need always to allocate new blocks of memory if they returned new strings, instead of reusing previously allocated memory of a ref-passed output string.

string getline()

void getline(string &)

For ref-passed, sometimes it will need reallocation, if the line we are currently reading is longer than the capacity of the string, but most of the time it will not be the case.

I disagree. If you need to pass in capacity, do so by moving:

string getline(string = string());

See also https://codereview.qt-project.org/157376 (last paragraph of commit message).

I agree with Ivan,

Moving is not always free. A class with an internal mutex is not cheaply movable. For big classes moving requires non-trivial work. There are non-movable classes.

An in-out parameter is certainly uglier but more readable when multiple in-out parameters are required.

Return by value is overhyped.

Very interesting article, however in tip 4, you haven't considered the memory alignment of the object which is usually depends on the order the member variables are declared.

What if the most used variables are not the multiple of 4-bytes, and hence compiler would add the padding bytes, and increase the size of the whole object.

This rule might be good for the special use case where the object with optimized memory layout is bigger than 64-byte and has the most accessed members in > 64-bytes.

Of course, there will always be pathological situations in which following any of these tips makes code slower. That doesn't void the tips. Use common sense to find the exceptions :)

That said, yes, the tip was supposed to be about the caching, but the section title suggests a wider scope. Will add.

Thanks!

("Return parameters by value whenever possible" title should be numbered 2.)

2: Search for "parts.nonimator" and notice your typo ;^>

3: the complex-multiply code is broken even for x *= y, since the method updates real before using it in the computation of imag, which should be using the prior value of real anyway. Correct code, even when a isn't this, requires cacheing of real.

Indeed, fixed! (finally :-)

Old article, but I just can't help myself -- your 1st formula for complex multiplication actually produces the WRONG answer. The calculation (2nd line of code) for (this->) imag refers to (this->) real WHICH has ALREADY been changed by the line of code above it!!. This problem goes away with your 2nd version (with the parameter-copying).

Of course, you're right - and this was also noted in the commend from Edward Welbourne back in 2017 which I didn't see until now.

Well, the way it's fixed now, the first version also doesn't have an aliasing problem anymore, rendering the prose around pointless. For a real-life example, see https://codereview.qt-project.org/c/qt/qtbase/+/191706