A Race is a Race is a Race is UB An example of the difference between int, volatile int, and std::atomic

15.02.2017

Marc Mutz

No comments

In the last days, I was once again trying to convince fellow programmers that there’s no such thing as a “benign” data race. This is a recurring theme, in particular fueled by the docs of MSVC and Intel x86, which basically seem to say “you don’t need atomics here”.

I perused the excellent papers Benign data races: what could possibly go wrong? by Dmitry Vyukov and How to miscompile programs with “benign” data races by Hans Boehm. I discussed with colleagues. And I think a few more puzzle pieces regarding the relation between std::atomic, volatile and non-volatile, non-atomic access fell into place. Far be it from me to claim I understand relaxed atomics, let alone weak memory models, but I’d still like to share a little a-ha effect with you:

The memory model constrains the compiler, not the hardware.

Read on for details.

QAtomicInteger vs. MSVC

To keep this short, I will assume you are familiar with terms such as happens-before, synchronises-with, data race and that data races are undefined behaviour in C++. You can read up on this on cppreference.com or in the standard ([intro.races]).

The root of the discussion was the implementation of QAtomicInteger::load(). The documentation of the function claims it performs an “atomic[…] load[…] using relaxed memory ordering”. But the implementation just performs a load of a simple int.

In standard C++, this is clearly a data race. See your favourite post-C++11 C++ text book. There is consensus around that part. The controversial question is: is it on MSVC, too? Or is it well-defined?

I’ll spoil it for you by saying I don’t know, but I personally think that it mightn’t. Hurray for commitment, but indulge me as I try to explain…

MSVC extensions

At face value, the implementation does nothing wrong. The Windows API docs state that

Simple reads and writes to properly-aligned [word-sized] variables are atomic operations. In other words, you will not end up with only one portion of the variable updated; all bits are updated in an atomic fashion. However, access is not guaranteed to be synchronized. If two threads are reading and writing from the same variable, you cannot determine if one thread will perform its read operation before the other performs its write operation.

and the C++ standard says that concurrent atomic access does not create data races:

[intro.races]/19: The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other.

It makes sense, too: Modern coherent caches that work with the MESI model make all accesses that do not cross cache-line boundaries atomic. I’m simplifying. You get the idea. And everyone knows that x86 has a sequentially-consistent memory model…

There, everything is peachy.

Or is it? One thing should take you aback: If every access is atomic, why do I need std::atomic in the first place?

As so often is the case when undefined behaviour is involved, the answer is:

It’s the compiler, stupid!

It is utterly irrelevant what the hardware provides. That matters when you program in assembler. When you program in C/C++, you’re programming the C/C++ abstract machine, not any particular hardware. That abstract machine is defined by the standard. The standard makes certain operations undefined. And here comes the death blow: A compiler is allowed to assume that undefined behaviour does not happen.

Over time, compilers have gotten better and better at exploiting that part of the contract and programmers have failed to become better at fulfilling their end of the bargain. Tools like the sanitizers, available on Clang and GCC, empower programmers, but programmers need to buy into the contract, too.

Back to the main topic: Are the MSVC and Windows docs wrong?

Atomic vs. volatile vs. simple int

I believe they are. At least misleading. To understand why, we need to understand the difference between atomics and simple ints.

Let’s take a loop that counts something:

static CounterType counter; // global, zero-initialised

// in some function:
for (...)
   ++counter; // or counter.fetch_add(1, std::memory_order_relaxed) for relaxed atomics

Depending on CounterType, the compiler is allowed to make certain assumptions and is thus allowed to perform certain optimisations (or not).

std::atomic<int>

If CounterType is std::atomic<int>, prefix increment is sequentially consistent. AFAICT, this means that the compiler needs to (make the hardware) write each value into the global variable, one by one: the counter could act as a kind of spinning counted-semaphore, i.e. as a synchronisation mechanism between threads, so there’s nothing the compiler can optimize without possibly breaking the program.

That is a bit as if CounterType was volatile int, except that volatile accesses need not be atomic, so the code would invoke undefined behaviour in this case.

Relaxed memory order

If we use std::atomic<int> with relaxed memory order (counter.fetch_add(1, std::memory_order_relaxed), we give the compiler the freedom to combine writes:

register r = 0;
for (...)
    ++r;
counter.fetch_add(r, std::memory_order_relaxed); // atomic operation

This is allowed, since counter‘s modification order is not synchronised with any other object’s. But the compiler must still guarantee that the only values another thread can observe are those which some thread explicitly wrote (in the C++ source code).

Which brings us to the least-relaxed case:

Simple int

If CounterType is a mere int, the compiler can use an even better optimisation:

register r = counter;
for (...)
    ++r;
counter = r;

Here, we don’t need to atomically add to counter, so this is, indeed, and optimisation on top of the relaxed atomic case.

It is also clear that this optimisation will produce the wrong answer when two threads execute concurrently:

// counter initially zero
// T1                T2

r1 = counter;        r2 = counter;    // r1 == 0, r2 == 0
for (c1 times)       for (c2 times)
    ++r1;                ++r2;
                                      // r1 == c1, r2 == c2
counter = r;                          // counter == c1
                     counter = r;     // counter == c2 !! correct result would be: c1 + c2

But why is it allowed, exactly? Because the access is through a non-volatile, non-atomic object. Therefore the compiler knows that counter cannot be accessed concurrently. Remember that that would be a data race, thus UB, and the compiler may assume that UB cannot happen.

From Hans Boehm’s paper, I expect that almost all compilers routinely performed this more or less unconditionally before C++11/C11.

And it’s a terribly important optimisation. My guess is that most of the wording of the C++ memory model lies in the desire to keep these kinds of optimisations allowed on most objects. And the beauty of SC-DRF is that it succeeds here.

This example made me finally understand the difference between simple loads, volatile loads, and atomic (relaxed) loads. It’s not about what the hardware can or cannot do. It’s mostly about what the compiler can or cannot do.

MSVC again

Now, finally, we can answer the question whether a simple load on MSVC/Windows/x86 is the same as a relaxed atomic load.

For the answer to be “yes, they are the same”, a necessary, though not sufficient, condition is that the MSVC compiler does not perform the last optimisation. Ever.

OTOH, performing the optimisation is sufficient for the answer to be “no, they are not the same”.

And this is all I know.

In particular, I don’t know whether MSVC performs that optimisation or not. But I’d be very surprised if it didn’t. And so I believe that the MSVC/Windows docs are misleading, if not downright wrong.

Conclusion

If in doubt, don’t assume, stick to what the standard says: that potentially-concurrent non-atomic (as in std::atomic) conflicting actions are data races.

I hope I convinced you that even on MSVC, this is undefined behaviour. This is not a “benign” data race.

Defenders of the myth of “benign” data races will be quick to point out that, yes, while this particular case is not a “benign” data race, XXXX is. This is not an argument you can win. It’s like being sued by Google: “You infringe these 10 patents of ours”. “No, here’s proof”. “Ok, you’re right, then. But you infringe on these 10 patents”.

The only thing that you can do is to accept that a bunch of very smart people have put a lot of thought into the memory model. And that if they got it wrong, they will try again, until they get it right. Don’t buy into the myth of “benign” data races. Stick to the letter of the standard, because that’s what the compiler does, too.

Categories: C++