Advanced Thread Safety in C++

Paul J. Lucas - Jul 13 '23 - - Dev Community

Introduction

C++ supports writing programs where parts of the code run concurrently using threads. When writing such programs, you have to take additional steps to ensure that data shared among threads can not cause race conditions. Typically, race conditions are avoided by proper use of mutexes and locks. However, in high-performance code, mutexes can sometimes be too costly, especially with high-contention data. C++ supports alternatives.

Note: what follows will not describe how to create or join threads since those things are largely straightforward. Instead, what follows will describe how to write programs safely in a multithreaded program.

A Brief History

In the late 1960s, early 1970s, programs (running on computers such as the PDP-7 and later the PDP-11) were:

  • Single-threaded. (While nascent “threads” appeared in 1966, POSIX threads didn’t arrive until 1995.)
  • Executed sequentially. (That is, machine code instructions generated by compilers were executed one at a time and in the same order as the original statements in your program were written.)
  • Running in flat (non-hierarchical) memory. (CPU caches didn’t exist until the late 1970s and didn’t go mainstream until 1993 with the Intel Pentium.)

While CPU performance was increasing rapidly, memory performance lagged:

CPU-Memory Performance Gap

Performance Mitigation Tactics

In order not to have overall performance constrained by memory, CPU designers had to employ mitigation tactics of:

All of these tactics are done entirely by the hardware. You as the programmer have no way either to know or really influence what’s going on under the hood.

Additionally, compiler implementors also employ their own mitigation tactics of:

The compiler is free to do any optimization so long as it makes no change to the observable behavior of your program (the “as-if rule”).

As long as your program has only a single thread, you can remain blissfully unaware of either the hardware or compiler tactics being employed. However, once you have multiple threads, you must become acutely aware of what’s going on under the hood.

Atomic

When doing multithreaded programming, you’ll hear the term “atomic” used. What, exactly, is meant by “atomic” anyway? There are three related, but distinct, meanings:

  1. A value operation (read or write) completes with no possible intervening operation by another thread, e.g., writing single-byte values on any CPU, or 16- or 32-bit values on a 32-bit CPU, or 64-bit values on a 64-bit CPU.

  2. An updated value is visible to other CPUs.

  3. Multiple value operations complete with no possible intervening operation by another thread, e.g., updating A and B together (“transactional”).

“Atomic” always means #1; it usually also means #2; or all three.

A (Bad) Example

Consider the following code where the main thread spawns another. Both threads run so long as shutdown remains false. To shut down, the main thread sets shutdown to true. In response, both while loops will exit and the main thread will join the other thread. Unfortunately, the code is wrong. (More unfortunately, I’ve seen such code in production.)



bool shutdown;

void thread_1() {
    while ( !shutdown ) {
        // ...
    }
}

int main() {
    std::thread t1{ thread_1 };
    while ( !shutdown ) {
        // ...
        if ( some_condition )
            shutdown = true;
    }
    t1.join();
}


Enter fullscreen mode Exit fullscreen mode

It’s wrong because shutdown is shared by multiple threads improperly. Some programmers mistakenly think along the lines of:

“It’s just a bool, a single byte, so of course it’s thread-safe because you can’t read or write half a byte, so it must be atomic.”

The problem is that while that’s actually true, it’s insufficient. A bool is atomic only by meaning #1; it’s not atomic by meaning #2. That is, just because a bool is updated by one thread does not mean that updated value is visible to other threads (running on other CPUs).

A (Bad) Fix: volatile

I’ve also seen code where programmers knew they had to do something more to make such code thread-safe. Unfortunately, that something turned out to be:



volatile bool shutdown;  // Thread-safe now?  No!


Enter fullscreen mode Exit fullscreen mode

That is, they inserted volatile because they kind-of understand what it does, but not what it’s for. The attempted use of volatile to make something thread-safe is always wrong in C++.

To understand what volatile does and what it’s for, see here.

A Better Fix: Mutexes & Locks

The classic way to fix this code (correctly) would be to use a mutex:



bool shutdown;
std::mutex shutdown_mux;

void thread_1() {
    while ( true ) {
        // ...
        std::lock_guard<std::mutex> lock{ shutdown_mux };
        if ( shutdown )
            break;
    }
}

int main() {
    std::thread t1{ thread_1 };
    while ( true ) {
        // ...
        if ( some_condition ) {
            std::lock_guard<std::mutex> lock{ shutdown_mux };
            shutdown = true;
            break;
        }
    }
    t1.join();
}


Enter fullscreen mode Exit fullscreen mode

The mutexes and locks make shutdown be atomic by meaning #1 and meaning #2. While not necessary for this program, mutexes and locks also make things atomic by meaning #3 as well, so they’re the most general, one-size-fits-all tool for making programs thread-safe. However, C++ offers alternative tools that provide a more custom fit especially useful in high-performance code.

A Word of Caution

Writing safe multithreaded code even with ordinary mutexes and locks is hard to get right, even for experts. Writing safe multithreaded code using advanced thread-safety techniques is even harder! Before considering them, you should:

  • Know that using efficient algorithms matters far more than locking technique. For example, an N⋅log(N) algorithm with slower mutexes will still very likely be more performant than an N2 algorithm with faster locking (or lock-free) techniques.

  • Measure to see if your code is spending too much time locking. (Remember: results are CPU-dependent!)

  • Only if locking takes a significant percentage of time, then consider the following techniques.

An Even Better Fix: std::atomic

While use of mutexes and locks is correct, they’re fairly heavy-weight. In the case of using a bool, a much lighter-weight fix is:



std::atomic<bool> shutdown;  // Thread-safe now?  Yes!


Enter fullscreen mode Exit fullscreen mode

Use of std::atomic<T> does not make shutdown atomic by meaning #1 (it already is); it makes it atomic by meaning #2.

The std::atomic<T> class is specialized for all built-in types (bool, char, int, etc.) and all pointer types (T* for any T). It can even be used for your own type T, but only when:

  1. T is trivially copyable (that is, if memcpy() would correctly copy T).
  2. sizeof(T) ≤ 16 (typically).

When T is an integral or pointer type, std::atomic<T> has conventional operators overloaded for it:



std::atomic<int> x{ 0 };

x = 1;      // x.store(1)
int i = x;  // x.load()
++x;        // x.fetch_add(1)
x++;        // x.fetch_add(1)
--x;        // x.fetch_sub(1)
x--;        // x.fetch_sub(1)
x += 1;     // x.fetch_add(1)
x -= 1;     // x.fetch_sub(1)
x &= 1;     // x.fetch_and(1)
x |= 1;     // x.fetch_or(1)
x ^= 1;     // x.fetch_xor(1)


Enter fullscreen mode Exit fullscreen mode

The operators are shorthands for the member functions shown in the comments. While the operators are convenient, I always use the member functions to make it obvious in code that the variable being manipulated is a std::atomic.

Incidentally, this:



x = x + 1;  // not the same as either ++x or x += 1


Enter fullscreen mode Exit fullscreen mode

is not thread-safe because the read of x and the write to x are not atomic by meaning #3.

Lock-Free Atomics

Even though std::atomic<T> offers better performance over mutexes and locks, it’s not guaranteed to. For a type T on a specific CPU under specific conditions, is_lock_free() and is_always_lock_free() check. Those conditions include:

  • Accessing unaligned values, e.g., reading an int at a memory address that’s not evenly divisible by sizeof(int) (assuming the CPU can even do it at all without generating a bus error).

  • Runtime CPU dispatching, e.g., a particular program during initialization may detect whether the CPU supports the cmpxchg16b instruction and, if not, may be forced to use locks.

The only type that’s guaranteed to be always lock-free on all CPUs under all conditions is std::atomic_flag that’s a special-case of std::atomic<bool>. (An example using std::atomic_flag is shown later.)

Memory Barriers

Atomic (by any meaning) is not enough to ensure thread-safety because the order of memory operations does not necessarily match the order of statements in a program. Why not? Because the hardware, the compiler, or both, may reorder things to improve performance. The “as-if” rule holds only from the perspective of a single thread.

Memory barriers (aka, “fences”) help ensure thread-safety by selectively prohibiting reordering of memory operations across the barrier. They also provide some synchronization among threads. C++ provides std::memory_order.

The sometimes confusing thing about memory barriers is that they’re about controlling the order of operations, not the operations themselves.

memory_order_seq_cst: Sequential Consistency

This is the safest memory order which is why it’s the default. For example, the signature of std::atomic<T>::fetch_add() is:



T fetch_add( T arg, std::memory_order order
                    = std::memory_order_seq_cst ) noexcept;


Enter fullscreen mode Exit fullscreen mode

It’s also the least efficient because it establishes a global memory ordering (bottleneck) across all threads. In specific cases, more efficient memory orders can be used.

memory_order_relaxed: No Ordering

This is the least safe memory order in that it does not guarantee operation order or synchronization, but still guarantees modification order. What good is that? A typical use-case is incrementing reference counts:



template<typename T>
class shared_ptr {
public:
    // ...
private:
    struct counted_obj {
        // ...
        std::atomic<size_t> _count;
        T _obj;
    };

    counted_obj *const _co;

    void _inc_ref() noexcept {
        _co->_count.fetch_add( 1, std::memory_order_relaxed );
    }

    // ...
};


Enter fullscreen mode Exit fullscreen mode

At this point, you might ask something along the lines of:

“If there’s no synchronization, how can this possibly work? Couldn’t thread 1 increment the value from N to N+1 and thread 2 also increment the old value from N to N+1?”

The answer is no for two reasons:

  1. Each increment is still atomic (by meanings #1 and #2) so each thread always sees the latest value. What memory_order_relaxed does is allow the hardware or the compiler to reorder other operations having nothing to do with _count either before or after its increment to improve overall performance, not the performance of _count specifically.

  2. Because _count is part of a private data structure, it’s guaranteed that no actions are ever conditionally taken by another thread based on its current value. Said another way, when only performing an increment and not altering the control flow in any thread, then use of memory_order_relaxed is safe.

An example of when memory_order_relaxed is not safe is when decrementing reference counts. (More later.)

Now that you hopefully understand memory_order_relaxed, you should never use it unless you can prove your use of it is correct and it actually significantly improves performance. Correct use of memory_order_relaxed is very hard to do.

memory_order_acquire & memory_order_release

These memory orders are safer and always used in pairs:

  • memory_order_release (used with store()) is used to “publish” information: no accesses can be reordered after.

  • memory_order_acquire (used with load()) is used to “subscribe” to information: no accesses can be reordered before.

In this example, shared_data is shared among two threads and data_ready that acts as a semaphore:



int shared_data;
std::atomic<int> data_ready{ 0 };

void producer() {
    shared_data = 42;
    data_ready.store( 1, std::memory_order_release );
}

void consumer() {
    while ( data_ready.load( std::memory_order_acquire ) == 0 )
        ;
    assert( shared_data == 42 );  // always true
}

int main() {
    std::thread t1{ producer };
    std::thread t2{ consumer };
    t1.join();
    t2.join();
}


Enter fullscreen mode Exit fullscreen mode
  • The producer sets shared_data then signals that the data is ready (or “publishes” it). The write to shared_data “happens before” the write to data_ready because the use of memory_order_release guarantees that the write to shared_data can not be reordered after data_ready.

  • Meanwhile, the consumer busy waits for the signal on data_ready, then safely reads shared_data. The read from data_ready “happens before” the read from shared_data because the use of memory_order_acquire guarantees that the read from shared_data can not be reordered before data_ready.

Things to note:

  • shared_data itself need not be atomic. (This is particularly useful for data that can not be made atomic either because it’s too big or is not trivially copyable.)
  • You can set any amount of data, then “publish” all of it simultaneously.

At this point, you might ask whether busy waiting is efficient. In general, it’s efficient when the waits are short — shorter than the time it would take to lock and unlock a mutex.

A general spinlock (a kind of busy waiting) can be implemented using std::atomic_flag:



std::atomic_flag mutex = ATOMIC_FLAG_INIT;

class spin_lock {
public:
    explicit spin_lock( std::atomic_flag *mutex ) noexcept :
        _mutex{ mutex }
    {
        while ( !_mutex->test_and_set( std::memory_order_acquire ) )
           ;
    }
    ~spin_lock() noexcept {
        _mutex->clear( std::memory_order_release );
    }
private:
    std::atomic_flag *const _mutex;
};


Enter fullscreen mode Exit fullscreen mode

memory_order_consume

This memory order is a special case of memory_order_acquire that allows operations to be dependency-ordered that can be more efficient on weakly ordered CPUs (ARM, PowerPC; but not x86). From the earlier example:



int shared_data;
std::atomic<int> data_ready{ 0 };


Enter fullscreen mode Exit fullscreen mode

There’s no actual dependency between shared_data and data_ready other than what’s in our minds — which the compiler has no way to know. To create an actual dependency that the compiler can use to keep things in dependency order, we can instead do:



std::atomic<int*> data_ready{ nullptr };
 
void producer() {
    static int shared_data;
    shared_data = 42;
    data_ready.store( &shared_data, std::memory_order_release );
}

void consumer() {
    while ( (data_ready.load( std::memory_order_consume )) == nullptr )
        ;
    assert( *data_ready == 42 );  // always true
}


Enter fullscreen mode Exit fullscreen mode

Now, we’ve created an actual dependency using a pointer and a pointed-to value in that the value of the pointer has to be loaded before the pointer is dereferenced. The compiler can preserve this order without using an additional and more expensive memory barrier.

memory_order_acq_rel

This memory order is memory_order_acquire and memory_order_release combined into one used for read-modify-write operations: no accesses can be reordered before or after. A typical use-case is decrementing reference counts:



template<typename T>
class shared_ptr {
public:
    // ...

    ~shared_ptr() noexcept {
        if ( _co->_count.fetch_sub( 1, std::memory_order_acq_rel ) == 1 )
            delete _co;
    }

    // ...
};


Enter fullscreen mode Exit fullscreen mode

The reason memory_order_relaxed can’t be used here is because a use of the reference counted object in another thread could be reordered to be after the delete happens resulting in undefined behavior.

std::atomic_thread_fence()

Using std::atomic with memory_order_relaxed is atomic without a barrier; using std::atomic_thread_fence() is a barrier without a specific atomic. It’s useful if you want to update several atomic values together. For example, instead of doing:



atomic<int> x, y;
x.store( 1, std::memory_order_release );
y.store( 2, std::memory_order_release );


Enter fullscreen mode Exit fullscreen mode

you can do:



x.store( 1, std::memory_order_relaxed );
y.store( 2, std::memory_order_relaxed );
std::atomic_thread_fence( std::memory_order_release );


Enter fullscreen mode Exit fullscreen mode

But there is another difference. Given:



std::atomic<int> sync{ 0 };
// (1) load & store operations
sync.store( 1, std::memory_order_release );
// (2) store operations


Enter fullscreen mode Exit fullscreen mode

No operations in (1) will be reordered after the store-release of sync into (2); however, store operations in (2) can be reordered before the release into (1).

std::atomic_thread_fence() imposes stronger ordering guarantees than an operation on an atomic with the same memory order. Given:



// (1) load & store operations
std::atomic_thread_fence( std::memory_order_release );
// (2) load operations
sync.store( 1, std::memory_order_relaxed );
// (3) store operations


Enter fullscreen mode Exit fullscreen mode

No operations in (1) will be reordered after all subsequent store operations into (3) and store operations in (3) can not be reordered before the fence into (1). However, since memory_order_release controls store operations only, load operations in (2) can be reordered before the release into (1).

The rules for memory_order_acquire are similar except it controls load operations only and enforces ordering in the opposite direction. The differences are subtle. In general, an acquire or release associated with a particular atomic is preferred.

Compare-and-Swap (CAS)

Compare-and-swap (CAS), as its name suggests, is both compare and swap operations done together atomically (by all atomic meanings). Conceptually, it’s implemented as:



template<typename T>
bool atomic<T>::compare_exchange( T &expected, T desired ) {
    if ( this->_value == expected ) {
        this->_value = desired;
        return true;
    }
    expected = this->_value;
    return false;
}


Enter fullscreen mode Exit fullscreen mode

except done atomically. The idea is that you check the value of an atomic variable and:

  • If it’s the value you expect (or hope for), then (and only then) set it to a new desired value; or:
  • If it’s not the value you expect (or hope for), it means some other thread changed the value, so do nothing. You are free to reattempt setting the value.

There are actually two flavors: compare_exchange_weak() and compare_exchange_strong(). (More on the difference later.)

I don’t know why the C++ committee named these functions with “compare exchange” instead of the conventional “compare and swap.” I’ll continue to use the “CAS” abbreviation regardless.

For example, we can reimplement the previous spin_lock class using CAS:



std::atomic<bool> mutex;

class spin_lock {
public:
    explicit spin_lock( std::atomic<bool> *mutex ) noexcept :
        _mutex{ mutex }
    {
        bool flag = false;
        while ( !_mutex->compare_exchange_weak( flag, true ) )
            flag = false;  // must reset
    }

    ~spin_lock() noexcept {
        _mutex->store( false );
    }

private:
    std::atomic<bool> *const _mutex;
};


Enter fullscreen mode Exit fullscreen mode

Using CAS, we check the value of _mutex:

  • If it’s false, it means it’s currently unlocked, so set it to true to indicate it’s now locked and return true.

  • If it’s true, it means it’s currently locked (by another thread), so do not set the value and return false.

    Note that in this case, the first parameter of flag is an in/out parameter and it’s set to the atomic’s current value (here, true) even though we don’t care. This forces us to reset flag to false prior to another attempt.

Lock-Free Operations

One of the primary use-cases for CAS is that it allows you to implement lock-free operations on data structures. For example, part of a lock-free stack implementation might be:



template<typename T>
class stack {
public:
    struct node {
       node *_next;
       T _data;
       explicit node( T &&data ) : _data{ std::move( data ) } { }
    };

    void push( T &&data ) {
        node<T> *new_node = new node<T>{ std::move( data ) };
        new_node->_next = _head.load( std::memory_order_relaxed );
 
        while ( !_head.compare_exchange_weak(
            new_node->_next, new_node,
            std::memory_order_release,
            std::memory_order_relaxed ) );
    }

private:
    std::atomic<node<T>*> _head;
};


Enter fullscreen mode Exit fullscreen mode

After a new node is created, we try to update _head:

  • If it’s still equal to new_node->_next (the original value of _head), update _head to point to new_node using the memory order specified by the third argument.

  • If it’s not equal, it means another thread snuck in and updated _head to point to different new node, so do nothing and reattempt. (Note that new_node->_next has been updated to be the new _head pointing to the different node using the memory order specified by the fourth argument.)

compare_exchange_weak() vs compare_exchange_strong()

The existence of compare_exchange_weak() implies there’s a compare_exchange_strong() — and there is. The difference between them is:

  • compare_exchange_strong() fails (returns false) only if the current value is not the expected value.

  • compare_exchange_weak() may also fail if there’s a “spurious failure.”

A “spurious failure” is quirk on weakly ordered CPUs (e.g., ARM and PowerPC; but not x86) where the operation fails for reasons other than the value not being the expected value.

So if compare_exchange_strong() never fails spuriously, why does compare_exchange_weak() exist? Before answering that question, let’s look at a conceptual implementation of them both:



enum class cas_result {
    EQUAL,
    NOT_EQUAL,
    SPURIOUS_FAILURE
};

template<typename T>
cas_result cas_weak_impl( atomic<T> *p, T &expected, T desired );

template<typename T> 
bool atomic<T>::compare_exchange_weak( T &expected, T desired ) {
    return cas_weak_impl( this, expected, desired ) == cas_result::EQUAL;
}

template<typename T>
bool atomic<T>::compare_exchange_strong( T &expected, T desired ) {
    cas_result cr;
    do {
        cr = cas_weak_impl( this, expected, desired );
    } while ( cr == cas_result::SPURIOUS_FAILURE );
    return cr == cas_result::EQUAL;
}


Enter fullscreen mode Exit fullscreen mode

Assume there’s only cas_weak_impl() that implements a weak version of CAS:

  • compare_exchange_weak() is a simple wrapper around cas_weak_impl().

  • compare_exchange_strong(), however, wraps it with a loop that effectively filters out spurious failures. The important thing to remember here is that there’s a loop.

Given that, the benefits of compare_exchange_weak() are:

  1. Spurious failures tend not to happen all that often.

  2. When your code is using a loop anyway, the weak version will yield better performance on weakly ordered CPUs (ARM, PowerPC; but not x86 — but it’s no worse).

  3. Detects the ABA Problem (on weak CPUs) — more later.

So then why does compare_exchange_strong() exist?

  • If you have a loop only to filter out spurious failures, don’t: use compare_exchange_strong().

  • But if you have a loop anyway, use compare_exchange_weak().

  • However, if handling a spurious failure is expensive (for example, if you have to discard and reconstruct a new object), use compare_exchange_strong().

  • But compare_exchange_strong() doesn’t detect the ABA Problem (more later).

Given all that, you might wonder when would you ever not have a loop for compare_exchange_strong()? One example is implementing try_lock():



bool my_try_lock( std::atomic<bool> *mutex ) noexcept {
    bool flag = false;
    return mutex->compare_exchange_strong( flag, true );
}


Enter fullscreen mode Exit fullscreen mode

The “ABA Problem”

The ABA Problem can be illustrated as follows. Suppose thread 1 performs the following steps:

  1. Read a memory location (value is “A”).
  2. Do some work.
  3. Read the same memory location again (value is still “A”).
  4. Compare the original and recent values (they’re equal).
  5. Conclusion: nothing has changed.

The problem is suppose thread 2 performs the following steps while thread 1 is doing its step 2:

  1. Write “B” to the same memory location.
  2. Do some work.
  3. Write “A” to the same memory location.

By the time thread 1 does its step 3, it reads “A” and believes nothing has changed — even though it has. You might now ask:

“If the value of “A” is the same, why does it matter?”

The answer is: sometimes it doesn’t — but sometimes it does.

Consider the stack<T>::push() code from earlier (repeated here for convenience):



void push( T &&data ) {
    node<T> *new_node = new node<T>{ std::move( data ) };
    new_node->_next = _head.load( std::memory_order_relaxed );
 
    while ( !_head.compare_exchange_weak(
                new_node->_next, new_node,
                std::memory_order_release,
                std::memory_order_relaxed ) );
}


Enter fullscreen mode Exit fullscreen mode

and the following illustration:

ABA push

State (1) shows the initial conditions where A and B are nodes on the stack, _head (H) points to A, and new_node (N) has its _next also point to A. The dotted box contains _head and new_node->_next that are being compared. State (3) shows the desired final state where _head points to new_node and new_node->_next points to A.

But what if another thread sneaks in before the while loop is entered, pushes a new node X shown by state (2), then immediately pops it? Both _head and new_node->_next still point to A, so _head will be set to new_node — which is correct. In this case, the ABA Problem isn’t actually a problem.

But now consider what the code for stack<T>::pop() might be:



    T pop() {
        node<T> *first = _head.load( std::memory_order_relaxed );

        while ( first != nullptr &&
                !_head.compare_exchange_weak(
                    first, first->_next,
                    std::memory_order_release,
                    std::memory_order_relaxed ) );

        if ( first == nullptr )
            throw empty_stack_exception{};
        T ret_val{ std::move( first->_data ) };
        delete first;
        return ret_val;
    }


Enter fullscreen mode Exit fullscreen mode

and the following illustration:

ABA pop

State (1) shows the initial conditions where A, B, and C are nodes on the stack and _head (H) and first (F) both point to A. The dotted box contains _head and first that are being compared. State (5) shows the desired final state where _head points to B.

But what if another thread sneaks in before the while loop is entered, pops A and B shown by state (2), then pushes A shown by state (3)? Both _head and first still point to A, so _head will be set to first->_next — which is wrong! Why? Originally, first pointed to A whose _next pointed to B, hence first->_next (the desired argument in the compare) is B. But B was deleted in (2) so we’ll end up in state (4) with _head being a dangling pointer to B that was deleted. In this case, the ABA Problem is actually a problem! (It wasn’t a problem for push() because the desired value of new_node could never become stale.)

Even worse, there’s no easy fix for this. In this case, the problem is that part of the desired value expression can change. While first stays the same, what first->_next points to can change. Detecting ABA Problems is hard, even for experts.

How can this be fixed?

  1. Give up and just use a mutex and locks.
  2. Implement versioned pointers.
  3. Implement hazard pointers.

Versioned Pointers

A versioned pointer is an ordinary pointer plus an additional “version number” such that every time the value of the pointer changes, the version number is incremented. Conceptually, something like:



template<typename T>
class vers_ptr {
public:
    vers_ptr() noexcept : _ptr{ nullptr }, _vers{ 0 } { }
    vers_ptr( T *ptr ) noexcept : _ptr{ ptr }, _vers{ 1 } { }

    vers_ptr& operator=( T *ptr ) noexcept {
        set( ptr );
        return *this;
    }

    // ...

private:
    T *_ptr;
    uintptr_t _vers;  // must be sizeof(T*) for cmpxchg16b

    void set( T *ptr ) noexcept {
        _ptr = ptr;
        ++_vers;
    }
};

static_assert( sizeof( vers_ptr<void> ) == 16 );


Enter fullscreen mode Exit fullscreen mode

Then use vers_ptr<node> instead of node<T*>:



    struct node {
        vers_ptr<node> _next;
        // ...
    };

    T pop() {
        vers_ptr<node> first = _head.load( std::memory_order_relaxed );

        while ( first != nullptr &&
                !_head.compare_exchange_weak(
                    first, first->_next,
                    std::memory_order_release,
                    std::memory_order_relaxed ) ) ;
        // ...
    }

    // ...

    std::atomic<vers_ptr<node>> _head;


Enter fullscreen mode Exit fullscreen mode

This would work because, even though the _ptr part of first would still point to A, its _vers part would be different, so _head and first would not compare equal, so _head would not be set to B, the stale value of first->_next.

The caveats of vers_ptr, however, are:

  • This will work only if the CPU supports double-pointer-wide (16 bytes on a 64-bit CPU) CAS (which, for example, x86_64 does via the cmpxchg16b instruction).

  • As mentioned, a specialization of std::atomic<T> requires T to be trivially copyable — which is why operator=(vers_ptr<T> const&) can’t be implemented.

Cache Lines

For many CPUs, the L1 cache is “chunked” into cache lines typically ranging from 16-64K in size each. To read a given memory location from main memory into the cache, the location and the surrounding chunk-sized bytes are all read in together. Similarly, to write a given memory location from the cache into main memory, the entire cache line is written. For code that exhibits locality of reference, the chunking yields a performance gain. However, in some cases, it can yield a performance loss.

Consider a lock-free queue:



template<typename T>
class queue {
    // ...

    std::atomic<node*> _head;
    std::atomic<node*> _tail;
};


Enter fullscreen mode Exit fullscreen mode

Assume that your code has one thread pushing items onto the tail of the queue (repeatedly updating _tail) and a second thread popping items from the head of the queue (repeatedly updating _head).

In the code as given, _head and _tail will very likely reside on the same cache line. This means updating one will invalidate the entire cache line adding otherwise unnecessary contention for the other.

In a case such as this, you want _head and _tail to reside on different cache lines so updating one doesn’t affect the other. You can actually achieve that by using alignas and hardware_destructive_interference_size:



template<typename T>
class queue {
    // ...
    alignas(std::hardware_destructive_interference_size) std::atomic<node*> _head;
    alignas(std::hardware_destructive_interference_size) std::atomic<node*> _tail;
};


Enter fullscreen mode Exit fullscreen mode

This will waste a little bit of memory, but ensure that _head and _tail reside on different cache lines.

Conclusion

To summarize:

  • Your algorithm matters far more than locking technique.

  • Thread-safety in general and using atomics and barriers directly specifically is very hard to get right, even for experts.

  • Before using advanced thread-safety techniques, measure to see if your code is spending too much time locking. (Remember: results are CPU-dependent!)

  • Only if locking takes a significant percentage of time, then consider atomics and barriers.

  • Atomics and barriers are two sides of the same coin.

  • Compare-and-swap (CAS) is a fundamental building-block.

  • Be aware of the ABA Problem.

  • Be aware of cache lines.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .