In Part 1, we built a correct but slow deque using a standard std::mutex. The lock is a blunt instrument. It serializes access and prevents the CPU and compiler from playing tricks on us.
To remove the lock, we have to deal with those tricks directly. We need to understand Memory Ordering.
The Illusion of Sequential Execution
We tend to think of our code as a strict sequence of instructions: “Do A, then do B, then do C.”
But modern hardware is a liar. Both the compiler and the CPU will aggressively reorder your instructions to squeeze out performance.
- Compiler Reordering: The compiler might notice that “Task A” (loading a value) and “Task B” (calculating a sum) are independent. It might swap them to better utilize CPU registers or instruction pipelines.
- CPU Reordering: The CPU has its own bag of tricks, primarily Out-of-Order Execution. Modern cores have a pool of instructions. If the CPU sees a “read” that doesn’t depend on the previous “write,” it will just run the read first while the write is still waiting for a free slot. It doesn’t care about your source code order. It also employs other complex buffering and caching mechanisms that can delay when writes become visible to other threads.
To the CPU, the instruction stream isn’t a list. It is a dependency graph. If B doesn’t depend on A, the CPU will execute them in whatever order completes fastest.
For a single thread, this is invisible. The hardware guarantees as-if-serial semantics: the result is guaranteed to be the same as if it ran sequentially. But when you have multiple threads looking at the same memory, the illusion breaks. Thread B might see the result of the “sum” calculation before it sees the “load” that supposedly happened before it.
Level 1: Relaxed Ordering (memory_order_relaxed)
To control this chaos, C++ provides different levels of memory ordering. The most basic one is Relaxed Ordering.
This is the weakest guarantee, but it’s not “random.” It guarantees atomicity (no torn reads/writes) and modification order consistency.
- Atomicity: You will never see half a write.
- Modification Order: All threads will agree on the order of updates to this specific variable. If Thread A writes 5 and then 10 to x, Thread B will never see 10 and then 5.
However, it says nothing about the order of surrounding instructions or other variables. A relaxed store can be reordered freely with other relaxed stores to different addresses.
std::atomic<int> counter{0};
counter.fetch_add(1, std::memory_order_relaxed);
Use this for simple counters or statistics where you don’t care if other threads see the updates in a specific order relative to other data.
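As a concrete sketch of that advice (the helper name `count_hits` is ours, not from the series), here is a counter shared by several threads. Relaxed ordering is enough because we only care about the final total, not about when each increment becomes visible relative to other data:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Spawn `num_threads` threads that each bump the counter `iters` times.
// fetch_add is atomic regardless of ordering, so no increments are lost;
// relaxed just means we promise nothing about surrounding memory.
int count_hits(int num_threads, int iters) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iters; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return counter.load(std::memory_order_relaxed);
}
```

After all threads join, the total is deterministic even though the interleaving is not.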
The Producer-Consumer Problem
Now that we know about relaxed ordering, let’s look at a classic example where using it incorrectly breaks things. We have a “Producer” thread that prepares some data and sets a flag, and a “Consumer” thread that waits for the flag.
We’ll use std::atomic with memory_order_relaxed. This makes the code data-race free (it’s valid C++), but it is still incorrect because it doesn’t enforce order. This proves that “making it atomic” isn’t enough.
// INCORRECT CODE: This is broken!
#include <atomic>
#include <iostream>
#include <string>

std::string g_payload;
std::atomic<bool> g_ready{false};

// Producer Thread
void prepare_data() {
    g_payload = "Important Data";                   // 1. Write payload
    g_ready.store(true, std::memory_order_relaxed); // 2. Set flag
}

// Consumer Thread
void process_data() {
    if (g_ready.load(std::memory_order_relaxed)) {
        // 3. Read payload
        std::cout << g_payload << std::endl;
    }
}
The Bug: The compiler or CPU might decide to reorder steps 1 and 2 in the Producer. Why? Maybe the memory slot for g_ready is hot in the cache, while g_payload is cold. It’s faster to write the flag first.
If that happens, the Consumer sees g_ready is true, enters the if block, and reads g_payload before the data has been written. It prints garbage (or crashes).
To fix this, we need to enforce an order.
Level 2: Acquire and Release (memory_order_acquire / memory_order_release)
This is the workhorse of lock-free programming. It creates a synchronization point between two threads.
- Release (Producer): “Make sure all my previous writes are visible before you make this write visible.”
- Acquire (Consumer): “If I see the value from that Release, make sure I can also see all the writes that happened before it.”
Let’s fix our Producer-Consumer example:
// CORRECT
std::string g_payload;
std::atomic<bool> g_ready{false};

// Producer Thread
void prepare_data() {
    g_payload = "Important Data";
    // RELEASE: Prevents the write to g_payload from floating down past this point.
    g_ready.store(true, std::memory_order_release);
}

// Consumer Thread
void process_data() {
    // ACQUIRE: Prevents the read of g_payload from floating up past this point.
    if (g_ready.load(std::memory_order_acquire)) {
        std::cout << g_payload << std::endl;
    }
}
This establishes a “happens-before” relationship: if the consumer’s acquire load observes the value written by the producer’s release store, it is guaranteed to also see every write that preceded that store, including the write to g_payload.
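A minimal end-to-end sketch of this handoff (the `run_handoff` wrapper and spin loop are ours, added so the example is self-contained): the consumer spins until it observes the release store, at which point the acquire load guarantees the payload is visible:

```cpp
#include <atomic>
#include <string>
#include <thread>

// Run one producer/consumer handoff and return what the consumer saw.
std::string run_handoff() {
    std::string payload;
    std::atomic<bool> ready{false};
    std::string result;

    std::thread producer([&] {
        payload = "Important Data";                   // plain, non-atomic write
        ready.store(true, std::memory_order_release); // publish it
    });
    std::thread consumer([&] {
        // Spin until the flag is observed. The acquire load synchronizes
        // with the release store, so the payload write is visible here.
        while (!ready.load(std::memory_order_acquire)) {}
        result = payload; // safe: happens-before is established
    });
    producer.join();
    consumer.join();
    return result;
}
```

A real consumer would block or back off instead of busy-spinning; the loop is only there to make the synchronization point explicit.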
Level 3: Sequentially Consistent (memory_order_seq_cst)
This is the default for std::atomic if you don’t specify an order. It is the strongest and most expensive guarantee.
seq_cst enforces a single, global timeline that all threads agree on. With acquire/release, Thread A might see Event X before Event Y, while Thread B sees Y before X. With seq_cst, everyone sees X and Y in the exact same order.
You typically only need this when you have multiple variables that need to be synchronized across multiple threads in a complex way. For our deque, we will mostly use acquire/release, but we’ll need seq_cst for one critical race condition.
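The classic illustration of that difference is the “store buffering” litmus test (the function names below are ours). Each thread writes its own flag, then reads the other’s. With acquire/release alone, a run where both loads return false is permitted; with seq_cst, all four operations fall on one global timeline, so at least one thread must observe the other’s store:

```cpp
#include <atomic>
#include <thread>
#include <utility>

// One round of the store-buffering test. Returns what each thread's load saw.
// With seq_cst the outcome {false, false} is forbidden: whichever store is
// later in the single global order must be visible to the other thread's load.
std::pair<bool, bool> store_buffer_round() {
    std::atomic<bool> x{false}, y{false};
    bool ra = false, rb = false;

    std::thread a([&] {
        x.store(true, std::memory_order_seq_cst);
        ra = y.load(std::memory_order_seq_cst);
    });
    std::thread b([&] {
        y.store(true, std::memory_order_seq_cst);
        rb = x.load(std::memory_order_seq_cst);
    });
    a.join();
    b.join();
    return {ra, rb}; // seq_cst guarantees at least one of these is true
}
```

If you downgrade the four operations to acquire/release, the guarantee disappears, which is exactly the kind of cross-variable race the deque in this series will need seq_cst for.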
Summary
- Relaxed: “Just do it atomically. I don’t care about order.”
- Acquire/Release: “I’m passing data to you. Make sure you see the data if you see the flag.”
- Seq Cst: “Everyone must agree on the global order of events.”
In Part 3, we’ll apply these tools to build the “Owner” side of our lock-free deque.