
Deque Dojo Part 2: Taming the Out-of-Order Monster


Intro

A Note for Veterans: This post is a detailed, from-the-ground-up explanation of atomics and C++ memory ordering. If you are already comfortable with relaxed, acquire/release, and seq_cst semantics, you might want to skim this and jump straight to Part 3, where we will apply these concepts to build our lock-free deque. For everyone else, this is the essential foundation for everything that follows.

In Part 1, we built a correct but slow deque, guarded by a single, heavy lock. That lock was our “training wheels,” keeping us safe while we learned the basic movements. Now, it’s time to take them off.

Our goal is to replace the std::mutex with something far lighter: atomic operations. But this is not a simple swap. To do it safely, we must first understand the wild and chaotic world that the lock was protecting us from. We must meet the monster that lives at the heart of every modern CPU: the Out-of-Order Execution Monster.

The First Problem: The Race to the Whiteboard

Before we even get to the monster, let’s start with a simpler problem. Why is this code not thread-safe?

int g_counter = 0;

void increment() {
    g_counter++; // This is the danger zone
}

You might think g_counter++ is a single, indivisible operation. It is not. To the CPU, it’s a three-step dance:

  1. Read: Copy the current value of g_counter from memory into a CPU register.
  2. Modify: Add 1 to the value in the register.
  3. Write: Copy the new value from the register back to memory.

Now, imagine two threads, A and B, trying to increment the counter at the same time. The value starts at 0.

  1. Thread A reads 0 into its register.
  2. Thread B also reads 0 into its register. (Uh oh. A hasn’t written its result back yet.)
  3. Thread A adds 1 to its register (now 1) and writes 1 back to memory.
  4. Thread B adds 1 to its register (now 1) and writes 1 back to memory.

We performed two increments, but the final value is 1. We lost an update. This is a classic race condition.

The simple fix for this is to use std::atomic.

#include <atomic>
std::atomic&lt;int&gt; g_atomic_counter{0}; // brace-init: copy-init from int is ill-formed before C++17

void atomic_increment() {
    g_atomic_counter++; // Now thread-safe!
}

An std::atomic variable guarantees that its operations (like increment, load, store, exchange) are atomic, meaning they are indivisible. The read-modify-write dance happens as a single, unbreakable step from the perspective of all other threads. It’s like having a special whiteboard where only one person can write at a time, and their entire update happens instantly.

Problem solved, right? Can we just replace all our shared variables with atomics and be done?

Unfortunately, no. Atomicity only solves the race condition on a single variable. It does nothing to stop our real nemesis: the Out-of-Order Monster.

The Real Monster: The Mailbox Out of Order

Modern compilers and CPUs are obsessed with speed. To keep things moving, they will freely reorder your instructions if they think it won’t change the result for a single thread.

Imagine a simple, everyday task: mailing a letter. There are two steps with a clear, logical dependency:

  1. Put the letter in the mailbox.
  2. Raise the red flag on the side of the mailbox to signal to the mail carrier that there’s something to pick up.

Now, imagine you have a hyper-optimizing robot assistant. You tell it the two steps. The robot analyzes the plan and thinks, “Walking to the mailbox is slow. I can be more efficient! I’ll raise the flag first while my human is still writing the letter. That will save a few seconds!”

The result is a disaster. The mail carrier drives by, sees the flag is up, opens the mailbox, finds it empty, and drives away. Your letter never gets sent.

This is exactly what can happen with your code. The robot assistant is your compiler and CPU. Consider this common pattern for signaling that some data is ready:

// Shared between threads
int g_data = 0;         // The letter
bool g_is_ready = false; // The mailbox flag

// Thread 1 (Producer)
void produce() {
    g_data = 123;       // Step 1: Put the letter in the box
    g_is_ready = true;  // Step 2: Raise the flag
}

// Thread 2 (Consumer - The Mail Carrier)
void consume() {
    if (g_is_ready) {   // Step 3: Check if the flag is up
        // Step 4: Read the letter
        // We expect g_data to be 123 here!
        int local_data = g_data;
    }
}

The compiler or CPU on Thread 1 is perfectly allowed to reorder the instructions in produce(). It might decide to set g_is_ready = true before it sets g_data = 123.

If this happens, Thread 2 can see g_is_ready as true, jump into the if block, and read g_data before it has been written. It will read 0 instead of 123. This is a catastrophic bug, and it’s caused by the Out-of-Order Monster.

To tame this monster, we need to give the CPU and compiler strict rules. We need to build “fences” that they are not allowed to cross. These fences are called memory ordering constraints.

The Three Levels of Discipline: Our Fences

When we use atomic variables, we can specify a memory order. This tells the system what kind of fence to build around the operation. There are three main levels of discipline, from weakest to strongest.

Level 1: Relaxed (std::memory_order_relaxed)

This is the weakest level. It’s like telling your robot assistant, “I don’t care what order you do things in.”

When to use it: Only for things where ordering doesn’t matter, like simple event counting where you only care about the final total. For example, counting the number of times a button is clicked across all threads.

std::atomic&lt;int&gt; g_click_count{0};

void on_button_click() {
    // We just care that the count goes up, not WHEN it goes up
    // relative to anything else.
    g_click_count.fetch_add(1, std::memory_order_relaxed);
}

This is the cheapest memory order, but it’s also the most dangerous. If you’re not sure, don’t use it.

Level 2: Acquire and Release (The Handshake)

This is the most important and common memory order for coordinating work between threads. It’s a producer-consumer pattern.

Think of it as a pair of magic instructions for our mailbox:

  1. Release (the producer's store): “Finish everything you were doing, then raise the flag.” No write that appears before the release store may be moved after it.
  2. Acquire (the consumer's load): “If you see the flag up, everything done before it was raised is now visible to you.” No read that appears after the acquire load may be moved before it.

Together they create a one-way fence. A release store on an atomic synchronizes with an acquire load on the same atomic only if the acquire reads the value written by that release (or by its release sequence).

Let’s fix our mailbox problem using this handshake:

// Shared between threads
int g_data = 0; // The letter (plain, non-atomic)
std::atomic<bool> g_is_ready{false}; // The mailbox flag (atomic)

// Thread 1 (Producer)
void produce() {
    g_data = 123; // Step 1: Put the letter in the box (plain store)

    // Step 2: Raise the flag with RELEASE semantics.
    // All prior writes (like g_data) happen-before a matching acquire that
    // observes this store.
    g_is_ready.store(true, std::memory_order_release);
}

// Thread 2 (Consumer - The Mail Carrier)
void consume() {
    // ACQUIRE ensures that if we observe 'true' written by the release above,
    // then subsequent reads see the producer's prior writes (including g_data).
    if (g_is_ready.load(std::memory_order_acquire)) {
        int local_data = g_data; // Guaranteed to see 123
    }
}

The release on g_is_ready ensures that the write to g_data happens before the flag is set. The acquire ensures that if we see the flag as true, we are also guaranteed to see the correct value of g_data. We have successfully tamed the monster.

This is the workhorse of lock-free programming and will be the key to our deque.

Level 3: Sequentially Consistent (std::memory_order_seq_cst)

This is the strongest, simplest, and most expensive level of discipline. It’s like having a town crier.

When the town crier makes an announcement, everyone in the town hears it, and they all agree on the order in which the announcements were made. There is no ambiguity.

This is the default memory order for all std::atomic operations if you don’t specify one. It’s the easiest to reason about because it prevents almost all surprising reorderings. However, it’s also the most expensive because it requires the CPU to coordinate heavily with all other cores to maintain this global timeline.

When to use it: When you have complex interactions between multiple threads that aren’t a simple producer-consumer pattern, or when you need to resolve a tricky race condition. As we’ll see in Part 3, we will need exactly one seq_cst operation to solve the “last-item race” in our deque, where an owner and a thief might try to grab the final task at the exact same time.

What’s Next: Applying Our New Knowledge

We’ve met the monster and learned the three magic words to tame it: relaxed, acquire/release, and seq_cst.

With this new knowledge, we are finally ready to take the training wheels off our deque. In Part 3, we will return to our locked code and, function by function, replace the std::mutex with carefully chosen atomic operations and memory orders. We will build a deque that is not only correct, but blazingly fast.

