Deque Dojo Part 2: Taming the Out-of-Order Monster

Part 1 Part 2 Part 3 Part 4

In Part 1, we built a correct but slow deque, guarded by a single, heavy lock. That lock was our shield. It protected us from a wild world of concurrency hazards. To remove that lock, we must first understand the monster it was caging: the Out of Order Execution Monster.

Open Table of contents

The Real Monster: A Tale of Two Kitchens

The Real Monster: A Tale of Two Kitchens

Modern compilers and CPUs are obsessed with speed. To achieve it, they will freely reorder your instructions if they believe the final result will be the same for a single thread. To understand why this is both a brilliant optimization and a terrible danger, let’s visit two kitchens.

Scenario 1: The Single Threaded Food Truck

Chef Solo runs a food truck by himself. He is the chef, the cashier, and the waiter. He gets an order for a toasted sandwich and a soda.

Toast the bread. (Takes 15 seconds).
Pour the soda. (Takes 5 seconds).
Assemble the sandwich and bag the order.

Chef Solo’s brain, a natural optimizer, realizes something. While the bread is in the toaster for 15 seconds, he can pour the 5 second soda. He reorders his tasks:

Put bread in toaster.
Pour the soda.
Get toast, assemble sandwich, and bag the order.

The result is the same: one customer gets one sandwich and one soda. The total time is shorter. For a single worker handling all steps of a process, reordering independent tasks is a safe and effective optimization. This is exactly what a CPU does for a single thread.

Scenario 2: The Multi Threaded Restaurant

Now let’s go to a busy restaurant with a Chef (Producer Thread) and a Waiter (Consumer Thread). The Waiter is in the dining room, and it takes them 10 seconds to walk to the kitchen pass. A ticket comes in for a 15 second toasted sandwich. The process is:

Chef makes the sandwich (the “payload,” 15 seconds).
Chef rings the “Order Up!” bell (the “flag”).
Waiter hears the bell, walks 10 seconds, and picks up the food.

The Chef’s CPU brain sees the 10 second walk as wasted time and spots an “optimization”: “I can ring the bell now. The Waiter will start their 10 second walk. During that time, I can make most of the 15 second sandwich. I will hide the latency of the walk!”

This is the reordering:

Chef rings the bell.
Waiter in the dining room starts walking (10 second timer starts).
Chef starts making the sandwich (15 second timer starts).

The Nightmare Scenario: The Waiter, an independent thread of execution, arrives at the pass after 10 seconds. The Chef is still assembling the sandwich, with 5 seconds of work left. But the bell, the synchronization signal, has rung. The Waiter’s only job is to trust the bell, pick up whatever is on the pass, and deliver it. They grab the empty plate that was pre-positioned, walk back to the dining room, and serve it to the customer with a flourish. The system has failed catastrophically. The very same logic of reordering tasks that was safe and efficient in the food truck has now led to a complete breakdown in the restaurant.

Let’s explore the three levels of discipline we can enforce to prevent this.

Level 1: Relaxed

This is the weakest and fastest level. It makes one simple promise: the operation itself will be atomic. It will not be “torn” halfway through.

The Problem: A Simple Counter Imagine a non atomic, shared counter.

// INCORRECT CODE: This is broken!
long long g_counter = 0;

void increment_counter() {
    // A single C++ line can be multiple machine instructions.
    // This is a "read modify write" operation.
    g_counter++; 
}

That innocent looking g_counter++ is a trap. On the machine level, it is actually three steps:

Read: Load the current value of g_counter from memory into a CPU register.
Modify: Increment the value in the register.
Write: Store the new value from the register back into memory.

The Nightmare Scenario Let’s see how this explodes on a true multi-core CPU, with Thread A on Core 1 and Thread B on Core 2.

Initial State: g_counter is 0 in shared memory.
Core 1 (Thread A) executes step 1: It reads 0 from memory into its private register.
At the same instant, before Core 1 has a chance to write anything back, Core 2 (Thread B) also executes step 1: It also reads 0 from memory into its own private register.
Core 1 executes step 2: It increments its register to 1.
Core 2 executes step 2: It also increments its register to 1.
Core 1 executes step 3: It writes 1 back to g_counter. The shared value is now 1.
Core 2 executes step 3: It also writes 1 back to g_counter, overwriting the value that Core 1 just wrote.

Result: Two threads performed an increment, but the final value is 1. We lost an increment. This is a classic race condition.

The Solution: memory_order_relaxed std::atomic with relaxed ordering solves this specific problem.

// CORRECT (for atomicity)
std::atomic<long long> g_atomic_counter = 0;

void increment_counter() {
    // fetch_add guarantees the read modify write is one,
    // indivisible (atomic) operation.
    g_atomic_counter.fetch_add(1, std::memory_order_relaxed);
}

This guarantees that the read modify write sequence can never be interleaved by another thread. It solves the data race. However, relaxed provides no ordering guarantees. The compiler is still free to reorder this atomic increment relative to any other code around it. It is only useful when you care about the final tally, but not about when each increment happens relative to other events.

Level 2: Acquire and Release

This is the workhorse of lock free programming. It solves our restaurant kitchen problem.

The Problem: The Empty Plate Let’s code our chef scenario and see how it breaks.

// INCORRECT CODE: This is broken!
std::string g_sandwich;
std::atomic<bool> g_bell_rung = false;

// Chef Thread (Producer)
void prepare_order() {
    g_sandwich = "Toasted Sandwich"; // The payload
    // A relaxed store has no ordering guarantees.
    // The CPU is free to reorder this to happen BEFORE the sandwich is ready.
    g_bell_rung.store(true, std::memory_order_relaxed); // The flag
}

// Waiter Thread (Consumer)
void serve_order() {
    if (g_bell_rung.load(std::memory_order_relaxed)) {
        // We heard the bell, but the sandwich may not be on the plate yet!
        // This could print garbage or an empty string.
        std::cout << g_sandwich << std::endl; 
    }
}

The Nightmare Scenario

The compiler or CPU sees prepare_order and decides to reorder the instructions for efficiency.
Chef Thread executes g_bell_rung.store(true, ...) first. The bell is now rung.
The operating system can pause the Chef thread at any moment, or it can simply continue on another core.
Waiter Thread, running on a different core, sees g_bell_rung is true.
Waiter Thread proceeds to read g_sandwich. Since the write has not happened yet, it reads an empty string and serves an empty plate.

The Solution: The Acquire-Release Handshake We enforce the correct order with release (for the producer/chef) and acquire (for the consumer/waiter).

std::memory_order_release: A barrier. All memory reads and writes in the code before this line are guaranteed to be completed before this line. It “publishes” the changes.
std::memory_order_acquire: A barrier. If this read sees a value written by a release, then all the memory operations that happened before that release are now guaranteed to be visible to the code that comes after this line.

// CORRECT
std::string g_sandwich;
std::atomic<bool> g_bell_rung = false;

// Chef Thread
void prepare_order() {
    g_sandwich = "Toasted Sandwich";
    // The release fence guarantees the write to g_sandwich
    // CANNOT be reordered past this point.
    g_bell_rung.store(true, std::memory_order_release);
}

// Waiter Thread
void serve_order() {
    // The acquire fence guarantees that if we see 'true',
    // we also see everything that happened before the release store.
    if (g_bell_rung.load(std::memory_order_acquire)) {
        // This is now guaranteed to print "Toasted Sandwich"
        std::cout << g_sandwich << std::endl;
    }
}

This pairing creates a “synchronizes with” relationship. The sandwich is now guaranteed to be on the plate before the bell is rung.

Level 3: Sequentially Consistent

Acquire release is perfect for producer consumer scenarios. But what about when threads need to agree with each other?

The Problem: A Lock Free Mutex Let’s try to implement a simple mutex using Dekker’s algorithm. Two threads want to enter a critical section. Each will set its own flag to say “I want to enter,” and then check the other’s flag.

// INCORRECT CODE: This is broken!
std::atomic<bool> flag1 = {false};
std::atomic<bool> flag2 = {false};
int shared_resource = 0;

// Thread 1
void work1() {
    // I want to enter
    flag1.store(true, std::memory_order_release);
    // Did the other thread want to enter?
    if (!flag2.load(std::memory_order_acquire)) {
        // Critical Section
        shared_resource++;
    }
    flag1.store(false, std::memory_order_release);
}

// Thread 2
void work2() {
    // I want to enter
    flag2.store(true, std::memory_order_release);
    // Did the other thread want to enter?
    if (!flag1.load(std::memory_order_acquire)) {
        // Critical Section
        shared_resource++;
    }
    flag2.store(false, std::memory_order_release);
}

The Nightmare Scenario The acquire release pairing only synchronizes a store with the load that reads from it. It says nothing about the order of writes to different atomic variables.

Initial State: flag1 and flag2 are false.
Thread 1, on Core 1, executes flag1.store(true, ...)
Thread 2, on Core 2, executes flag2.store(true, ...)
Crucially, there is no guarantee that the change from Core 2 is visible to Core 1 yet, and vice versa. Memory updates take time to propagate between cores.
Thread 1 now executes flag2.load(...). It has not seen the update from Thread 2 yet, so it reads false. It enters the critical section.
Thread 2 now executes flag1.load(...). It has not seen the update from Thread 1 yet, so it also reads false. It also enters the critical section.

Result: Both threads are inside the critical section at the same time. Our mutex has failed completely.

The Solution: memory_order_seq_cst seq_cst creates a single, global timeline for all seq_cst operations. All threads will agree on the order of these operations, no matter which variable they touch.

// CORRECT
std::atomic<bool> flag1 = {false};
std::atomic<bool> flag2 = {false};
int shared_resource = 0;

// Thread 1
void work1() {
    flag1.store(true, std::memory_order_seq_cst);
    if (!flag2.load(std::memory_order_seq_cst)) {
        shared_resource++;
    }
    flag1.store(false, std::memory_order_seq_cst);
}

// Thread 2
void work2() {
    flag2.store(true, std::memory_order_seq_cst);
    if (!flag1.load(std::memory_order_seq_cst)) {
        shared_resource++;
    }
    flag2.store(false, std::memory_order_seq_cst);
}

Now, because of seq_cst, there is a single, globally agreed upon order for the two store(true) operations. One of them must come first in this timeline.

If flag1.store(true) comes first in the global order…
…then when Thread 2 executes flag1.load(seq_cst), it is guaranteed to see true and will not enter the critical section.

This power comes at a cost. seq_cst often requires more expensive CPU instructions (full memory fences). It is the “sledgehammer” of memory ordering, used when the surgical precision of acquire release is not enough to enforce the required guarantees between multiple independent threads.

With these three tools, relaxed, acquire/release, and seq_cst, we are now equipped to take the training wheels off our deque. In Part 3, we will use them to build the owner’s fast, lock free path.