
Work-Stealing Deque Part 1: The Problem with Locks


The Shape of Work

When we talk about parallelizing a game engine or a high-performance server, we aren’t just talking about running a list of identical items. Real-world workloads are dynamic. A physics simulation step spawns collision checks, which spawn contact resolution tasks, which might spawn particle effects.

The workload isn’t a flat list. It’s a tree that grows and shrinks unpredictably.

[Diagram: The Task Tree in Action]

In the diagram above, the Root task spawns children A, B, and C. Task A might finish quickly, while C explodes into a massive subtree of sub-tasks (C1, C2…). If we statically assign these tasks to threads, one core will be drowning in work while the others sit idle. We need a system that balances this load dynamically.

The Naive Approach: The Central Queue

The most obvious solution is often the first performance trap we fall into: the single shared queue.

You create one global std::queue protected by a std::mutex. All threads push generated tasks into it, and all threads pull tasks from it. It sounds fair. It sounds simple. And for a small number of threads, it works fine.

But as you scale up, this central queue becomes a bottleneck. Every single thread must fight for the same lock to do anything. It doesn’t matter if you have 4 cores or 64 cores. If they all have to pass through a single synchronization point, you are effectively serializing your parallel program at its busiest junction.
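To make the trap concrete, here is a minimal sketch of the central-queue design (the class name CentralQueue is mine, not a standard type): every push and every pop, on every thread, funnels through the same mutex.

```cpp
#include <cassert>
#include <mutex>
#include <optional>
#include <queue>

// The naive shared queue: one pile of work, one lock guarding it.
template <typename T>
class CentralQueue {
public:
    void push(T item) {
        std::lock_guard<std::mutex> lock(m_mtx); // every producer contends here
        m_queue.push(std::move(item));
    }

    std::optional<T> pop() {
        std::lock_guard<std::mutex> lock(m_mtx); // ...and every consumer too
        if (m_queue.empty()) return std::nullopt;
        T item = std::move(m_queue.front());
        m_queue.pop();
        return item;
    }

private:
    std::queue<T> m_queue;
    std::mutex m_mtx;
};
```

The code is correct, but notice that there is exactly one m_mtx for the whole system: with 64 threads, 63 of them can be waiting at any moment.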

The Turnstile Issue

Think of it like a subway station with only one turnstile. It doesn’t matter how fast the trains are or how many passengers there are. If everyone has to squeeze through that one gate one by one, the system grinds to a halt.

The Solution: Work Stealing

To fix this, we need to decentralize. Instead of one global pile of work, we give every worker thread its own local Deque (Double-Ended Queue).

When a thread spawns a new task (like UpdateSceneGraph spawning UpdateNode), it pushes it onto its own deque. When it needs work, it pops from its own deque. No locks, no contention, no fighting with other threads. It stays in its own lane.

But what happens when a thread runs out of work? This is where the “Stealing” comes in. An idle thread becomes a Thief. It looks at other threads’ deques and “steals” work from them.

[Diagram: The Deque]

As shown in the diagram, the deque has two ends:

  1. The Bottom: Accessed only by the Owner (the thread that owns this deque). It pushes and pops from here. This acts as a LIFO (Last-In, First-Out) stack. This is crucial for cache locality because the task you just spawned is likely still hot in your L1 cache.
  2. The Top: Accessed by Thieves. They steal from here. This acts as a FIFO (First-In, First-Out) queue. By stealing the oldest item (which is usually the root of a large subtree), the thief grabs a big chunk of work, minimizing the need to come back and steal again.
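The two access patterns can be sketched with a plain std::deque, ignoring concurrency entirely for now (the helper names owner_push, owner_pop, and thief_steal are mine): the owner works at the back, the thief takes from the front.

```cpp
#include <cassert>
#include <deque>

// Ordering only -- no synchronization here. The owner treats the
// bottom (back) as a LIFO stack; thieves treat the top (front) as a FIFO queue.
inline void owner_push(std::deque<int>& dq, int task) { dq.push_back(task); }

inline int owner_pop(std::deque<int>& dq) {
    int t = dq.back();   // newest task, likely still hot in cache
    dq.pop_back();
    return t;
}

inline int thief_steal(std::deque<int>& dq) {
    int t = dq.front();  // oldest task, likely the root of a big subtree
    dq.pop_front();
    return t;
}
```

If the owner pushes tasks 1, 2, 3, the owner pops 3 (LIFO) while a thief would steal 1 (FIFO).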

The Hidden Enemy: False Sharing

Even if we give every thread its own deque and avoid locks, we can still kill performance if we aren’t careful about memory layout.

Modern CPUs don’t read single bytes; they read chunks of memory called cache lines (typically 64 bytes). If two variables sit next to each other in memory, they will likely end up on the same cache line.

[Diagram: False Sharing]

Look at the diagram above. Core 0 (the Owner) wants to write to m_bottom. Core 1 (a Thief) wants to write to m_top. These two variables are part of the same deque object and naturally sit right next to each other in memory. If they share a 64-byte cache line, the cores will fight over ownership of that line.

When Core 0 writes to m_bottom, it needs exclusive ownership of the cache line. It sends a message to all other cores: “Invalidate your copies of this address!” Core 1’s cache line is marked invalid.

A nanosecond later, Core 1 wants to write to m_top. It finds its cache line is invalid (a “cache miss”). It cannot proceed until it gets the latest version of that cache line from Core 0. The data has to physically travel between the cores. Core 1 gets the data, modifies it, and invalidates Core 0.

This “ping-pong” effect is devastating. It forces the CPU cores to constantly synchronize and transfer data for every single atomic operation, which is orders of magnitude slower than working in their local L1 cache. This is known as False Sharing. The variables aren’t logically shared (one is for the owner, one for the thief), but they are physically shared because they live on the same cache line.

To fix this, we need to ensure that our critical atomic indices (m_top and m_bottom) never reside on the same cache line. We’ll tackle this with cache alignment and padding when we build the high-performance lock-free version in Part 3. For now, just keep in mind that memory layout matters as much as your code logic.
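As a preview of that fix, here is a sketch of the padding technique (the struct name PaddedIndices is mine): alignas pushes each index onto its own cache line. I'm assuming a 64-byte line here; C++17's std::hardware_destructive_interference_size in &lt;new&gt; reports the real value where the standard library provides it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Force m_top and m_bottom onto separate cache lines so writes by the
// owner and writes by thieves never invalidate each other's line.
// 64 bytes is a common cache-line size (assumed, not queried).
struct PaddedIndices {
    alignas(64) std::atomic<std::uint64_t> m_top{0};    // written by thieves
    alignas(64) std::atomic<std::uint64_t> m_bottom{0}; // written by the owner
};

static_assert(alignof(PaddedIndices) == 64, "struct starts on a cache line");
static_assert(sizeof(PaddedIndices) == 128, "each index gets its own 64-byte line");
```

The cost is a little wasted memory (120 of those 128 bytes are padding), which is a bargain compared to a cache-line ping-pong on every operation.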

Building the Foundation: The Locked Deque

Before we dive into the complexity of lock-free atomics, let’s build a reference implementation using a lock. This establishes our correctness baseline and introduces the circular buffer logic we’ll use later.

We use a circular buffer because it avoids the dynamic allocation overhead of a linked list. We track the range of valid items using two indices: m_top (start) and m_bottom (end).

The Power of Two Trick

You’ll see this pattern in a lot of high-performance systems:

m_capacity = next_pow2(capacity_hint);
m_mask = m_capacity - 1;

And the access: m_buffer[m_bottom & m_mask]

Why do we do this?

A circular buffer needs to wrap around. The standard way to do this is with the modulo operator: index % capacity. But modulo is actually a division operation in disguise, and division is one of the slowest math instructions a CPU can execute.

However, if your capacity is a power of two (e.g., 16, 32, 64), you can use a bitwise AND instead.

Let’s look at the binary. If capacity is 16 (10000 in binary), then mask is 15 (01111). If our index is 18 (10010):

10010 & 01111 = 00010 — index 18 wraps to slot 2, exactly what 18 % 16 would produce.

The bitwise AND simply chops off the higher bits, keeping the index perfectly wrapped within the range [0, 15]. A single AND typically executes in one cycle, while an integer division can take dozens. By forcing our capacity to be a power of two, we save a division on every single push and pop.
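To make the equivalence concrete, here is a one-line helper (the name wrap is mine) that does exactly what the deque's index arithmetic will do:

```cpp
#include <cassert>
#include <cstddef>

// Wrap a monotonically increasing index into a power-of-two capacity.
// Identical result to index % capacity_pow2, but a single AND instead
// of a division.
inline std::size_t wrap(std::size_t index, std::size_t capacity_pow2) {
    return index & (capacity_pow2 - 1);
}
```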

#include <cstddef> // std::size_t
#include <cstdint> // INTPTR_MAX, INT64_MAX

// A helper to round any number up to the next power of two.
static std::size_t next_pow2(std::size_t x) {
    if (x == 0) return 1;
    --x;
    x |= x >> 1;  x |= x >> 2;  x |= x >> 4;
    x |= x >> 8;  x |= x >> 16;
    #if INTPTR_MAX == INT64_MAX
    x |= x >> 32;
    #endif
    return x + 1;
}

#include <cstdint> // std::uint64_t
#include <mutex>   // std::mutex, std::unique_lock
#include <vector>  // std::vector

template<typename T>
class DequeLocked {
public:
    explicit DequeLocked(std::size_t capacity_hint) {
        m_capacity = next_pow2(capacity_hint);
        m_mask = m_capacity - 1;
        m_buffer.resize(m_capacity);
    }

    // Owner pushes to the bottom (LIFO)
    bool push_bottom(T item) {
        std::unique_lock<std::mutex> lock(m_mtx);
        // Check for overflow
        if (m_bottom - m_top >= m_capacity) return false;
        
        m_buffer[m_bottom & m_mask] = item;
        m_bottom++;
        return true;
    }

    // Owner pops from the bottom (LIFO)
    bool pop_bottom(T* out) {
        std::unique_lock<std::mutex> lock(m_mtx);
        if (m_bottom == m_top) return false; // Empty
        
        m_bottom--;
        *out = m_buffer[m_bottom & m_mask];
        return true;
    }

    // Thieves steal from the top (FIFO)
    bool steal_top(T* out) {
        std::unique_lock<std::mutex> lock(m_mtx);
        if (m_bottom == m_top) return false; // Empty
        
        *out = m_buffer[m_top & m_mask];
        m_top++;
        return true;
    }

private:
    std::vector<T> m_buffer;
    std::size_t m_capacity = 0, m_mask = 0;
    
    // These will need to be atomic later!
    std::uint64_t m_top = 0;
    std::uint64_t m_bottom = 0;
    
    std::mutex m_mtx;
};

This code works. It’s thread-safe. But it’s slow because every push, pop, and steal fights for the same m_mtx.

In the next part, we are going to remove that mutex. To do that, we have to understand the dark art of Memory Ordering.

