The Performance Trap We All Fall Into
You have a shiny new thread pool. You have a mountain of small tasks to run in parallel. The obvious next step is to create a single, shared queue where all threads can submit work and grab new tasks. It feels clean. It feels fair.
It is a trap.
This central queue becomes a bottleneck, silently strangling your performance. Imagine a stadium with thousands of fans (your tasks) trying to exit through a single turnstile (your queue). It doesn’t matter how fast the fans run; they are all serialized at that one choke point. This is what happens to your threads.
Let’s dissect this turnstile and understand the two invisible monsters that devour your performance.
Monster 1: The Lock
To prevent chaos, a simple shared queue needs a lock (like a std::mutex). Only one thread can hold the lock at a time. When Thread A is pushing a task, Threads B, C, and D are all lined up, waiting. Your parallel system has become a single-file line. The very tool you used to enable concurrency is now forcing your threads to wait their turn.
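To make the turnstile concrete, here is a minimal sketch of the pattern being described (the class name is illustrative, not from any real library): one std::queue guarded by one std::mutex. It is perfectly correct, and every producer and consumer in the whole pool serializes on that single lock.

```cpp
#include <mutex>
#include <optional>
#include <queue>
#include <utility>

// A hypothetical shared task queue: correct, but every thread in the
// pool must pass through the single mutex -- the "turnstile".
template <typename T>
class SharedQueue {
public:
    void push(T item) {
        std::lock_guard<std::mutex> lock(mtx_);  // all producers wait here
        items_.push(std::move(item));
    }

    std::optional<T> pop() {
        std::lock_guard<std::mutex> lock(mtx_);  // all consumers wait here too
        if (items_.empty()) return std::nullopt;
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

private:
    std::mutex mtx_;
    std::queue<T> items_;
};
```

With four workers hammering push and pop, the lock, not the work, becomes the limiting resource.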
Monster 2: The Shared Notebook
This monster is more subtle and far more vicious. To understand it, we need to know one simple thing about modern CPUs: they have their own private, super-fast notepads called caches. Instead of working directly with slow main memory (RAM), a CPU core copies the data it needs onto its personal notepad.
But here’s the catch: data is not copied byte-by-byte. It’s copied in chunks called cache lines, which are typically 64 bytes long.
Now, imagine two coworkers, Alice and Bob, sitting at different desks. They need to keep track of their own, separate to-do lists, but for some reason, both lists are written in the same physical notebook.
- Alice needs to add an item to her list on page 1. She grabs the notebook and starts writing.
- Bob now needs to add an item to his list on page 2. He can’t. The notebook is at Alice’s desk. He has to interrupt her, wait for her to finish, and have the entire notebook passed over to him.
- A moment later, Alice needs to update her list again. The process repeats: she has to interrupt Bob and get the notebook back.
They are constantly fighting over the notebook, even though they are working on completely separate lists! The notebook itself is the bottleneck.
This is exactly what happens inside your computer. The notebook is the cache line. Alice’s list and Bob’s list are two different variables (like our queue’s head and tail pointers) that happen to live next to each other in memory, inside the same 64-byte chunk. Even though threads are modifying different variables, they are fighting over the same cache line, causing a massive, hidden slowdown called false sharing.
A New Way of Thinking: From a Pile of Bricks to a Blueprint
The turnstile problem is often caused by thinking about work as a static “pile of bricks.” A main thread acts like a foreman, figures out all 10,000 bricks that need to be laid, and dumps them into a central pile for the workers. This model guarantees maximum contention.
The work-stealing model uses a more powerful paradigm. Instead of a pile of bricks, you give a worker a blueprint.
Imagine the main task is UpdateSceneGraph(rootNode). This is the blueprint. You give this single task to Worker 1.
- Worker 1 (The Owner) starts executing the task. It looks at the rootNode and sees it has four children: A, B, C, and D. Instead of processing them sequentially, it realizes these are four large, independent sub-problems.
- It generates four new sub-tasks: UpdateSceneGraph(A), UpdateSceneGraph(B), UpdateSceneGraph(C), and UpdateSceneGraph(D).
- It pushes these four “blueprints” for the sub-trees onto its own personal to-do list.
- It then immediately grabs the last one it pushed, UpdateSceneGraph(D), and starts working on it. As it processes node D, it will find D’s children (D1, D2, …) and push tasks for them onto its list, continuing to work deeper into the tree.
This is the core idea: a task generates more, smaller tasks for its own thread. Work is not a static pile; it’s a tree that grows organically as it’s explored.
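The owner’s half of this pattern can be sketched without any thread pool at all. Here is a hypothetical single-worker version (names are illustrative): a loop that pops its most recently pushed task and pushes new sub-tasks as it discovers children, exploring the tree depth-first.

```cpp
#include <vector>

// A hypothetical scene-graph node: a payload plus child nodes.
struct Node {
    int value = 0;
    std::vector<Node> children;
};

// One worker exploring its own branch of the task tree. Pushing
// children and then popping the newest entry (LIFO) mirrors the
// owner's fast path described in the text.
int update_subtree(const Node& root) {
    int total = 0;
    std::vector<const Node*> todo{&root};     // the worker's personal to-do list
    while (!todo.empty()) {
        const Node* n = todo.back();          // take the most recent task...
        todo.pop_back();
        total += n->value;                    // ...process it...
        for (const Node& child : n->children) // ...and spawn sub-tasks
            todo.push_back(&child);
    }
    return total;
}
```

In a real work-stealing pool the to-do list becomes the per-worker deque built below, and idle workers can take entries from the other end of it.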
The Magic of the Steal
Now, what happens when Worker 2 becomes idle?
- Worker 2 (The Helper/Thief) sees that Worker 1 has a list of pending tasks.
- It steals one task from Worker 1’s list. But which one? It steals the oldest one: UpdateSceneGraph(A).
- This is the “Aha!” moment. Worker 2 didn’t just steal a single brick. It stole the entire blueprint for subtree A.
- Worker 2 now starts executing UpdateSceneGraph(A). It discovers node A’s children (A1, A2, …) and starts generating its own local sub-tasks, pushing them onto its own to-do list.
The thief has become an owner. It is now self-sufficient, exploring its own branch of the work tree without needing to bother anyone else. The initial steal distributed a large, valuable chunk of the problem, dramatically reducing the need for future stealing. This is how the system scales.
Designing for the Task Tree: The Deque’s LIFO/FIFO Magic
To support this model, we need a special kind of to-do list. A deque (double-ended queue) is designed for this exact purpose.
- The Owner works at the bottom (LIFO): When a worker spawns new sub-tasks, it pushes them onto the bottom of its own deque. It then immediately pops a task from the bottom to work on. This Last-In, First-Out (LIFO) behavior is like a stack. It ensures the worker is always processing its most recently generated, cache-hot task. This is its fast path.
- Helpers steal from the top (FIFO): When another worker runs out of tasks, it steals from the top, the oldest task in the deque. This First-In, First-Out (FIFO) behavior gives the helper a larger, older chunk of the work tree (like subtree ‘A’ in our example), which is less likely to be contended.
This LIFO/FIFO split is the deque’s secret sauce. It’s a design that perfectly maps to the Task Tree model, maximizing local performance while enabling balanced sharing.
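Ignoring thread safety for a moment, the two access patterns can be demonstrated with a plain std::deque: the owner uses push_back/pop_back, while a thief uses pop_front. This single-threaded sketch is only an illustration of the access pattern, not the concurrent structure we build below.

```cpp
#include <deque>
#include <string>
#include <utility>

// Single-threaded illustration of the LIFO/FIFO split.
// (A real work-stealing deque also needs synchronization.)
struct ToDoList {
    std::deque<std::string> tasks;

    void owner_push(std::string t) { tasks.push_back(std::move(t)); }

    std::string owner_pop() {      // LIFO: newest, cache-hot task
        std::string t = tasks.back();
        tasks.pop_back();
        return t;
    }

    std::string thief_steal() {    // FIFO: oldest, largest subtree
        std::string t = tasks.front();
        tasks.pop_front();
        return t;
    }
};
```

If the owner pushes A, B, C, D in order, its own pop returns D, while a thief’s steal returns A, exactly the split described above.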
Building Our First Deque: The Training Wheels Version
Before we can build a lock-free beast, we must first build a simple, correct, and easy-to-understand version. This “training wheels” deque will use a single lock to guarantee correctness. It will be slow, but it will teach us the fundamental behavior we need to preserve when we remove the lock later.
The Foundation: A Ring Buffer
A deque needs a place to store its items. We’ll use a ring buffer: a fixed-size array that pretends to be infinite by wrapping around on itself. The trick to making this fast is to set its capacity to a power of two (e.g., 16, 32, 1024).
If the capacity is 2^k, we can find the array slot for any ever-increasing index i with a single, fast bitwise AND operation:
slot = i & (capacity - 1)
This lets our pointers grow forever without needing to reset them.
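A quick sanity check of the mask trick: for a power-of-two capacity, i & (capacity - 1) keeps exactly the low k bits of i, which is the remainder of dividing by the capacity. The indices can grow without bound while the slots wrap around.

```cpp
#include <cstddef>
#include <cstdint>

// With capacity = 2^k, masking with (capacity - 1) keeps the low k bits
// of the index -- the same result as index % capacity, but cheaper.
constexpr std::size_t slot_of(std::uint64_t index, std::size_t capacity) {
    return static_cast<std::size_t>(index & (capacity - 1));
}

static_assert(slot_of(0, 8) == 0, "first slot");
static_assert(slot_of(7, 8) == 7, "last slot before the wrap");
static_assert(slot_of(8, 8) == 0, "wraps back to the start");
static_assert(slot_of(1000003, 8) == 1000003 % 8, "matches modulo");
```

This is also why the capacity must be a power of two: for any other size, the AND and the modulo disagree and slots would be skipped.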
Assembling the Pieces
Now let’s put these concepts into a C++ struct.
// deque_locked.h
#pragma once

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

// A helper to round any number up to the next power of two.
inline std::size_t next_pow2(std::size_t x) {
    if (x == 0)
        return 1;
    --x;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
#if INTPTR_MAX == INT64_MAX
    x |= x >> 32;
#endif
    return x + 1;
}
struct DequeLocked {
    // The ring buffer itself.
    void** buffer = nullptr;
    std::size_t capacity = 0;
    std::size_t mask = 0;

    // The two pointers. They grow forever.
    std::uint64_t top = 0;
    std::uint64_t bottom = 0;

    // The single guardian for this version.
    std::mutex mtx;
};
// Function to initialize our deque.
bool dq_init(DequeLocked* q, std::size_t capacity_hint) {
    assert(q);
    q->capacity = next_pow2(capacity_hint);
    q->mask = q->capacity - 1;
    // Note: plain new[] throws std::bad_alloc on failure rather than
    // returning nullptr, so if we reach the return, allocation succeeded.
    q->buffer = new void*[q->capacity];
    return true;
}
// Function to clean up.
void dq_destroy(DequeLocked* q) {
    if (!q) {
        return;
    }
    delete[] q->buffer;
    q->buffer = nullptr; // make a double destroy harmless
}
Making It Work: Push, Pop, and Steal
With our structure defined, let’s implement the core operations.
Pushing to the Bottom (Owner’s Action)
The owner adds its newly spawned sub-tasks to the bottom of the deque.
bool dq_push_bottom(DequeLocked* q, void* item) {
    std::unique_lock<std::mutex> lock(q->mtx);
    // If the deque is full, return false.
    if (q->bottom - q->top >= q->capacity) {
        return false;
    }
    // Find the slot using our mask trick.
    std::size_t slot = q->bottom & q->mask;
    q->buffer[slot] = item;
    // Advance the bottom pointer.
    q->bottom++;
    return true;
}
Popping from the Bottom (Owner’s Action)
The owner immediately takes the task it just pushed. This LIFO behavior keeps its working data hot in the cache.
bool dq_pop_bottom(DequeLocked* q, void** out) {
    std::unique_lock<std::mutex> lock(q->mtx);
    // If the deque is empty, return false.
    if (q->bottom == q->top) {
        return false;
    }
    // Move the bottom pointer back to the item we want to take.
    q->bottom--;
    std::size_t slot = q->bottom & q->mask;
    *out = q->buffer[slot];
    return true;
}
Stealing from the Top (Helper’s Action)
An idle helper steals the oldest task from the top, giving it a large, independent piece of the task tree to work on.
bool dq_steal_top(DequeLocked* q, void** out) {
    std::unique_lock<std::mutex> lock(q->mtx);
    // If the deque is empty, return false.
    if (q->bottom == q->top) {
        return false;
    }
    // Find the oldest item at the top.
    std::size_t slot = q->top & q->mask;
    *out = q->buffer[slot];
    // Advance the top pointer.
    q->top++;
    return true;
}
Where We Are and What’s Next
We have successfully built a deque that is correct.
- Its behavior is perfectly matched to the “Task Tree” paradigm: LIFO for the owner, FIFO for helpers.
- The logic is easy to follow because the single lock prevents any strange race conditions.
However, it is also slow. We escaped the single shared queue problem only to put a lock on every single personal deque. If multiple threads try to steal from the same victim, they will line up on that victim’s lock.
This is an essential first step. We have defined the correct behavior. In the next parts of this series, we will take the training wheels off. We will remove the lock for the owner’s operations, making their common push/pop path incredibly fast using the magic of atomic operations and carefully chosen memory ordering.