We have the Owner’s code working. Now we need to implement the Thief’s logic, steal_top.
This function is called by other threads when they run out of work. They look at our deque and try to grab the oldest task.
The Thief’s steal_top
The logic here mirrors the Owner’s pop_bottom. It also needs to participate in the seq_cst handshake to handle the race for the last item.
A note on the ABA Problem: In many lock-free data structures, if a value changes from A to B and back to A, a thread might think nothing changed. Here, because our top and bottom indices are strictly increasing 64-bit integers, they will practically never wrap around (it would take centuries). This makes the ABA problem impossible for all practical purposes, simplifying our logic significantly compared to a pointer-based stack.
```cpp
template<typename T>
bool WSDeque<T>::steal_top(T* out) {
    // 1. Load `top`. ACQUIRE ensures we see the latest state.
    std::uint64_t t = m_top.load(std::memory_order_acquire);

    // 2. FENCE. This matches the fence in `pop_bottom`.
    //    It ensures our read of `top` happens before we read `bottom`.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    // 3. Load `bottom`. ACQUIRE ensures that if we see a new bottom,
    //    we also see the data written to the buffer.
    std::uint64_t b = m_bottom.load(std::memory_order_acquire);

    if (t < b) {
        // There is at least one item.
        // It is safe to read from the buffer *before* the CAS because
        // we know the item exists at index `t`.
        T item = m_buffer[t & m_mask];

        // 4. CAS. We try to increment `top` to claim the item.
        //    This is the final arbiter. If multiple thieves (or the owner)
        //    are fighting for this item, only one CAS will succeed.
        if (m_top.compare_exchange_strong(t, t + 1,
                std::memory_order_seq_cst, std::memory_order_relaxed)) {
            // Success! We stole the item.
            *out = item;
            return true;
        }
    }
    // Failed: either the deque was empty, or we lost the race.
    return false;
}
```
How It All Fits Together
The correctness of this deque relies on the interactions between the fences and atomic operations.
1. Push vs. Steal (Producer-Consumer)
- Owner: Writes data -> `release` store to `bottom`.
- Thief: `acquire` load of `bottom` -> reads data.
- Result: Safe. The thief is guaranteed to see the valid data if it sees the updated index.
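This release/acquire handshake can be demonstrated in isolation. In the sketch below, the names `payload`, `published`, and `steal_payload` are illustrative stand-ins (not part of the deque): `payload` models a slot in the buffer, and `published` models `bottom`:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

int payload = 0;                            // plain, non-atomic "buffer slot"
std::atomic<std::uint64_t> published{0};    // stands in for `bottom`

int steal_payload() {
    std::thread owner([] {
        payload = 42;                                  // write the data...
        published.store(1, std::memory_order_release); // ...then publish the index
    });
    int seen = 0;
    std::thread thief([&] {
        while (published.load(std::memory_order_acquire) == 0) { /* spin */ }
        // The acquire load synchronizes-with the release store, so the
        // plain write to `payload` is guaranteed to be visible here.
        seen = payload;
    });
    owner.join();
    thief.join();
    return seen;
}
```

Without the release/acquire pair (e.g., with `memory_order_relaxed` on both sides), the thief could legally observe the new index but stale data.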
2. Steal vs. Steal (Thieves Fighting)
- Two thieves see the same `top`.
- Both try to CAS `top` -> `top + 1`.
- Result: The hardware guarantees only one CAS succeeds. The winner gets the task; the loser gets nothing.
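The "exactly one winner" guarantee is easy to verify in isolation. In this minimal sketch (`race_for_item` is an illustrative name, not deque code), two threads race a CAS on the same expected value of `top`:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Race two "thieves" on the same expected value of `top`.
// Returns how many CAS operations succeeded.
int race_for_item() {
    std::atomic<std::uint64_t> top{0};
    std::atomic<int> wins{0};

    auto thief = [&] {
        std::uint64_t expected = 0;          // both thieves read the same `top`
        if (top.compare_exchange_strong(expected, 1,
                std::memory_order_seq_cst, std::memory_order_relaxed)) {
            wins.fetch_add(1);               // only one CAS can move 0 -> 1
        }
    };

    std::thread a(thief), b(thief);
    a.join();
    b.join();
    return wins.load();
}
```

No matter how the threads interleave, `race_for_item()` returns 1: whichever CAS executes second finds `top == 1`, not the expected `0`, and fails.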
3. Pop vs. Steal (The Critical Race)
- Owner: Decrements `bottom`, runs `seq_cst` fence, checks `top`.
- Thief: Runs `seq_cst` fence, checks `bottom`.
- Result: The `seq_cst` fences enforce a global order.
  - If the Owner's fence runs first, the Thief sees the decremented `bottom` and backs off.
  - If the Thief's fence runs first, the Owner sees the old `top` (or fails the CAS) and backs off.
  - It is impossible for both to think they won the last item.
Practical Tips for the Scheduler
Now that you have a deque, you need a scheduler to use it. Here are two tips for building a robust work-stealing thread pool.
1. Random Victim Selection
When a thief needs to steal, which deque should it target?
Don’t use a complex heuristic. Just pick a random thread. It’s statistically robust and simple.
Crucial: Do not use a shared stateful random number generator (like std::mt19937 or rand()) protected by a lock. That introduces the same contention we just fought to remove. Instead, use a thread_local generator for each thread or a stateless hash function.
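A minimal sketch of contention-free victim selection. The helper name `pick_victim` and its parameters (`self`, `num_threads`, assumed `>= 2`) are illustrative, not part of the deque:

```cpp
#include <cstddef>
#include <random>

// Hypothetical helper: pick a random victim index, excluding ourselves.
// Assumes num_threads >= 2.
std::size_t pick_victim(std::size_t self, std::size_t num_threads) {
    // One generator per thread: no locks, no shared state, no contention.
    thread_local std::mt19937 rng{std::random_device{}()};

    // Draw from [0, num_threads - 2], then shift to skip our own index.
    std::uniform_int_distribution<std::size_t> dist(0, num_threads - 2);
    std::size_t v = dist(rng);
    return (v >= self) ? v + 1 : v;
}
```

The `thread_local` keyword gives each thread its own `std::mt19937` instance on first use, so no synchronization is ever needed around the generator.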
2. Exponential Backoff
If a thief tries to steal from everyone and fails, the system is likely under load (or empty). Don’t spin tightly in a loop. You’ll burn CPU cycles and slow down the working threads by contending for cache lines. Implement an exponential backoff:
- Yield (`std::this_thread::yield()`).
- Sleep for 1 microsecond.
- Sleep for 2 microseconds... up to a cap (e.g., 1 ms).
- Reset the backoff as soon as you find work.
Conclusion
We started this series with a problem: a dynamic, unpredictable task tree that couldn’t be efficiently managed by a single shared queue. We saw how locks create bottlenecks and how false sharing silently kills performance.
By building this lock-free deque, we’ve created a structure that allows every thread to work independently on its own branch of that tree, only interacting when absolutely necessary. This is the engine of modern parallelism, powering systems like Intel TBB and Go’s scheduler.