We have the Owner’s code working. Now we need to implement the Thief’s logic, steal_top.
This function is called by other threads when they run out of work. They look at our deque and try to grab the oldest task.
The Thief’s steal_top
The logic here mirrors the Owner’s pop_bottom. It also needs to participate in the seq_cst handshake to handle the race for the last item.
A note on the ABA Problem: In many lock-free data structures, if a value changes from A to B and back to A, a thread might think nothing changed. Here, because our top and bottom indices are strictly increasing 64-bit integers, they will practically never wrap around (it would take centuries). This makes the ABA problem impossible for all practical purposes, simplifying our logic significantly compared to a pointer-based stack.
```cpp
template<typename T>
bool WSDeque<T>::steal_top(T* out) {
    // 1. Load `top`. ACQUIRE ensures we see the latest state.
    std::uint64_t t = m_top.load(std::memory_order_acquire);

    // 2. FENCE. This matches the fence in `pop_bottom`.
    //    It ensures our read of `top` happens before we read `bottom`.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    // 3. Load `bottom`. ACQUIRE ensures that if we see a new bottom,
    //    we also see the data written to the buffer.
    std::uint64_t b = m_bottom.load(std::memory_order_acquire);

    if (t < b) {
        // There is at least one item.
        // It is safe to read from the buffer *before* the CAS because
        // we know the item exists at index `t`.
        T item = m_buffer[t & m_mask];

        // 4. CAS. We try to increment `top` to claim the item.
        //    This is the final arbiter. If multiple thieves (or the owner)
        //    are fighting for this item, only one CAS will succeed.
        if (m_top.compare_exchange_strong(t, t + 1,
                std::memory_order_seq_cst, std::memory_order_relaxed)) {
            // Success! We stole the item.
            *out = item;
            return true;
        }
    }
    // Failed: either the deque was empty, or we lost the race.
    return false;
}
```
How It All Fits Together
The correctness of this deque relies on the interactions between the fences and atomic operations.
1. Push vs. Steal (Producer-Consumer)
- Owner: Writes data -> `release` store to `bottom`.
- Thief: `acquire` load of `bottom` -> reads data.
- Result: Safe. The thief is guaranteed to see the valid data if it sees the updated index.
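This release/acquire handshake can be demonstrated in isolation. In the sketch below, the names `payload`, `published`, and `steal_payload` are illustrative stand-ins (not part of the deque): `payload` models a slot in the buffer, and `published` models `bottom`:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

int payload = 0;                            // plain, non-atomic "buffer slot"
std::atomic<std::uint64_t> published{0};    // stands in for `bottom`

int steal_payload() {
    std::thread owner([] {
        payload = 42;                                  // write the data...
        published.store(1, std::memory_order_release); // ...then publish the index
    });
    int seen = 0;
    std::thread thief([&] {
        while (published.load(std::memory_order_acquire) == 0) { /* spin */ }
        // The acquire load synchronizes-with the release store, so the
        // plain write to `payload` is guaranteed to be visible here.
        seen = payload;
    });
    owner.join();
    thief.join();
    return seen;
}
```

Without the release/acquire pair (e.g., with `memory_order_relaxed` on both sides), the thief could legally observe the new index but stale data.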
2. Steal vs. Steal (Thieves Fighting)
- Two thieves see the same `top`.
- Both try to CAS `top` -> `top + 1`.
- Result: The hardware guarantees only one CAS succeeds. The winner gets the task; the loser gets nothing.
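The "exactly one winner" guarantee is easy to verify in isolation. In this minimal sketch (`race_for_item` is an illustrative name, not deque code), two threads race a CAS on the same expected value of `top`:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Race two "thieves" on the same expected value of `top`.
// Returns how many CAS operations succeeded.
int race_for_item() {
    std::atomic<std::uint64_t> top{0};
    std::atomic<int> wins{0};

    auto thief = [&] {
        std::uint64_t expected = 0;          // both thieves read the same `top`
        if (top.compare_exchange_strong(expected, 1,
                std::memory_order_seq_cst, std::memory_order_relaxed)) {
            wins.fetch_add(1);               // only one CAS can move 0 -> 1
        }
    };

    std::thread a(thief), b(thief);
    a.join();
    b.join();
    return wins.load();
}
```

No matter how the threads interleave, `race_for_item()` returns 1: whichever CAS executes second finds `top == 1`, not the expected `0`, and fails.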
3. Pop vs. Steal (The Critical Race)
- Owner: Decrements `bottom`, runs `seq_cst` fence, checks `top`.
- Thief: Runs `seq_cst` fence, checks `bottom`.
- Result: The `seq_cst` fences enforce a global order.
  - If the Owner's fence runs first, the Thief sees the decremented `bottom` and backs off.
  - If the Thief's fence runs first, the Owner sees the old `top` (or fails the CAS) and backs off.
  - It is impossible for both to think they won the last item.
Practical Tips for the Scheduler
Now that you have a deque, you need a scheduler to use it. Here are two tips for building a robust work-stealing thread pool.
1. Random Victim Selection
When a thief needs to steal, which deque should it target?
Don’t use a complex heuristic. Just pick a random thread. It’s statistically robust and simple.
Crucial: Do not use a shared stateful random number generator (like std::mt19937 or rand()) protected by a lock. That introduces the same contention we just fought to remove. Instead, use a thread_local generator for each thread or a stateless hash function.
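A minimal sketch of contention-free victim selection. The helper name `pick_victim` and its parameters (`self`, `num_threads`, assumed `>= 2`) are illustrative, not part of the deque:

```cpp
#include <cstddef>
#include <random>

// Hypothetical helper: pick a random victim index, excluding ourselves.
// Assumes num_threads >= 2.
std::size_t pick_victim(std::size_t self, std::size_t num_threads) {
    // One generator per thread: no locks, no shared state, no contention.
    thread_local std::mt19937 rng{std::random_device{}()};

    // Draw from [0, num_threads - 2], then shift to skip our own index.
    std::uniform_int_distribution<std::size_t> dist(0, num_threads - 2);
    std::size_t v = dist(rng);
    return (v >= self) ? v + 1 : v;
}
```

The `thread_local` keyword gives each thread its own `std::mt19937` instance on first use, so no synchronization is ever needed around the generator.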
2. Exponential Backoff
If a thief tries to steal from everyone and fails, the system is likely under load (or empty). Don’t spin tightly in a loop. You’ll burn CPU cycles and slow down the working threads by contending for cache lines. Implement an exponential backoff:
- Yield (`std::this_thread::yield()`).
- Sleep for 1 microsecond.
- Sleep for 2 microseconds... up to a cap (e.g., 1 ms).
- Reset the backoff as soon as you find work.
Conclusion
We started this series with a problem: a dynamic, unpredictable task tree that couldn’t be efficiently managed by a single shared queue. We saw how locks create bottlenecks and how false sharing silently kills performance.
By building this lock-free deque, we’ve created a structure that allows every thread to work independently on its own branch of that tree, only interacting when absolutely necessary. This is the engine of modern parallelism, powering systems like Intel TBB and Go’s scheduler.