Memory Magic Part 3: The Specialist's Toolkit


The Naive Pool

In Part 2, we built the Arena, which is ideal for objects with shared lifetimes. However, systems like particle effects or entity lists often require individual object lifetimes. Using an Arena for these wastes memory: dead objects cannot be reclaimed until the entire arena is reset.

To address this, we need a Pool Allocator. A pool manages objects of a fixed size, allowing us to recycle memory efficiently.

The simplest implementation is an array of slots with a boolean flag indicating active status.

struct Particle {
    bool is_active;
    float position[2];
    float velocity[2];
    float lifetime_remaining;
};

struct ParticlePool {
    Arena* backing_arena;
    Particle* slots;   // base of the slot array (first arena allocation)
    size_t capacity;   // slots created so far
};

void InitPool(ParticlePool* pool, Arena* backing_arena) {
    pool->backing_arena = backing_arena;
    pool->slots = nullptr;
    pool->capacity = 0;
}

// Freeing just flips the flag; the slot lingers as a "hole"
void PoolFree(Particle* p) {
    p->is_active = false;
}

Particle* PoolAlloc(ParticlePool* pool) {
    // O(N): scan every existing slot for a hole to reuse
    for (size_t i = 0; i < pool->capacity; ++i) {
        if (!pool->slots[i].is_active) {
            pool->slots[i].is_active = true;
            return &pool->slots[i];
        }
    }

    // No hole found: grow by one slot from the backing arena.
    // Assumes the pool is the arena's only client, so slots stay contiguous.
    Particle* p = (Particle*)ArenaAlloc(pool->backing_arena, sizeof(Particle));
    if (pool->slots == nullptr) pool->slots = p;
    p->is_active = true;
    pool->capacity++;
    return p;
}

This approach has significant flaws.

  1. Padding: The bool is_active often introduces padding bytes due to alignment, wasting memory per object (see the sketch after this list).
  2. O(N) Allocation: Finding a free slot requires a linear scan of the array. As the pool fills, allocation performance degrades linearly.
  3. Fragmentation: Freed slots create “holes” that are expensive to locate and reuse.
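
A quick way to quantify the padding flaw (a sketch; the exact numbers assume 4-byte floats and 4-byte alignment, and will vary by compiler and platform):

// The payload alone: 5 floats = 20 bytes.
struct ParticleData {
    float position[2];
    float velocity[2];
    float lifetime_remaining;
};

// With the bool flag in front, alignment typically pads each Particle to 24 bytes:
// 1 byte flag + 3 bytes padding + 20 bytes payload.
static_assert(sizeof(ParticleData) == 20, "assumes 4-byte floats");
static_assert(sizeof(Particle)     == 24, "assumes 4-byte alignment; platform-dependent");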

[Figure: Naive Pool]

The Free List

To address the O(N) allocation and fragmentation issues, we need a mechanism to locate free slots in O(1) time. A common initial approach is to maintain a separate, external container to track available slots.

Approach 1: The External Free List

We could maintain a std::vector<Particle*> containing pointers to every free slot.

// Hypothetical external list
std::vector<Particle*> free_slots;

void Free(Particle* p) {
    p->is_active = false;
    free_slots.push_back(p);
}

Particle* Alloc() {
    if (!free_slots.empty()) {
        Particle* p = free_slots.back();
        free_slots.pop_back();
        return p;
    }
    // ... allocate new ...
}

While this provides O(1) access, it introduces significant overhead:

  1. Double Allocation: The std::vector itself requires dynamic memory allocation, defeating the purpose of a custom allocator.
  2. Cache Inefficiency: The vector lives in a different region of memory than the pool, reducing cache locality.

Approach 2: The Intrusive Linked List

We can eliminate this overhead by observing that a “dead” particle’s memory is unused. We can repurpose this memory to store the free list node itself. This is known as an intrusive linked list.

By using a union, we can overlay a next_free pointer on top of the particle’s data fields. This ensures that the free list consumes zero additional memory.

struct Particle {
    // is_active is useful for debugging or iteration, but not strictly necessary for the free list itself
    bool is_active; 

    union {
        // State when is_active = true
        struct {
            float position[2];
            float velocity[2];
            float lifetime_remaining;
        } data;
        
        // State when is_active = false
        Particle* next_free;
    };
};

When a particle is freed, we write the address of the current free_list_head into its next_free field and update the head to point to this particle. This effectively pushes the slot onto a stack threaded through the unused memory.

[Figure: Free List Pool]

Implementation

The pool maintains a pointer to the head of the free list.

struct ParticlePool {
    Arena* backing_arena;
    Particle* free_list_head;
    size_t capacity;
};

void InitPool(ParticlePool* pool, Arena* backing_arena) {
    pool->backing_arena = backing_arena;
    pool->free_list_head = nullptr;
    pool->capacity = 0;
}

void PoolFree(ParticlePool* pool, Particle* p) {
    if (!p) return;

    p->is_active = false;
    // Push onto the free list stack
    p->next_free = pool->free_list_head;
    pool->free_list_head = p;
}

Particle* PoolAlloc(ParticlePool* pool) {
    // 1. Try to recycle from the free list
    if (pool->free_list_head) {
        Particle* recycled = pool->free_list_head;
        pool->free_list_head = recycled->next_free;
        recycled->is_active = true;
        return recycled;
    }

    // 2. Fallback: Allocate fresh memory from the arena
    pool->capacity++;
    Particle* new_particle = (Particle*)ArenaAlloc(pool->backing_arena, sizeof(Particle));
    if (new_particle) {
        new_particle->is_active = true;
    }
    return new_particle;
}

This provides O(1) allocation and deallocation. However, it does not preserve memory locality. Over time, active objects become scattered throughout the array, which can degrade cache performance during iteration.
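
A minimal usage sketch (assuming the Arena type from Part 2 is already set up): freeing a particle and then allocating again hands back the recycled slot.

Arena arena;                      // assume this is initialized as shown in Part 2
ParticlePool pool;
InitPool(&pool, &arena);

Particle* a = PoolAlloc(&pool);   // fresh slot from the arena
Particle* b = PoolAlloc(&pool);   // another fresh slot

PoolFree(&pool, a);               // a is pushed onto the free list
Particle* c = PoolAlloc(&pool);   // recycled: c == a, no arena growth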

The Packed Array

The Free List solves allocation speed, but it introduces a new problem: Iteration Performance.

In a Free List, active objects are scattered throughout the memory block. Iterating requires checking every slot’s is_active flag.

  1. Cache Misses: We load cache lines containing inactive data, wasting bandwidth.
  2. Branch Misprediction: The random pattern of active/inactive slots confuses the CPU’s branch predictor.
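
Concretely, the update loop over a free-list pool looks something like this (a sketch; it assumes the pool also exposes its slot storage as slots and its total slot count as capacity):

for (size_t i = 0; i < pool.capacity; ++i) {
    Particle* p = &pool.slots[i];
    if (!p->is_active) continue;  // unpredictable branch; the dead slot's cache line was loaded for nothing
    // ... update p ...
}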

For systems where iteration is the dominant operation (e.g., updating 10,000 particles per frame), we need Contiguity.

The “Swap and Pop” Strategy

To ensure optimal iteration, we must keep all active objects packed in a contiguous block at the start of the array. We achieve this with the “Swap and Pop” removal technique.

The invariant is: active_count always represents the number of contiguous active elements.
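
The code below assumes a pool shaped roughly like this (a sketch; the packed variant stores its particles contiguously and tracks how many are live). Allocation is simply an append at the end of the live range:

struct ParticlePool {
    Particle* particles;     // contiguous storage, e.g. carved from a backing arena
    size_t    active_count;  // the first active_count entries are the live ones
    size_t    capacity;      // total slots available
};

Particle* PoolAlloc(ParticlePool* pool) {
    if (pool->active_count == pool->capacity) return nullptr;  // pool is full
    return &pool->particles[pool->active_count++];             // append to the packed range
}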

void RemoveAt(ParticlePool* pool, size_t index) {
    size_t last_index = pool->active_count - 1;
    
    // Move the last element into the hole
    pool->particles[index] = pool->particles[last_index];
    
    pool->active_count--;
}

This operation fills the “hole” created by deletion with the last element, preserving the dense packing.

[Figure: Swap and Pop]

The result is a loop that compiles to a simple pointer increment, maximizing instruction throughput and cache efficiency.

for (size_t i = 0; i < pool.active_count; ++i) {
    Particle* p = &pool.particles[i];
    // ... update ...
}

The trade-off is pointer invalidation. When an object is moved to fill a hole, any external pointers to that object become invalid (dangling) or incorrect (pointing to the moved object). This makes the Packed Array unsuitable for systems where stable references are required.

The Slot Map (Generational Handles)

The Packed Array offers perfect iteration but suffers from Pointer Invalidation. When “Swap and Pop” moves an object, any pointer referring to it becomes invalid (pointing to the old location, which now holds a different object).

To support both fast iteration and stable references, we need a Slot Map. This architecture introduces a layer of indirection: instead of a raw pointer, we give the user a Handle.

A Handle is a composite key containing a stable Index and a Generation ID.

Step 1: Indirection

We decouple the “identity” of a particle from its physical location.

When we allocate a particle, we assign it a stable ID. This ID never changes, even if the particle’s data moves in the Dense Array.

Example:

  1. Allocate Particle A (ID 0) at Dense Index 0. sparse[0] = 0.

[Figure: Handle Example 1]

  2. Allocate Particle B (ID 1) at Dense Index 1. sparse[1] = 1.

[Figure: Handle Example 2]

  3. Allocate Particle C (ID 2) at Dense Index 2. sparse[2] = 2.

[Figure: Handle Example 3]

The Swap: If we free Particle B (ID 1), we must fill the hole at Dense Index 1 with the last element, Particle C.

  1. Move C’s data from Dense Index 2 to Dense Index 1.
  2. Update the Sparse Array: sparse[2] must now point to Dense Index 1.
  3. Update the Reverse Map: dense_to_sparse[1] is now 2 (ID of C).

[Figure: Handle Example 4]

This two-way mapping ensures handles remain valid even as the underlying data moves to maintain cache locality.
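
To make the walkthrough concrete, the three mappings end up in this state after the swap:

  1. particles (dense): [A, C], with active_count = 2.
  2. dense_to_sparse: [0, 2], since dense index 1 now belongs to ID 2 (C).
  3. sparse: ID 0 maps to dense index 0, ID 1 is free, and ID 2 maps to dense index 1.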

Step 2: The ABA Problem

Indirection alone is insufficient. If we free an ID and then immediately reuse it for a new object, an old handle referring to the previous object will now point to the new one. This is the ABA Problem, leading to subtle bugs where code modifies the wrong entity.
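
A sketch of the hazard with index-only handles (the functions here are hypothetical and have no generation check yet):

uint32_t target = AllocParticle();    // hypothetical: returns sparse index 5
FreeParticle(target);                 // slot 5 goes back on the free list
uint32_t explosion = AllocParticle(); // reuses slot 5 for a brand-new particle

// 'target' still holds 5, so code that uses it now silently
// reads and writes the explosion particle instead of the dead one.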

Step 3: Generational Handles

To solve this, we add a version number, or Generation, to each slot in the Sparse Array.

  1. Handle = Index + Generation: The handle given to the user contains both the Sparse Index and the Generation at the time of allocation.
  2. Validation: When accessing a slot, we compare Handle.Generation with Slot.Generation.
  3. Increment: When a slot is freed, we increment its Generation.

If the generations match, the handle is valid. If they differ, the slot has been reused, and we can safely reject the access.

Implementation

struct ParticleHandle {
    uint32_t sparse_index;
    uint32_t generation;
};

struct SparseEntry {
    uint32_t dense_index;
    uint32_t generation;
};

struct ParticlePool {
    Particle* particles;        // Dense data
    uint32_t* dense_to_sparse;  // Reverse mapping
    SparseEntry* sparse;        // Lookup table
    
    uint32_t active_count;
    uint32_t first_free_sparse; // Head of sparse free list

    ParticleHandle Alloc() {
        // 1. Get a free sparse slot
        uint32_t sparse_idx = first_free_sparse;
        first_free_sparse = sparse[sparse_idx].dense_index; // Pop from free list

        // 2. Append to dense array
        uint32_t dense_idx = active_count++;
        
        // 3. Link them
        sparse[sparse_idx].dense_index = dense_idx;
        dense_to_sparse[dense_idx] = sparse_idx;
        
        // 4. Return handle with current generation
        return { sparse_idx, sparse[sparse_idx].generation };
    }

    void Free(ParticleHandle handle) {
        if (!IsValid(handle)) return;

        uint32_t dense_idx = sparse[handle.sparse_index].dense_index;
        uint32_t last_dense = --active_count;

        // 1. Swap & Pop in Dense Array
        particles[dense_idx] = particles[last_dense];

        // 2. Update mappings for the moved particle
        uint32_t moved_sparse = dense_to_sparse[last_dense];
        sparse[moved_sparse].dense_index = dense_idx;
        dense_to_sparse[dense_idx] = moved_sparse;

        // 3. Retire the sparse slot
        sparse[handle.sparse_index].generation++; // Invalidate old handles
        sparse[handle.sparse_index].dense_index = first_free_sparse; // Push to free list
        first_free_sparse = handle.sparse_index;
    }

    bool IsValid(ParticleHandle handle) {
        return sparse[handle.sparse_index].generation == handle.generation;
    }
};
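
Two details are implied but not shown above: the sparse array must be chained into a free list at startup (Alloc pops first_free_sparse, and while a slot is free its dense_index field doubles as the “next free” link), and handle-to-pointer lookup goes through the sparse table. A minimal sketch, using hypothetical InitPool and Get helpers:

void InitPool(ParticlePool* pool, uint32_t max_particles) {
    // Assumes particles, dense_to_sparse, and sparse were already allocated
    // (e.g., from a backing arena) with max_particles entries each.
    pool->active_count = 0;
    pool->first_free_sparse = 0;
    for (uint32_t i = 0; i < max_particles; ++i) {
        pool->sparse[i].dense_index = i + 1;  // while free, dense_index acts as the "next free" link
        pool->sparse[i].generation = 0;
    }
}

Particle* Get(ParticlePool* pool, ParticleHandle handle) {
    if (!pool->IsValid(handle)) return nullptr;  // stale handle: generation mismatch
    return &pool->particles[pool->sparse[handle.sparse_index].dense_index];
}

The generation check is what defuses the ABA problem: Free increments the slot’s generation, so a handle issued before the free fails IsValid even if a later Alloc recycles the same sparse index.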

The Performance Showdown: By the Numbers

Now that we’ve built our arsenal of pool designs, let’s put them to the test. We’ll simulate a particle system under stress, measuring three key phases: bulk allocation and freeing, sustained churn (a steady mix of frees and new allocations across frames), and iteration over the live particles.

These benchmarks were run with varying total allocations, free rates, and iteration counts to expose each design’s strengths and weaknesses. We’ll walk through each test set, highlighting what the numbers reveal about trade-offs like cache efficiency, allocation speed, and iteration bottlenecks.

Ran on an M4 Pro 12-core MacBook.

Set 1: High Churn (95% Free) - 10,000 Allocs, 9,500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames

This set models a sparse system with lots of frees (e.g., short-lived particles exploding on impact).

| Design         | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
|----------------|----------------------|-----------------|---------------------|
| Naive Pool     | 12.99                | 35.91           | 2.79                |
| Free List Pool | 0.17                 | 0.44            | 5.48                |
| Packed Array   | 0.27                 | 0.39            | 0.31                |
| Slot Map       | 0.15                 | 1.19            | 0.31                |

Insights: The Naive Pool’s O(N) scans make it nearly two orders of magnitude slower on alloc/free than the O(1) designs. With only ~500 of 10,000 slots live, the Free List still walks every slot during iteration and pays roughly 17x more than the packed designs; the Slot Map’s indirection shows up only as a modest churn cost.

Set 2: Low Churn (5% Free) - 10,000 Allocs, 500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames

Here, we have a dense system with few frees (e.g., long-lived entities like NPCs).

| Design         | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
|----------------|----------------------|-----------------|---------------------|
| Naive Pool     | 12.94                | 613.37          | 6.51                |
| Free List Pool | 0.13                 | 0.51            | 7.63                |
| Packed Array   | 0.15                 | 0.80            | 6.92                |
| Slot Map       | 0.15                 | 1.69            | 7.00                |

Insights: With 9,500 of 10,000 slots live, every design touches roughly the same data during iteration (about 6.5 to 7.6 ms across the board), so the packed layouts lose their edge. The Naive Pool’s churn cost balloons because each allocation scans an almost-full array before it finds a hole.

Set 3: High Alloc/Free Cost - 100,000 Allocs, 99,500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames

Scaling up to test raw alloc/free scalability (e.g., massive simulations).

| Design         | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
|----------------|----------------------|-----------------|---------------------|
| Naive Pool     | 1244.46              | 34.84           | 25.13               |
| Free List Pool | 1.41                 | 0.46            | 32.69               |
| Packed Array   | 2.73                 | 0.39            | 0.31                |
| Slot Map       | 1.62                 | 1.19            | 0.31                |

Insights: At 100,000 allocations, the Naive Pool’s quadratic alloc/free behavior costs over a second, while the O(1) designs stay in the low milliseconds. Iteration splits the field again: with only 500 survivors, the Packed Array and Slot Map barely notice the larger pool, but the Free List must walk all 100,000 slots.

Set 4: High Iteration Cost - 10,000 Allocs, 9,500 Frees, 100,000 Iterations, 500 Churn/1,000 Frames

Cranking iterations to extremes (e.g., stress-testing update loops).

| Design         | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
|----------------|----------------------|-----------------|---------------------|
| Naive Pool     | 12.78                | 35.44           | 273.18              |
| Free List Pool | 0.15                 | 0.43            | 542.18              |
| Packed Array   | 0.26                 | 0.38            | 30.79               |
| Slot Map       | 0.15                 | 1.17            | 30.62               |

Insights: When iteration dominates, memory layout is everything: the packed designs iterate roughly 17x faster than the Free List, and the alloc/free differences between the O(1) designs become noise.

What We’ve Learned from the Numbers

The big lesson? There’s no universal “best”; it hinges on your priorities. High churn? Lean toward the Free List for raw efficiency. Iteration-heavy? The Packed Array or Slot Map will transform your frame times. Need safety too? The Slot Map reigns supreme as the balanced powerhouse. Always benchmark in your real app: these tests are a starting point, but your data patterns might flip the script!

Conclusion

We have moved from the raw power of the arena to specialized tools. The Free List, Packed Array, and Slot Map aren’t just algorithms. They are solutions to specific data problems.

Data-oriented design isn’t about abstract rules. It is about understanding hardware reality: cache lines, pointer indirection, and branch prediction. When you align your data structure with your access pattern, you get performance for free. These aren’t optimizations you sprinkle on top later. They are architectural decisions that make your system robust and scalable from the start.

