
Memory Magic Part 3: The Specialist's Toolkit


The Arena’s Edge Case

In Part 2, we forged the Arena, a tool of immense power built on a simple philosophy: allocate with lightning speed by bumping a pointer, and free everything all at once. This is perfect for short-lived data, or for data that shares a collective lifetime. But the Arena, like any specialist, has its edge cases. What if your system’s lifetime is long, yet its contents churn constantly, like particles spawning and dying by the thousands every second?

This calls for a new philosophy: specialization. We will engineer a Pool Allocator, designed to manage the rapid allocation and deallocation of objects that are all the same size. Our journey will take us through distinct designs, from simple to sophisticated. But remember that “better” is relative: each step solves specific problems, and the right choice depends on your exact circumstances, such as how often you iterate versus how much you need stable references.

Act 1: The Naive Pool and the Fragmentation Problem

Let’s start with the most basic possible implementation. A “pool” is just a collection of fixed-size slots. We can create this by taking a large chunk of memory from our arena and simply deciding to treat it as an array of Particle objects.

struct Particle {
    bool is_active;
    float position[2];
    float velocity[2];
    float lifetime_remaining;
};

// For now, the pool just tracks how many slots we've ever used.
struct ParticlePool {
    Arena* backing_arena;
    size_t capacity; // How many slots we can iterate over
};

void InitPool(ParticlePool* pool, Arena* backing_arena) {
    pool->backing_arena = backing_arena;
    pool->capacity = 0;
}

// Allocation is just a bump in the arena.
Particle* PoolAlloc(ParticlePool* pool) {
    pool->capacity++;
    return (Particle*)ArenaAlloc(pool->backing_arena, sizeof(Particle));
}

This works for spawning our initial burst of particles. But the moment a particle’s lifetime_remaining hits zero, we have a problem. What do we do? We can set its is_active flag to false, but its memory slot is still occupied. If we just call PoolAlloc again, it will give us a brand new slot from the arena, leaving the old one empty and unusable.

This creates “holes” in our memory block. This problem is a form of internal fragmentation: the memory is allocated to our pool, but it’s not being used for any active objects.
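To make the hole concrete, here is the only “free” this design can express. The Arena below is a simplified stand-in for the one from Part 2, not the real implementation:

```cpp
#include <cassert>
#include <cstddef>

struct Particle {
    bool is_active;
    float position[2];
    float velocity[2];
    float lifetime_remaining;
};

// Stand-in for the Part 2 arena: a fixed buffer and a bump offset.
struct Arena {
    alignas(16) unsigned char buffer[4096];
    size_t offset = 0;
};

void* ArenaAlloc(Arena* a, size_t size) {
    if (a->offset + size > sizeof(a->buffer)) return nullptr;
    void* ptr = a->buffer + a->offset;
    a->offset += size;
    return ptr;
}

// The only "free" we can express: mark the slot dead. Its memory
// stays claimed by the arena and is never handed out again.
void NaiveFree(Particle* p) {
    p->is_active = false;
}
```

Every such “free” leaves a permanent hole: the bump pointer never moves backward, so a spawn-and-die cycle of N particles consumes N slots forever.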

*Figure: Naive Pool*

We’ve got a basic pool, but fragmentation is wasting memory. Time to recycle…

Act 2: Solving Fragmentation with a Free List

The solution to our fragmentation problem is obvious: we need to recycle the memory. When a particle is freed, we should add it to a list of available slots. The next time PoolAlloc is called, it should check this free list first.

Step 1: The External List Idea

Our first thought might be to create a separate, external list to track the free slots.

// A thought experiment - don't do this!
std::vector<Particle*> free_particle_pointers;

When we free a particle, we push_back its pointer onto this vector. When we allocate, we check whether the vector is empty; if not, we pop a pointer off the back and reuse that slot. This works, but it feels wrong: we’re trying to build a lean, screaming-fast allocator, and we’ve just introduced a complex data structure that performs its own heap allocations behind the scenes. We are defeating the purpose.
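Spelled out, the thought experiment looks something like this. A sketch only; the fallback to the arena when the vector is empty is omitted:

```cpp
#include <cassert>
#include <vector>

struct Particle {
    bool is_active;
    float position[2];
    float velocity[2];
    float lifetime_remaining;
};

// External bookkeeping: a heap-allocated vector of free slots.
std::vector<Particle*> free_particle_pointers;

void ExternalFree(Particle* p) {
    p->is_active = false;
    free_particle_pointers.push_back(p); // may reallocate the vector!
}

Particle* ExternalAlloc() {
    // If empty, the caller would fall back to the arena (not shown).
    if (free_particle_pointers.empty()) return nullptr;
    Particle* p = free_particle_pointers.back();
    free_particle_pointers.pop_back();
    p->is_active = true;
    return p;
}
```

It works, but every push_back may trigger a heap reallocation inside the very allocator we are trying to keep lean.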

We can do better. We can do it with zero extra memory. But how? Where could we possibly store the pointers for our linked list without mallocing a single extra byte?

Step 2: The Zero-Overhead Trick

Let’s look closely at a “dead” particle. We’ve set its is_active flag to false. What about the rest of its data? Its position, velocity, and lifetime_remaining are now completely useless. They are just bytes occupying space, waiting to be overwritten. This is wasted space.

What if we could repurpose it?

What if, the moment a particle becomes inactive, we could treat the memory that used to hold its float data as a place to store a single pointer, a pointer to the next free slot in our list?

This is not just a hypothetical idea. The perfect tool for this job is the union. A union is a special kind of struct where all members occupy the same memory location. By placing our active data and our next_free pointer inside a union, we are explicitly telling the compiler that these two things are mutually exclusive: a particle’s memory is either used for game data or it’s used as a link in the free list, but never both at the same time.

This is the magic trick. We will build our free list inside the memory of the dead objects themselves.

// The new, improved Particle struct
struct Particle {
    bool is_active; // Always accessible

    union {
        // State when is_active = true
        struct {
            float position[2];
            float velocity[2];
            float lifetime_remaining;
        } data;
        // State when is_active = false
        Particle* next_free;
    };
};

Now, the memory of a dead particle can be repurposed to become a node in a singly-linked list. Our pool only needs to store a single pointer: the head of this list. This is a true zero-overhead free list.

*Figure: Free List Pool*

Step 3: Implementation

With this clever structure in place, the implementation becomes beautifully simple.

struct ParticlePool {
    Arena* backing_arena;
    Particle* free_list_head;
    size_t capacity;
};

// InitPool remains mostly the same, just initialize the head pointer.
void InitPool(ParticlePool* pool, Arena* backing_arena) {
    pool->backing_arena = backing_arena;
    pool->free_list_head = nullptr;
    pool->capacity = 0;
}

void PoolFree(ParticlePool* pool, Particle* p) {
    if (!p) return;

    p->is_active = false;
    // Prepend the freed particle to the free list.
    p->next_free = pool->free_list_head;
    pool->free_list_head = p;
}

Particle* PoolAlloc(ParticlePool* pool) {
    // First, try to recycle from the free list.
    if (pool->free_list_head) {
        Particle* recycled = pool->free_list_head;
        pool->free_list_head = recycled->next_free; // Pop from list
        recycled->is_active = true;
        return recycled;
    }

    // If the free list is empty, get fresh memory from the arena.
    pool->capacity++;
    Particle* new_particle = (Particle*)ArenaAlloc(pool->backing_arena,
                                                   sizeof(Particle));
    if (new_particle) {
        new_particle->is_active = true;
    }
    return new_particle;
}

This design is a huge improvement. Allocation and deallocation are incredibly fast, and it provides stable pointers: when PoolAlloc returns a Particle*, that pointer remains valid until PoolFree is called on it. Other systems can safely store the pointer, but beware: if you hold onto it after freeing and the memory is recycled for a new particle, it will silently refer to a different object. This is a potential source of bugs if not managed carefully.
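The hazard is easy to demonstrate. Here is a self-contained sketch of the pool above (the Arena is a simplified stand-in for the one from Part 2) showing a held pointer silently aliasing a recycled particle:

```cpp
#include <cassert>
#include <cstddef>

struct Particle {
    bool is_active;
    union {
        struct {
            float position[2];
            float velocity[2];
            float lifetime_remaining;
        } data;
        Particle* next_free;
    };
};

// Minimal stand-in for the Part 2 arena.
struct Arena {
    alignas(16) unsigned char buffer[4096];
    size_t offset = 0;
};

void* ArenaAlloc(Arena* a, size_t size) {
    if (a->offset + size > sizeof(a->buffer)) return nullptr;
    void* p = a->buffer + a->offset;
    a->offset += size;
    return p;
}

struct ParticlePool {
    Arena* backing_arena = nullptr;
    Particle* free_list_head = nullptr;
    size_t capacity = 0;
};

void PoolFree(ParticlePool* pool, Particle* p) {
    if (!p) return;
    p->is_active = false;
    p->next_free = pool->free_list_head; // reuse the dead body as a list node
    pool->free_list_head = p;
}

Particle* PoolAlloc(ParticlePool* pool) {
    if (pool->free_list_head) {
        Particle* recycled = pool->free_list_head;
        pool->free_list_head = recycled->next_free;
        recycled->is_active = true;
        return recycled;
    }
    pool->capacity++;
    Particle* fresh = (Particle*)ArenaAlloc(pool->backing_arena, sizeof(Particle));
    if (fresh) fresh->is_active = true;
    return fresh;
}
```

After a free followed by an alloc, the old pointer and the new one are the same address but conceptually different particles, which is exactly the bug described above.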

We’ve solved fragmentation, but is this always “better”? Not necessarily: if your system doesn’t have high churn, or if iteration isn’t a bottleneck, the naive pool might be simpler and sufficient. But for many cases, this free list is a solid step up… until we hit the next hurdle.

Act 3: The Hidden Flaw - The Iteration Problem

We have solved the recycling problem, but we have created a new, subtle performance issue, one that may or may not matter depending on your workload. For a particle system, the single most common operation is iterating over all active particles every frame to update their position and check their lifetime.

With our current design, how do we find all the active particles? They are scattered throughout the memory block allocated by the arena. We have no choice but to walk the entire block from start to finish and check each one.

Particle* all_particles = (Particle*)pool->backing_arena->base_ptr;
for (size_t i = 0; i < pool->capacity; ++i) {
    Particle* p = &all_particles[i];
    if (p->is_active) { // <-- This check is the performance killer
        // ... update and render ...
    }
}

If our pool has a capacity of 10,000 but only 500 particles are active, we are wasting 95% of our loop’s time just checking a flag. This is death by a thousand cuts. The unpredictable if statement can also cause CPU branch mispredictions, stalling the processor’s pipeline and making the problem even worse.

We have traded good iteration performance for easy allocation. This is a bad trade if iteration is your hot path, but if your system prioritizes fast alloc/free over frequent loops (e.g., sparse access patterns), the free list might still be ideal. In data-oriented design, we must optimize for the most common access pattern, and if that pattern is iteration, we need a different approach.

Act 4: The Packed Array - A Quest for Perfect Iteration

Our new goal is to design a data structure where all active objects are packed tightly together in one contiguous block. This would allow for a simple, cache-perfect loop with no wasted checks, if iteration is indeed your bottleneck.

This technique is called a Packed Array, and it’s maintained with a clever removal strategy often called “swap-and-pop”.

The rule is simple and absolute: All active particles live in a contiguous block at the beginning of the array, from index 0 to active_count - 1.

To free the particle at index i, we copy the last active particle (at index active_count - 1) into slot i, then decrement active_count. The hole is instantly filled, and the contiguous block is preserved. The update loop becomes a thing of beauty:

// Perfect iteration. No branches, perfect cache locality.
for (size_t i = 0; i < pool.active_count; ++i) {
    Particle* p = &pool.particles[i];
    // ... update and render ...
}
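The removal step that maintains this invariant can be sketched as follows. The PackedPool layout here is illustrative; it assumes the pool owns a plain array of particles and an active_count:

```cpp
#include <cassert>
#include <cstddef>

struct Particle {
    float position[2];
    float velocity[2];
    float lifetime_remaining;
};

struct PackedPool {
    Particle particles[64];
    size_t active_count = 0;
};

Particle* PackedAlloc(PackedPool* pool) {
    if (pool->active_count >= 64) return nullptr;
    return &pool->particles[pool->active_count++];
}

// Swap-and-pop: fill the hole at `index` with the last active particle,
// then shrink the active range by one.
void PackedFree(PackedPool* pool, size_t index) {
    assert(index < pool->active_count);
    pool->particles[index] = pool->particles[pool->active_count - 1];
    pool->active_count--;
}
```

Freeing by index keeps deallocation O(1) at the cost of one particle copy, and it is precisely why external indices and pointers go stale.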

However, this deallocation step involves copying the entire particle data from the end to the hole. If your particles are small, this is fine, but if they’re large or if you have extremely high churn (e.g., particles with very short lifetimes, dying and spawning thousands per frame), the repeated copying can add up to significant overhead. In such cases, the performance gain in iteration might be offset by slower deallocations, making this less ideal for ultra-short-lived objects compared to the free list’s near-zero-cost frees.

*Figure: Swap and Pop*

This is a performance win for iteration-heavy systems, but it’s not universally “better”: it complicates deallocation and introduces a new issue we’ll see next. If your use case involves infrequent iteration, or needs stable pointers more than speed, stick with the free list.

Act 5: The Unstable Pointer Crisis

We have achieved the perfect loop. We have eliminated wasted work and maximized cache performance. But in doing so, we have introduced a new, far more dangerous problem, one that could make this design unsuitable for systems requiring reliable external references.

Consider this scenario:

  1. The UI system wants to draw a special icon over a specific particle, so it asks for and stores a pointer: Particle* p_special = &pool.particles[5];
  2. A moment later, the physics system determines that particle 5 has hit a wall and frees it: PoolFree(&pool, 5).
  3. The pool, following the swap-and-pop rule, copies the data from the last particle (say, at index 99) into slot 5.
  4. Now, when the UI system dereferences its p_special pointer, it is no longer pointing to the particle it was given. It’s pointing to a completely different particle that just happens to occupy the same memory address.

This is a dangling pointer. We have solved the performance problem but created a correctness nightmare that can lead to silent data corruption, impossible-to-debug visual glitches, or outright crashes. We cannot safely pass raw pointers to objects in a packed array.

This is the ultimate architect’s dilemma: we have a conflict between iteration performance and reference stability. If your system doesn’t need external pointers (e.g., all logic is internal), the packed array is fantastic. But if stability is key, we need one more layer.

Act 6: The Final Architecture - Generational Handles

The answer is to introduce a layer of indirection. Instead of giving other systems a direct pointer (an object’s home address), we will give them a Handle (an object’s unique ID card). The handle can then be used to safely look up the object’s current address. This is the design to reach for when your problem demands both fast iteration and stable references.

Let’s build this step by step, starting with a simple example to illustrate the concepts.

Step 1: The Indirection Table

First, let’s understand why we need indirection. In the packed array, particles move around during deallocations: swap-and-pop changes their positions (indices) in the array. A direct pointer like &pool.particles[5] becomes invalid the moment something else gets swapped into index 5.

To fix this, we decouple the “identity” of a particle from its physical location. When we allocate a particle, we assign it a stable ID, a unique number that never changes, even if the particle’s data moves in the array. Think of the ID as a permanent name tag.

Now, to find a particle’s current location, we need a way to look up “Where is ID 42 right now?” That’s where the indirection table comes in. We’ll use an array called sparse_to_dense: the index of this array is the ID, and the value stored there is the current index in the packed array.

For example, suppose we have a small pool with capacity 4:

*Figures: Handle Example 1-3*

If we free Particle B (ID 1, at index 1), swap-and-pop moves Particle C into index 1 and reduces active_count to 2. We update the table: sparse_to_dense[2] = 1 (C is now at 1).

But wait, during the swap, how do we know which ID was at the end to update its entry? We need a reverse mapping: another array called dense_to_sparse, where the index is the packed position, and the value is the ID.

Before the free: dense_to_sparse[0] = 0 (A), [1] = 1 (B), [2] = 2 (C).

After swap: Move C’s data to index 1, then set dense_to_sparse[1] = 2, and update sparse_to_dense[2] = 1.

*Figure: Handle Example 4*

This two-way mapping ensures we can quickly find and update locations without slow searches. It’s like a phone book (sparse_to_dense for ID to address) and a reverse directory (dense_to_sparse for address to ID).

This indirection adds a small cost (array lookups) but allows handles to remain valid even as indices change.
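The A/B/C example can be traced directly in code. This sketch manipulates only the two mapping tables; the particle data itself is omitted, and the INVALID marker is illustrative:

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t INVALID = UINT32_MAX;

// Pool of capacity 4. IDs are the sparse indices: A=0, B=1, C=2 are active.
uint32_t sparse_to_dense[4] = {0, 1, 2, INVALID}; // ID -> packed index
uint32_t dense_to_sparse[4] = {0, 1, 2, INVALID}; // packed index -> ID
uint32_t active_count = 3;

// Free the particle with the given stable ID, keeping both tables in sync.
void FreeByID(uint32_t id) {
    uint32_t dense = sparse_to_dense[id];
    uint32_t last  = active_count - 1;

    // Swap-and-pop the (not shown) particle data, then repair the mappings.
    uint32_t moved_id = dense_to_sparse[last]; // whose data moved into `dense`?
    sparse_to_dense[moved_id] = dense;
    dense_to_sparse[dense] = moved_id;

    sparse_to_dense[id] = INVALID; // this ID no longer lives anywhere
    active_count--;
}
```

After FreeByID(1), sparse_to_dense[2] == 1 and dense_to_sparse[1] == 2, matching the figures: C now lives in B’s old slot.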

Step 2: The Stale Handle Problem - Using Generations

With just IDs and tables, there’s still a risk. Suppose we free particle A (ID 0), making ID 0 available. Then we allocate a new particle C, which reuses ID 0. An old handle with ID 0 now incorrectly points to C instead of the original A!

This is the “stale handle” problem: reused IDs can fool old handles.

To solve this, we add a generation count to each ID, stored directly in the sparse array alongside the dense index. Think of it as a version number for each slot.

Allocation stamps the handle with the slot’s current generation; freeing the slot increments that generation. A handle is valid only if its stored generation still matches the slot’s. This prevents reuse bugs: the old handle for A carries a stale generation and fails the check, while new handles for C succeed.
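A minimal sketch of the mechanism, stripped down to just IDs and generations (the names Issue, Invalidate, and IsLive are illustrative, not part of the final design):

```cpp
#include <cassert>
#include <cstdint>

struct Handle {
    uint32_t id;
    uint32_t generation;
};

// Per-slot generation counters for a tiny pool of 4 IDs.
uint32_t slot_generation[4] = {0, 0, 0, 0};

// Hand out a handle stamped with the slot's current generation.
Handle Issue(uint32_t id) {
    return {id, slot_generation[id]};
}

// Freeing the slot bumps its generation: every outstanding handle goes stale.
void Invalidate(uint32_t id) {
    slot_generation[id]++;
}

bool IsLive(Handle h) {
    return slot_generation[h.id] == h.generation;
}
```

The old handle for A (generation 0) fails the check once slot 0 is bumped to generation 1, while a fresh handle for C succeeds.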

Step 3: The Final Implementation

Putting it all together, this is the complete solution, often called a Slot Map. It gives us the best of all worlds: cache-perfect iteration and completely safe, stable external references, at least for problems that need both. If your system is simpler, earlier designs might be lighter and sufficient.

Here’s a simplified overview:

// Handle stores sparse index and generation
struct ParticleHandle {
    uint32_t sparse_index;
    uint32_t generation;
};

// Sparse entry holds current dense index and generation
struct SparseEntry {
    uint32_t dense_index;  // Current position in packed array
    uint32_t generation;   // Version number
};

struct ParticlePool {
    // ... packed particle_data array, active_count, etc.
    uint32_t* dense_to_sparse;  // Packed index -> sparse index
    SparseEntry* sparse;        // Sparse index -> {dense_index, generation}

    ParticleHandle Alloc() {
        // Find a free sparse index (e.g., next available or from free list)
        uint32_t sparse_id = get_free_sparse_id();

        // Place in packed array
        uint32_t dense_id = active_count++;
        sparse[sparse_id].dense_index = dense_id;
        dense_to_sparse[dense_id] = sparse_id;

        // Stamp the handle with the slot's current generation.
        // (Free() bumps the generation, so recycled slots issue new ones.)
        uint32_t gen = sparse[sparse_id].generation;

        // Initialize particle data at dense_id
        // ...

        return {sparse_id, gen};
    }

    void Free(ParticleHandle handle) {
        if (!IsValid(handle)) return;

        uint32_t dense_id = sparse[handle.sparse_index].dense_index;
        uint32_t last_dense = --active_count;

        // Swap-and-pop: Copy last particle's data to freed spot
        particle_data[dense_id] = particle_data[last_dense];

        // Update mappings for the moved particle
        uint32_t moved_sparse = dense_to_sparse[last_dense];
        sparse[moved_sparse].dense_index = dense_id;
        dense_to_sparse[dense_id] = moved_sparse;

        // Invalidate: bump the generation so outstanding handles go stale,
        // and mark the slot as free. (Returning handle.sparse_index to the
        // free-ID list used by get_free_sparse_id() is not shown.)
        sparse[handle.sparse_index].generation++;
        sparse[handle.sparse_index].dense_index = INVALID;  // e.g., UINT32_MAX
    }

    bool IsValid(ParticleHandle handle) {
        if (handle.sparse_index >= capacity) return false;
        return sparse[handle.sparse_index].generation == handle.generation &&
               sparse[handle.sparse_index].dense_index != INVALID;
    }

    Particle* Get(ParticleHandle handle) {
        if (!IsValid(handle)) return nullptr;
        uint32_t dense_id = sparse[handle.sparse_index].dense_index;
        return &particle_data[dense_id];
    }
};

The Performance Showdown: By the Numbers

Now that we’ve built our arsenal of pool designs, let’s put them to the test. We’ll simulate a particle system under stress, measuring three key phases: bulk allocation and freeing, sustained per-frame churn (a steady mix of frees and re-allocations), and iteration over all active particles.

These benchmarks were run with varying total allocations, free rates, and iteration counts to expose each design’s strengths and weaknesses. We’ll walk through each test set, highlighting what the numbers reveal about trade-offs like cache efficiency, allocation speed, and iteration bottlenecks.

Ran on an M4 Pro 12-core MacBook.

Set 1: High Churn (95% Free) - 10,000 Allocs, 9,500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames

This set models a sparse system with lots of frees (e.g., short-lived particles exploding on impact).

| Design | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
| --- | --- | --- | --- |
| Naive Pool | 12.99 | 35.91 | 2.79 |
| Free List Pool | 0.17 | 0.44 | 5.48 |
| Packed Array | 0.27 | 0.39 | 0.31 |
| Slot Map | 0.15 | 1.19 | 0.31 |

Insights:

- The naive pool never recycles, so both its alloc/free (12.99 ms) and churn (35.91 ms) costs dwarf everything else.
- The free list matches the packed designs on alloc/free but pays at iteration time (5.48 ms): only a sliver of its slots are alive, yet it must scan them all.
- The packed array and slot map iterate just the survivors, finishing in 0.31 ms.

Set 2: Low Churn (5% Free) - 10,000 Allocs, 500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames

Here, we have a dense system with few frees (e.g., long-lived entities like NPCs).

| Design | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
| --- | --- | --- | --- |
| Naive Pool | 12.94 | 613.37 | 6.51 |
| Free List Pool | 0.13 | 0.51 | 7.63 |
| Packed Array | 0.15 | 0.80 | 6.92 |
| Slot Map | 0.15 | 1.69 | 7.00 |

Insights:

- With only 5% freed, every design iterates a nearly dense array, so iteration times converge (6.51-7.63 ms).
- All recycling designs keep alloc/free around 0.15 ms; the naive pool remains the outlier, particularly during churn.

Set 3: High Alloc/Free Cost - 100,000 Allocs, 99,500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames

Scaling up to test raw alloc/free scalability (e.g., massive simulations).

| Design | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
| --- | --- | --- | --- |
| Naive Pool | 1244.46 | 34.84 | 25.13 |
| Free List Pool | 1.41 | 0.46 | 32.69 |
| Packed Array | 2.73 | 0.39 | 0.31 |
| Slot Map | 1.62 | 1.19 | 0.31 |

Insights:

- At 100,000 allocations, the naive pool’s alloc/free cost balloons to over 1.2 seconds.
- The free list’s iteration penalty grows with total capacity (32.69 ms), while the packed array and slot map still touch only active particles (0.31 ms).
- The slot map’s indirection keeps its churn (1.19 ms) above the packed array’s (0.39 ms), a small price for stable handles.

Set 4: High Iteration Cost - 10,000 Allocs, 9,500 Frees, 100,000 Iterations, 500 Churn/1,000 Frames

Cranking iterations to extremes (e.g., stress-testing update loops).

| Design | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
| --- | --- | --- | --- |
| Naive Pool | 12.78 | 35.44 | 273.18 |
| Free List Pool | 0.15 | 0.43 | 542.18 |
| Packed Array | 0.26 | 0.38 | 30.79 |
| Slot Map | 0.15 | 1.17 | 30.62 |

Insights:

- With iteration cranked to 100,000 passes, the packed designs (about 31 ms) beat the free list (542.18 ms) by more than 17x.
- Alloc/free and churn costs are essentially unchanged from Set 1; only the iteration column scales, which is exactly what the packed layout optimizes.

What We’ve Learned from the Numbers

The big lesson? There’s no universal “best”: it hinges on your priorities. High churn? Lean toward the free list for raw efficiency. Iteration-heavy? The packed array or slot map will transform your frame times. Need safety too? The slot map reigns supreme as the balanced powerhouse. Always benchmark in your real app; these tests are a starting point, and your data patterns might flip the script.

Conclusion: Mastering Memory

We have journeyed from the raw, untamed power of the arena to the precision of the specialist’s toolkit. In doing so, we have treated memory not as a commodity to be consumed, but as a medium to be sculpted. The free list, the packed array, and the slot map are not merely clever algorithms; they are expressions of a fundamental principle: the structure of your data and the structure of your problem must be in harmony.

This is the core of data-oriented design. It demands that we look past the convenient abstractions of general-purpose allocators and engage directly with the realities of the hardware: with cache lines, pointer indirection, and the relentless cost of a branch misprediction. The resulting systems are not just faster; their performance is more predictable, more scalable, and more robust, because their efficiency is an intrinsic property of their architecture, not an optimization applied as an afterthought. The true magic lies not in any single trick, but in the mastery of these underlying mechanics, allowing us to build solutions that are perfectly tailored to the task at hand.

