Memory Allocator Internals - TCMalloc and Size Classes
Deep dive into Go's memory allocator architecture based on TCMalloc, size classes, the tiny allocator, mcache, mcentral, and mheap hierarchies, with strategies for reducing allocation overhead.
Memory Allocator Internals
Go's memory allocator is a marvel of engineering. Understanding how it works reveals why allocations are so fast and how to write allocation-efficient code. The allocator is based on TCMalloc, Google's thread-caching malloc, adapted for Go's runtime.
High-Level Architecture: Three-Level Hierarchy
The allocator uses a three-level hierarchy to minimize lock contention:
┌─────────────────────────────────────────────────────────────┐
│ Allocation Request │
└────────────────┬────────────────────────────────────────────┘
│
▼
┌───────────────────────────┐
│ Is it a tiny allocation? │ (≤16 bytes, no pointers)
│ (size ≤ 16, no pointers) │
└──┬─────────────────────┬──┘
│ Yes │ No
│ │
▼ ▼
┌─────────────┐ ┌──────────────────┐
│ Tiny Alloc │ │ Size class? │
│ (packed) │ │ (67 classes) │
└────┬────────┘ └────────┬─────────┘
│ │
├──────────────────────┤
│ │
▼ ▼
┌──────────────────────────────────┐
│ Check P's mcache (lock-free) │ ← Per-processor
│ - Fast path, no synchronization │
└────┬─────────────────────────────┘
│ If available
│
▼
┌──────────────────────────────────┐
│ Allocate from mcache's mspan │ ← Fastest!
│ Return to caller │
└──────────────────────────────────┘
│ If mcache empty
│
▼
┌──────────────────────────────────┐
│ Check mcentral (locked) │ ← Per-size-class
│ - Requires lock (less frequent) │
└────┬─────────────────────────────┘
│ If available
│
▼
┌──────────────────────────────────┐
│ Allocate from mcentral's mspan │
│ Refill mcache, return to caller │
└──────────────────────────────────┘
│ If mcentral empty
│
▼
┌──────────────────────────────────┐
│ Check mheap (locked) │ ← Global
│ - Requires lock (rare) │
└────┬─────────────────────────────┘
│
▼
┌──────────────────────────────────┐
│ Allocate pages from heap │
│ Create new mspan │
│ Return to caller │
└──────────────────────────────────┘
│ If heap full
│
▼
┌──────────────────────────────────┐
│ Request memory from OS (mmap) │
│ Add to mheap │
└──────────────────────────────────┘

The key insight: Most allocations never hit a lock because they're satisfied from the per-P cache (mcache).
Size Classes: The 67 Categories
Go doesn't allocate exact sizes. Instead, each allocation is rounded up to the nearest size class. There are 67 size classes:
Class  Size       Class  Size       Class  Size       Class  Size
─────  ────       ─────  ────       ─────  ────       ─────  ────
  1      8B        18   256B         35  1408B         52  9472B
  2     16B        19   288B         36  1536B         53  9728B
  3     24B        20   320B         37  1792B         54 10240B
  4     32B        21   352B         38  2048B         55 10880B
  5     48B        22   384B         39  2304B         56 12288B
  6     64B        23   416B         40  2688B         57 13568B
  7     80B        24   448B         41  3072B         58 14336B
  8     96B        25   480B         42  3200B         59 16384B
  9    112B        26   512B         43  3456B         60 18432B
 10    128B        27   576B         44  4096B         61 19072B
 11    144B        28   640B         45  4864B         62 20480B
 12    160B        29   704B         46  5376B         63 21760B
 13    176B        30   768B         47  6144B         64 24576B
 14    192B        31   896B         48  6528B         65 27264B
 15    208B        32  1024B         49  6784B         66 28672B
 16    224B        33  1152B         50  6912B         67 32768B
 17    240B        34  1280B         51  8192B

(Values from runtime/sizeclasses.go in a recent Go release; exact classes vary slightly between versions. Class 0 is reserved for large objects, which bypass the table.) The gaps between classes grow gradually to balance metadata overhead against waste: the smallest classes step by 8 or 16 bytes, while larger classes take bigger jumps sized to keep worst-case internal waste bounded (roughly 12.5% for the larger classes).
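The rounding this table implies is easy to sketch. The class list below is copied from the runtime's sizeclasses.go (recent Go; exact values vary slightly by version) — this is an illustration, not the runtime's actual lookup, which uses precomputed index tables rather than a scan:

```go
package main

import "fmt"

// classSizes mirrors the runtime's size-class table (bytes per object).
var classSizes = []int{
	8, 16, 24, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208,
	224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640,
	704, 768, 896, 1024, 1152, 1280, 1408, 1536, 1792, 2048, 2304,
	2688, 3072, 3200, 3456, 4096, 4864, 5376, 6144, 6528, 6784, 6912,
	8192, 9472, 9728, 10240, 10880, 12288, 13568, 14336, 16384, 18432,
	19072, 20480, 21760, 24576, 27264, 28672, 32768,
}

// roundUp returns the size class an n-byte allocation lands in,
// or -1 for large allocations that bypass the size-class machinery.
func roundUp(n int) int {
	for _, c := range classSizes {
		if n <= c {
			return c
		}
	}
	return -1 // >32KB: allocated directly from mheap
}

func main() {
	for _, n := range []int{17, 33, 100, 1000} {
		c := roundUp(n)
		fmt.Printf("%5dB request → %5dB class (%.0f%% waste)\n",
			n, c, float64(c-n)/float64(c)*100)
	}
}
```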
The Tiny Allocator: Packing Small Objects
Objects ≤16 bytes without pointers are allocated specially. Multiple tiny allocations share a single 16-byte block:
Allocation 1: 4 bytes (int32)
Allocation 2: 2 bytes (uint16)
Allocation 3: 8 bytes (float64)
Packed into a single 16-byte block:
┌─────────┬─────────┬─────────┬──────────────────┐
│ Alloc 1 │ Alloc 2 │ Padding │ Alloc 3          │
│ 4 bytes │ 2 bytes │ unused  │ 8 bytes          │
│ [ int32][ uint16 ][........][     float64     ]│
└─────────┴─────────┴─────────┴──────────────────┘
0         4         6         8                 16

This is handled by dedicated logic in the allocator:
// Simplified sketch of the tiny-allocation path in mallocgc
// (see runtime/malloc.go; names abbreviated)
const maxTinySize = 16

// If tiny allocation
if size <= maxTinySize && noscan { // no pointers in the type
    // Try to pack into the current tiny block
    off := alignUp(c.tinyoffset, align)
    if off+size <= maxTinySize && c.tiny != 0 {
        // Fits in the current tiny block!
        x := unsafe.Pointer(c.tiny + off)
        c.tinyoffset = off + size
        c.tinyAllocs++
        return x
    }
    // Current block full: grab a fresh 16-byte block from the
    // tiny size class and start packing there
}

Example: Multiple Tiny Allocations
// Without pointers — packed!
a := new(bool) // 1 byte
b := new(int8) // 1 byte
c := new(uint16) // 2 bytes
// All three might share the same 16-byte block
// With pointers — not eligible for the tiny allocator
type MyString struct {
	s string // a string header contains a pointer
}
d := new(MyString) // Goes to a normal size class (even though it's only 16 bytes)

The tiny allocator can substantially reduce memory use in allocation-heavy workloads with many small pointer-free objects.
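The packing logic can be modeled as a toy bump allocator — a simplification of what mallocgc does with the tiny block, tracking only offsets and alignment, no real memory:

```go
package main

import "fmt"

const maxTinySize = 16

// tinyBlock models the per-P tiny block: just an offset into a
// 16-byte region.
type tinyBlock struct {
	off int
}

// alloc reserves size bytes at the given alignment inside the current
// block and returns the offset, or -1 when the runtime would have to
// start a fresh tiny block.
func (t *tinyBlock) alloc(size, align int) int {
	off := (t.off + align - 1) &^ (align - 1) // round offset up to alignment
	if off+size > maxTinySize {
		return -1 // doesn't fit; runtime grabs a new 16-byte block
	}
	t.off = off + size
	return off
}

func main() {
	var b tinyBlock
	fmt.Println(b.alloc(4, 4)) // int32   → offset 0
	fmt.Println(b.alloc(2, 2)) // uint16  → offset 4
	fmt.Println(b.alloc(8, 8)) // float64 → aligned up to offset 8
	fmt.Println(b.alloc(1, 1)) // block full → -1
}
```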
Size Classes in Practice: Understanding Waste
When you allocate, the runtime rounds your request up:
// Requested vs. actual size (the rounding is done by an internal
// runtime helper; shown here for illustration)
requested := 33              // Your allocation
actual := 48                 // Rounded up to the next size class
wasted := actual - requested // 15 bytes wasted!
percentWaste := float64(wasted) / float64(actual) * 100 // ~31% waste
// Real numbers:
//   17 bytes →   24 bytes (29% waste)
//   33 bytes →   48 bytes (31% waste)
//   65 bytes →   80 bytes (19% waste)
//  100 bytes →  112 bytes (11% waste)
// 1000 bytes → 1024 bytes (2% waste)

For many small allocations, this adds up:
// Benchmark: allocation waste
func BenchmarkAllocationWaste(b *testing.B) {
for i := 0; i < b.N; i++ {
// Each allocation wastes some bytes
s := make([]int32, 7) // 28 bytes → size class 32 (12% waste)
_ = s
}
}
// Over 1 million such allocations:
// 28MB requested → 32MB allocated (1M × 32B)
// 4MB wasted purely from rounding!

Strategy: Combine small allocations into larger structs when possible:
// Bad: Three separate allocations
a := new(int32) // 4 bytes → 8 bytes
b := new(int32) // 4 bytes → 8 bytes
c := new(int32) // 4 bytes → 8 bytes
// Total: 12 bytes requested, 24 bytes allocated (50% waste)
// Good: One allocation
type ABC struct {
a, b, c int32
}
x := new(ABC) // 12 → 16 bytes (25% waste)
// Total: 12 bytes requested, 16 bytes allocated

The mspan: Memory Organization
All memory in Go's heap is divided into units called mspans. Each mspan is a contiguous run of one or more 8KB pages, and every object slot in a given mspan belongs to the same size class:
mspan for 32-byte objects:
┌──────────────────────────────────────┐
│ One or more 8KB pages │
├──────────────────────────────────────┤
│ 32B │ 32B │ 32B │ 32B │ ... │ 32B│
│ slot │ slot │ slot │ slot │ │slot│
├──────┼──────┼──────┼──────┼─────┼────┤
│ Obj1 │ Obj2 │ Obj3 │ Obj4 │ ... │ObjN│ ← Same size
│(used)│(used)│(free)│(used)│ │free│
└──────┴──────┴──────┴──────┴─────┴────┘

Each mspan tracks:
- Base address
- Number of pages
- Size class (which determines object size)
- Allocation bitmap (which slots are used)
- Free list of available slots
// Simplified from runtime/mheap.go (field types differ slightly
// across Go versions)
type mspan struct {
	next, prev *mspan     // doubly-linked list
	list       *mSpanList // which list it is on
	startAddr  uintptr    // address of first page
	npages     uintptr    // number of pages
	freeindex  uint16     // where the next free-slot search starts
	nelems     uint16     // number of object slots
	allocBits  *gcBits    // bitmap: 1=allocated, 0=free
	gcmarkBits *gcBits    // GC mark bitmap
	spanclass  spanClass  // size class (plus a noscan bit)
	elemsize   uintptr    // bytes per object
	allocCount uint16     // number of allocated objects
}

Allocation Path in Detail
Here's what happens when you call make([]byte, 100) or new(MyStruct):
// Heavily simplified sketch of runtime.mallocgc — the real code lives
// in runtime/malloc.go, and helper names here are illustrative.
func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
	// Step 1: Pin the current M and get its P's mcache
	mp := acquirem()
	c := mp.p.ptr().mcache

	// Step 2: Tiny allocation?
	if size <= maxTinySize && (typ == nil || !typ.hasPointers()) {
		return allocTiny(c, size) // pack into the shared tiny block
	}

	// Step 3: Determine the size class
	if size > maxSmallSize {
		return allocLarge(c, size, needzero) // >32KB: straight to mheap
	}
	sizeclass := sizeToClass(size)
	size = class_to_size[sizeclass]

	// Step 4: Try to allocate from the mcache (lock-free!)
	span := c.alloc[sizeclass] // cached mspan for this size class
	if v := span.nextFreeSlot(); v != nil {
		return v // SUCCESS! No locks needed
	}

	// Step 5: mcache span full — refill from mcentral (locked)
	span = mheap_.central[sizeclass].mcentral.cacheSpan()
	if span == nil {
		// Step 6: mcentral empty — allocate pages from mheap (locked),
		// which may in turn request new memory from the OS
		span = mheap_.allocSpan(sizeclass)
	}

	// Step 7: Install the fresh span in the mcache and allocate from it
	c.alloc[sizeclass] = span
	return span.nextFreeSlot()
}

mcache: Per-Processor Cache
Each P has its own mcache with one mspan for each size class. This cache is lock-free because only one M+P pair ever accesses it:
mcache for P:
┌─────────────────────────────────────────────────┐
│ Per-P Memory Cache (mcache) │
├─────────────────────────────────────────────────┤
│ [size 8] ─→ mspan (8-byte objects) │
│ [size 16] ─→ mspan (16-byte objects) │
│ [size 32] ─→ mspan (32-byte objects) │
│ [size 48] ─→ mspan (48-byte objects) │
│ ... │
│ [size 32K] ─→ mspan (32KB objects) │
│ [tiny] ─→ mspan (packed tiny blocks) │
├─────────────────────────────────────────────────┤
│ No locks needed! This P owns this cache. │
└─────────────────────────────────────────────────┘

The lock-free nature makes allocations from the mcache extremely fast — typically 10-50 nanoseconds.
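This fast path is easy to observe with the testing package used programmatically. A self-contained measurement — the package-level sink forces the allocation to escape to the heap so the compiler can't stack-allocate or elide it (numbers vary by machine):

```go
package main

import (
	"fmt"
	"testing"
)

// sink forces allocations to escape to the heap and keeps the
// compiler from optimizing them away.
var sink []byte

// measure benchmarks one 32-byte heap allocation per iteration.
func measure() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink = make([]byte, 32) // served from the mcache fast path
		}
	})
}

func main() {
	res := measure()
	fmt.Println(res.String(), res.MemString())
}
```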
mcentral: Per-Size-Class Store
When an mcache is empty, it's refilled from the mcentral, which is protected by a lock:
mcentral[sizeclass 32]:
┌──────────────────────────────────┐
│ Lock: &mcentral[32].lock │
├──────────────────────────────────┤
│ nonempty (mspan list): │
│ └─ [mspan] ─ [mspan] ─ [mspan] │
│ (each has free slots) │
├──────────────────────────────────┤
│ empty (mspan list): │
│ └─ [mspan] ─ [mspan] ─ [mspan] │
│ (all slots allocated) │
└──────────────────────────────────┘
When P's mcache[32] is empty:
1. Lock mcentral[32].lock
2. Take an mspan from nonempty list
3. Move mspan to P's mcache[32]
4. Unlock

Because refills happen in batches (a single mspan can hold hundreds of free slots), the lock is held briefly and taken infrequently.
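The batching idea — take a whole span's worth of slots under one lock acquisition, then serve many allocations lock-free — can be sketched with a toy cache. The types here are illustrative stand-ins, not runtime API:

```go
package main

import (
	"fmt"
	"sync"
)

// central models mcentral: a locked source of free slots.
type central struct {
	mu       sync.Mutex
	refills  int // how many times the lock was taken
	nextSlot int
}

// refill hands out a batch of slot numbers under the lock, like moving
// a whole mspan into an mcache.
func (c *central) refill(batch int) []int {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.refills++
	slots := make([]int, batch)
	for i := range slots {
		slots[i] = c.nextSlot
		c.nextSlot++
	}
	return slots
}

// cache models an mcache: allocations come from the local batch,
// with no locking at all on the fast path.
type cache struct {
	src   *central
	slots []int
}

func (l *cache) alloc() int {
	if len(l.slots) == 0 {
		l.slots = l.src.refill(64) // one locked refill serves 64 allocations
	}
	s := l.slots[0]
	l.slots = l.slots[1:]
	return s
}

func main() {
	c := &central{}
	l := &cache{src: c}
	for i := 0; i < 1000; i++ {
		l.alloc()
	}
	fmt.Printf("1000 allocations, %d lock acquisitions\n", c.refills)
}
```

With a batch of 64, a thousand allocations touch the lock only 16 times — the same amortization the runtime gets from moving whole mspans.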
mheap: Global Memory Manager
The mheap is the global memory manager. It tracks all pages in the heap and allocates to mcentral when needed:
mheap:
┌─────────────────────────────────────────┐
│ Lock: &mheap.lock (rarely taken) │
├─────────────────────────────────────────┤
│ pages[addr] = mspan mapping │
│ (which mspan does each page belong to?) │
├─────────────────────────────────────────┤
│ free (set of free mspans) │
│ free[size 1] → [mspan] ─ [mspan] ─ ... │
│ free[size 2] → [mspan] ─ [mspan] ─ ... │
│ free[size 4] → [mspan] ─ [mspan] ─ ... │
│ ... │
├─────────────────────────────────────────┤
│ mspans (all mspans in heap) │
├─────────────────────────────────────────┤
│ arenaHints (where to ask OS for memory) │
└─────────────────────────────────────────┘

When mcentral needs pages:
- Lock mheap
- Find mspan with N free pages (or allocate new from OS)
- Unlock
- Return to mcentral
For large allocations (>32KB), allocation goes directly to mheap, bypassing mcentral.
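The large-object path shows up in benchmark numbers: requests above 32KB are rounded to whole 8KB pages rather than to a size class, so B/op exceeds the requested size. A sketch (exact rounding can vary by Go version):

```go
package main

import (
	"fmt"
	"testing"
)

var big []byte // escape sink so the allocation is not optimized away

// largeAllocBytes reports the bytes actually allocated per op for a
// 33000-byte request, which the runtime rounds up to whole pages.
func largeAllocBytes() int64 {
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			big = make([]byte, 33000) // >32KB: handled by mheap directly
		}
	})
	return res.AllocedBytesPerOp()
}

func main() {
	fmt.Printf("33000-byte request → %d bytes allocated per op\n",
		largeAllocBytes())
}
```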
Allocation Request Path: Complete Example
func example() {
// Example: allocate []int32 with 7 elements
s := make([]int32, 7)
}
// Runtime allocation sequence:
// 1. size = 7 * 4 = 28 bytes
// 2. Round up to size class: 32 bytes
// 3. Check P's mcache[sizeclass for 32B]
// 4. If mspan has free slots:
// - Mark slot as used
// - Return pointer
// - DONE (no locks, ~10-50ns)
//
// 5. If mcache mspan is empty:
// - Lock mcentral[sizeclass 32]
// - Pop mspan from nonempty list
// - Move to mcache
// - Unlock
// - Allocate from new mspan
// - Return pointer
// - (~100-500ns)
//
// 6. If mcentral has no mspans:
// - Lock mheap
// - Allocate pages from mheap
// - Create new mspan
// - Unlock
// - Give to mcentral
// - Try again from step 5
// - (~1-10μs)
//
// 7. If mheap has no free pages:
// - Call OS (mmap on Linux)
// - Add pages to mheap
// - Try again from step 6
//    - (~10-100μs)

Memory Returned to OS: Scavenging
Go doesn't hold onto freed memory forever. The background scavenger runs periodically and returns unused memory to the OS (via madvise(MADV_DONTNEED) on Linux):
// Forcing memory return to OS
import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func forceScavenge() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapReleased: %v bytes (returned to OS)\n", m.HeapReleased)

	// Force a garbage collection followed by an immediate scavenge
	debug.FreeOSMemory()

	runtime.ReadMemStats(&m)
	fmt.Printf("HeapReleased: %v bytes (after forced scavenge)\n", m.HeapReleased)
}

The scavenger normally runs automatically in the background; debug.FreeOSMemory forces a GC followed by an immediate scavenge. Note that HeapSys rarely shrinks — the virtual mapping is kept — so released memory shows up in HeapReleased instead.
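To see scavenging actually reclaim something, allocate and drop a large buffer first. A self-contained sketch — the 64MB figure is arbitrary, and exact amounts returned are platform- and version-dependent:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// releasedDelta allocates and drops a 64MB buffer, forces a GC plus
// scavenge, and reports how much additional heap memory was returned
// to the OS.
func releasedDelta() uint64 {
	buf := make([]byte, 64<<20)
	for i := range buf {
		buf[i] = byte(i) // touch every page so it is actually backed
	}
	var before runtime.MemStats
	runtime.ReadMemStats(&before)

	buf = nil            // drop the only reference
	debug.FreeOSMemory() // GC + immediate scavenge

	var after runtime.MemStats
	runtime.ReadMemStats(&after)
	return after.HeapReleased - before.HeapReleased
}

func main() {
	fmt.Printf("returned to OS: %d MB\n", releasedDelta()>>20)
}
```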
Arena Hints: Contiguous Heap
Go requests memory from the OS in large contiguous blocks called arenas (64MB on most 64-bit platforms); a list of arena hints records where to ask for the next one. This provides better spatial locality:
// Linux: mmap is called with a hint address (not MAP_FIXED), and the
// runtime checks whether the OS honored the hint.
// Requesting contiguous regions like:
// [base: base+64MB) ─ [base+64MB: base+128MB) ─ ...
//
// Benefits:
// 1. Better TLB efficiency (fewer, larger mappings)
// 2. Fast address-to-metadata lookups with a known layout
// 3. Less virtual address space fragmentation

Size-Class Aware Optimization: Practical Code
Understanding size classes allows you to optimize allocations:
// Example 1: Combining allocations
// Bad:
a := new(int32) // 4 → 8 bytes
b := new(int32) // 4 → 8 bytes
c := new(int32) // 4 → 8 bytes
// Total allocated: 24 bytes
// Good:
type Point struct {
x, y, z int32
}
p := new(Point) // 12 → 16 bytes
// Total allocated: 16 bytes
// Example 2: Avoiding fragmentation
// Bad: Variable-sized allocations cause fragmentation
sizes := []int{7, 15, 33, 65, 129} // Fits in classes: 8, 16, 48, 80, 144
// Lots of wasted space across classes
// Good: Use fixed-size chunks
const chunkSize = 32
chunks := make([]byte, chunkSize) // Efficient, predictable
// Example 3: Pointer vs non-pointer impact
// GC must scan any type containing pointers
// So minimize them in hot data structures
type BadLayout struct {
data *[]byte // Requires scanning
metadata int64
}
type GoodLayout struct {
	data []byte // One less indirection; the slice header's pointer is still scanned
	metadata int64
}

Reading MemStats
The runtime.MemStats struct reveals allocator behavior:
func analyzeMemory() {
var m runtime.MemStats
runtime.ReadMemStats(&m)
fmt.Printf("HeapAlloc: %v B\n", m.HeapAlloc) // Current allocs
fmt.Printf("HeapSys: %v B\n", m.HeapSys) // Total from OS
fmt.Printf("HeapIdle: %v B\n", m.HeapIdle) // Free in heap
fmt.Printf("HeapInuse: %v B\n", m.HeapInuse) // In use
fmt.Printf("HeapReleased: %v B\n", m.HeapReleased)// Returned to OS
fmt.Printf("StackInuse: %v B\n", m.StackInuse) // Goroutine stacks
fmt.Printf("MSpanInuse: %v B\n", m.MSpanInuse) // mspan metadata
fmt.Printf("MCacheInuse: %v B\n", m.MCacheInuse) // mcache metadata
fmt.Printf("\nAllocation stats:\n")
fmt.Printf("Mallocs: %v\n", m.Mallocs) // Total mallocs
fmt.Printf("Frees: %v\n", m.Frees) // Total frees
fmt.Printf("LiveObjects: %v\n", m.Mallocs - m.Frees)
}

Benchmarking Allocation Performance
func BenchmarkAllocationSizes(b *testing.B) {
tests := []struct {
name string
size int
}{
{"tiny_8", 8},
{"small_32", 32},
{"small_256", 256},
{"large_1K", 1024},
{"large_10K", 10 * 1024},
{"large_1M", 1024 * 1024},
}
for _, tt := range tests {
b.Run(tt.name, func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
s := make([]byte, tt.size)
_ = s[0] // Use s so the allocation isn't optimized away
}
})
}
}
// Sample output (on modern hardware):
// BenchmarkAllocationSizes/tiny_8-8 30000000 35.2 ns/op 8 B/op 1 allocs/op
// BenchmarkAllocationSizes/small_32-8 20000000 52.1 ns/op 32 B/op 1 allocs/op
// BenchmarkAllocationSizes/small_256-8 10000000 103.4 ns/op 256 B/op 1 allocs/op
// BenchmarkAllocationSizes/large_1K-8 5000000 241.7 ns/op 1024 B/op 1 allocs/op
// BenchmarkAllocationSizes/large_10K-8 1000000 1145.3 ns/op 10240 B/op 1 allocs/op
// BenchmarkAllocationSizes/large_1M-8 10000 105843.0 ns/op 1048576 B/op 1 allocs/op

Notice:
- Tiny allocations are fastest (~30ns)
- Small allocations scale linearly until ~10KB
- Very large allocations cost ~100+μs, dominated by page management and zeroing (and sometimes fresh memory from the OS)
Allocation Patterns That Scale Well
// Pattern 1: Object pool (reuse allocations)
type objPool struct {
available []*largeObject
mu sync.Mutex
}
func (p *objPool) Get() *largeObject {
p.mu.Lock()
if len(p.available) > 0 {
obj := p.available[len(p.available)-1]
p.available = p.available[:len(p.available)-1]
p.mu.Unlock()
return obj
}
p.mu.Unlock()
return &largeObject{} // Allocate new
}
func (p *objPool) Put(obj *largeObject) {
p.mu.Lock()
p.available = append(p.available, obj)
p.mu.Unlock()
}
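An aside on Pattern 1: the standard library's sync.Pool implements the same reuse idea with GC cooperation — pooled objects can be reclaimed during collections, so the pool never pins memory forever. A self-contained sketch (largeObject here is a stand-in type):

```go
package main

import (
	"fmt"
	"sync"
)

type largeObject struct {
	buf [1 << 16]byte
}

// pool replaces the hand-rolled objPool: New is called only when the
// pool has nothing to hand out, and idle pooled objects may be
// reclaimed by the GC between collections.
var pool = sync.Pool{
	New: func() any { return new(largeObject) },
}

func process() {
	obj := pool.Get().(*largeObject)
	defer pool.Put(obj) // return for reuse instead of becoming garbage
	obj.buf[0] = 1
}

func main() {
	for i := 0; i < 4; i++ {
		process()
	}
	fmt.Println("done")
}
```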
// Pattern 2: Preallocate slices
data := make([]Item, 0, expectedCapacity)
for _, item := range input {
data = append(data, item) // No allocation if cap sufficient
}
// Pattern 3: Avoid pointer fields in high-frequency types
type Event struct {
// GOOD: Value field
data [256]byte
// BAD: Pointer field (requires GC scanning)
// ptr *[256]byte
}

Summary
Go's memory allocator earns its speed through a few key design choices:
- Lock-free fast path: Most allocations hit the mcache (10-50ns)
- Size classes: Reduce fragmentation and bound per-object waste (roughly 12.5% for the larger classes)
- Hierarchical architecture: Only takes locks when necessary
- Tiny allocator: Packs small objects, saving memory
- Per-P caching: No contention between goroutines on different Ps
Understanding these internals lets you:
- Reduce allocation overhead in hot paths
- Minimize memory waste from rounding
- Design data structures that play well with size classes
- Know when to use object pools vs. allocation
- Interpret MemStats and identify allocation bottlenecks
The allocator's design ensures that in well-written Go code, allocation is never the bottleneck — but in poorly-written code, it can be. Profile your code and allocate wisely.