Memory Allocator Internals - TCMalloc and Size Classes
Deep dive into Go's memory allocator architecture based on TCMalloc, size classes, the tiny allocator, mcache, mcentral, and mheap hierarchies, with strategies for reducing allocation overhead.
Memory Allocator Internals
Go's memory allocator is a marvel of engineering. Understanding how it works reveals why allocations are so fast and how to write allocation-efficient code. The allocator is based on TCMalloc, Google's thread-caching malloc, adapted for Go's runtime.
High-Level Architecture: Three-Level Hierarchy
The allocator uses a three-level hierarchy to minimize lock contention:
┌─────────────────────────────────────────────────────────────┐
│ Allocation Request │
└────────────────┬────────────────────────────────────────────┘
│
▼
┌───────────────────────────┐
│ Is it a tiny allocation? │ (≤16 bytes, no pointers)
│ (size ≤ 16, no pointers) │
└──┬─────────────────────┬──┘
│ Yes │ No
│ │
▼ ▼
┌─────────────┐ ┌──────────────────┐
│ Tiny Alloc │ │ Size class? │
│ (packed) │ │ (67 classes) │
└────┬────────┘ └────────┬─────────┘
│ │
├──────────────────────┤
│ │
▼ ▼
┌──────────────────────────────────┐
│ Check P's mcache (lock-free) │ ← Per-processor
│ - Fast path, no synchronization │
└────┬─────────────────────────────┘
│ If available
│
▼
┌──────────────────────────────────┐
│ Allocate from mcache's mspan │ ← Fastest!
│ Return to caller │
└──────────────────────────────────┘
│ If mcache empty
│
▼
┌──────────────────────────────────┐
│ Check mcentral (locked) │ ← Per-size-class
│ - Requires lock (less frequent) │
└────┬─────────────────────────────┘
│ If available
│
▼
┌──────────────────────────────────┐
│ Allocate from mcentral's mspan │
│ Refill mcache, return to caller │
└──────────────────────────────────┘
│ If mcentral empty
│
▼
┌──────────────────────────────────┐
│ Check mheap (locked) │ ← Global
│ - Requires lock (rare) │
└────┬─────────────────────────────┘
│
▼
┌──────────────────────────────────┐
│ Allocate pages from heap │
│ Create new mspan │
│ Return to caller │
└──────────────────────────────────┘
│ If heap full
│
▼
┌──────────────────────────────────┐
│ Request memory from OS (mmap) │
│ Add to mheap │
└──────────────────────────────────┘

The key insight: Most allocations never hit a lock because they're satisfied from the per-P cache (mcache).
Size Classes: The 67 Categories
Go doesn't allocate exact sizes. Instead, each allocation is rounded up to the nearest size class. There are 67 size classes:
Class  Size       Class  Size       Class  Size       Class  Size
─────  ────       ─────  ────       ─────  ────       ─────  ────
  1      8B        18   256B         35  1408B         52  9472B
  2     16B        19   288B         36  1536B         53  9728B
  3     24B        20   320B         37  1792B         54 10240B
  4     32B        21   352B         38  2048B         55 10880B
  5     48B        22   384B         39  2304B         56 12288B
  6     64B        23   416B         40  2688B         57 13568B
  7     80B        24   448B         41  3072B         58 14336B
  8     96B        25   480B         42  3200B         59 16384B
  9    112B        26   512B         43  3456B         60 18432B
 10    128B        27   576B         44  4096B         61 19072B
 11    144B        28   640B         45  4864B         62 20480B
 12    160B        29   704B         46  5376B         63 21760B
 13    176B        30   768B         47  6144B         64 24576B
 14    192B        31   896B         48  6528B         65 27264B
 15    208B        32  1024B         49  6784B         66 28672B
 16    224B        33  1152B         50  6912B         67 32768B
 17    240B        34  1280B         51  8192B

(Values from runtime/sizeclasses.go in a recent Go release; exact classes vary slightly between versions. Class 0 is reserved for large objects, which bypass the table.) The gaps between classes grow gradually to balance metadata overhead against waste: the smallest classes step by 8 or 16 bytes, while larger classes take bigger jumps sized to keep worst-case internal waste bounded (roughly 12.5% for the larger classes).
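The rounding this table implies is easy to sketch. The class list below is copied from the runtime's sizeclasses.go (recent Go; exact values vary slightly by version) — this is an illustration, not the runtime's actual lookup, which uses precomputed index tables rather than a scan:

```go
package main

import "fmt"

// classSizes mirrors the runtime's size-class table (bytes per object).
var classSizes = []int{
	8, 16, 24, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208,
	224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640,
	704, 768, 896, 1024, 1152, 1280, 1408, 1536, 1792, 2048, 2304,
	2688, 3072, 3200, 3456, 4096, 4864, 5376, 6144, 6528, 6784, 6912,
	8192, 9472, 9728, 10240, 10880, 12288, 13568, 14336, 16384, 18432,
	19072, 20480, 21760, 24576, 27264, 28672, 32768,
}

// roundUp returns the size class an n-byte allocation lands in,
// or -1 for large allocations that bypass the size-class machinery.
func roundUp(n int) int {
	for _, c := range classSizes {
		if n <= c {
			return c
		}
	}
	return -1 // >32KB: allocated directly from mheap
}

func main() {
	for _, n := range []int{17, 33, 100, 1000} {
		c := roundUp(n)
		fmt.Printf("%5dB request → %5dB class (%.0f%% waste)\n",
			n, c, float64(c-n)/float64(c)*100)
	}
}
```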
The Tiny Allocator: Packing Small Objects
Objects ≤16 bytes without pointers are allocated specially. Multiple tiny allocations share a single 16-byte block:
Allocation 1: 4 bytes (int32)
Allocation 2: 2 bytes (uint16)
Allocation 3: 8 bytes (float64)
Packed into a single 16-byte block:
┌─────────┬─────────┬─────────┬──────────────────┐
│ Alloc 1 │ Alloc 2 │ Padding │ Alloc 3          │
│ 4 bytes │ 2 bytes │ unused  │ 8 bytes          │
│ [ int32][ uint16 ][........][     float64     ]│
└─────────┴─────────┴─────────┴──────────────────┘
0         4         6         8                 16

This is handled by dedicated logic in the allocator:
// Simplified sketch of the tiny-allocation path in mallocgc
// (see runtime/malloc.go; names abbreviated)
const maxTinySize = 16

// If tiny allocation
if size <= maxTinySize && noscan { // no pointers in the type
    // Try to pack into the current tiny block
    off := alignUp(c.tinyoffset, align)
    if off+size <= maxTinySize && c.tiny != 0 {
        // Fits in the current tiny block!
        x := unsafe.Pointer(c.tiny + off)
        c.tinyoffset = off + size
        c.tinyAllocs++
        return x
    }
    // Current block full: grab a fresh 16-byte block from the
    // tiny size class and start packing there
}

Example: Multiple Tiny Allocations
// Without pointers — packed!
a := new(bool) // 1 byte
b := new(int8) // 1 byte
c := new(uint16) // 2 bytes
// All three might share the same 16-byte block
// With pointers — not eligible for the tiny allocator
type MyString struct {
	s string // a string header contains a pointer
}
d := new(MyString) // Goes to a normal size class (even though it's only 16 bytes)

The tiny allocator can substantially reduce memory use in allocation-heavy workloads with many small pointer-free objects.
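The packing logic can be modeled as a toy bump allocator — a simplification of what mallocgc does with the tiny block, tracking only offsets and alignment, no real memory:

```go
package main

import "fmt"

const maxTinySize = 16

// tinyBlock models the per-P tiny block: just an offset into a
// 16-byte region.
type tinyBlock struct {
	off int
}

// alloc reserves size bytes at the given alignment inside the current
// block and returns the offset, or -1 when the runtime would have to
// start a fresh tiny block.
func (t *tinyBlock) alloc(size, align int) int {
	off := (t.off + align - 1) &^ (align - 1) // round offset up to alignment
	if off+size > maxTinySize {
		return -1 // doesn't fit; runtime grabs a new 16-byte block
	}
	t.off = off + size
	return off
}

func main() {
	var b tinyBlock
	fmt.Println(b.alloc(4, 4)) // int32   → offset 0
	fmt.Println(b.alloc(2, 2)) // uint16  → offset 4
	fmt.Println(b.alloc(8, 8)) // float64 → aligned up to offset 8
	fmt.Println(b.alloc(1, 1)) // block full → -1
}
```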
Size Classes in Practice: Understanding Waste
When you allocate, the runtime rounds your request up:
// Requested vs. actual size (the rounding is done by an internal
// runtime helper; shown here for illustration)
requested := 33              // Your allocation
actual := 48                 // Rounded up to the next size class
wasted := actual - requested // 15 bytes wasted!
percentWaste := float64(wasted) / float64(actual) * 100 // ~31% waste
// Real numbers:
//   17 bytes →   24 bytes (29% waste)
//   33 bytes →   48 bytes (31% waste)
//   65 bytes →   80 bytes (19% waste)
//  100 bytes →  112 bytes (11% waste)
// 1000 bytes → 1024 bytes (2% waste)

For many small allocations, this adds up:
// Benchmark: allocation waste
func BenchmarkAllocationWaste(b *testing.B) {
for i := 0; i < b.N; i++ {
// Each allocation wastes some bytes
s := make([]int32, 7) // 28 bytes → size class 32 (12% waste)
_ = s
}
}
// Over 1 million such allocations:
// 28MB requested → 32MB allocated (1M × 32B)
// 4MB wasted purely from rounding!

Strategy: Combine small allocations into larger structs when possible:
// Bad: Three separate allocations
a := new(int32) // 4 bytes → 8 bytes
b := new(int32) // 4 bytes → 8 bytes
c := new(int32) // 4 bytes → 8 bytes
// Total: 12 bytes requested, 24 bytes allocated (50% waste)
// Good: One allocation
type ABC struct {
a, b, c int32
}
x := new(ABC) // 12 → 16 bytes (25% waste)
// Total: 12 bytes requested, 16 bytes allocated

The mspan: Memory Organization
All memory in Go's heap is divided into units called mspans. Each mspan is a contiguous run of one or more 8KB pages, and every object slot in a given mspan belongs to the same size class:
mspan for 32-byte objects:
┌──────────────────────────────────────┐
│ One or more 8KB pages │
├──────────────────────────────────────┤
│ 32B │ 32B │ 32B │ 32B │ ... │ 32B│
│ slot │ slot │ slot │ slot │ │slot│
├──────┼──────┼──────┼──────┼─────┼────┤
│ Obj1 │ Obj2 │ Obj3 │ Obj4 │ ... │ObjN│ ← Same size
│(used)│(used)│(free)│(used)│ │free│
└──────┴──────┴──────┴──────┴─────┴────┘

Each mspan tracks:
- Base address
- Number of pages
- Size class (which determines object size)
- Allocation bitmap (which slots are used)
- Free list of available slots
// Simplified from runtime/mheap.go (field types differ slightly
// across Go versions)
type mspan struct {
	next, prev *mspan     // doubly-linked list
	list       *mSpanList // which list it is on
	startAddr  uintptr    // address of first page
	npages     uintptr    // number of pages
	freeindex  uint16     // where the next free-slot search starts
	nelems     uint16     // number of object slots
	allocBits  *gcBits    // bitmap: 1=allocated, 0=free
	gcmarkBits *gcBits    // GC mark bitmap
	spanclass  spanClass  // size class (plus a noscan bit)
	elemsize   uintptr    // bytes per object
	allocCount uint16     // number of allocated objects
}

Allocation Path in Detail
Here's what happens when you call make([]byte, 100) or new(MyStruct):
// Heavily simplified sketch of runtime.mallocgc — the real code lives
// in runtime/malloc.go, and helper names here are illustrative.
func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
	// Step 1: Pin the current M and get its P's mcache
	mp := acquirem()
	c := mp.p.ptr().mcache

	// Step 2: Tiny allocation?
	if size <= maxTinySize && (typ == nil || !typ.hasPointers()) {
		return allocTiny(c, size) // pack into the shared tiny block
	}

	// Step 3: Determine the size class
	if size > maxSmallSize {
		return allocLarge(c, size, needzero) // >32KB: straight to mheap
	}
	sizeclass := sizeToClass(size)
	size = class_to_size[sizeclass]

	// Step 4: Try to allocate from the mcache (lock-free!)
	span := c.alloc[sizeclass] // cached mspan for this size class
	if v := span.nextFreeSlot(); v != nil {
		return v // SUCCESS! No locks needed
	}

	// Step 5: mcache span full — refill from mcentral (locked)
	span = mheap_.central[sizeclass].mcentral.cacheSpan()
	if span == nil {
		// Step 6: mcentral empty — allocate pages from mheap (locked),
		// which may in turn request new memory from the OS
		span = mheap_.allocSpan(sizeclass)
	}

	// Step 7: Install the fresh span in the mcache and allocate from it
	c.alloc[sizeclass] = span
	return span.nextFreeSlot()
}

mcache: Per-Processor Cache
Each P has its own mcache with one mspan for each size class. This cache is lock-free because only one M+P pair ever accesses it:
mcache for P:
┌─────────────────────────────────────────────────┐
│ Per-P Memory Cache (mcache) │
├─────────────────────────────────────────────────┤
│ [size 8] ─→ mspan (8-byte objects) │
│ [size 16] ─→ mspan (16-byte objects) │
│ [size 32] ─→ mspan (32-byte objects) │
│ [size 48] ─→ mspan (48-byte objects) │
│ ... │
│ [size 32K] ─→ mspan (32KB objects) │
│ [tiny] ─→ mspan (packed tiny blocks) │
├─────────────────────────────────────────────────┤
│ No locks needed! This P owns this cache. │
└─────────────────────────────────────────────────┘

The lock-free nature makes allocations from the mcache extremely fast — typically 10-50 nanoseconds.
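This fast path is easy to observe with the testing package used programmatically. A self-contained measurement — the package-level sink forces the allocation to escape to the heap so the compiler can't stack-allocate or elide it (numbers vary by machine):

```go
package main

import (
	"fmt"
	"testing"
)

// sink forces allocations to escape to the heap and keeps the
// compiler from optimizing them away.
var sink []byte

// measure benchmarks one 32-byte heap allocation per iteration.
func measure() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink = make([]byte, 32) // served from the mcache fast path
		}
	})
}

func main() {
	res := measure()
	fmt.Println(res.String(), res.MemString())
}
```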
mcentral: Per-Size-Class Store
When an mcache is empty, it's refilled from the mcentral, which is protected by a lock:
mcentral[sizeclass 32]:
┌──────────────────────────────────┐
│ Lock: &mcentral[32].lock │
├──────────────────────────────────┤
│ nonempty (mspan list): │
│ └─ [mspan] ─ [mspan] ─ [mspan] │
│ (each has free slots) │
├──────────────────────────────────┤
│ empty (mspan list): │
│ └─ [mspan] ─ [mspan] ─ [mspan] │
│ (all slots allocated) │
└──────────────────────────────────┘
When P's mcache[32] is empty:
1. Lock mcentral[32].lock
2. Take an mspan from nonempty list
3. Move mspan to P's mcache[32]
4. Unlock

Because refills happen in batches (a single mspan can hold hundreds of free slots), the lock is held briefly and taken infrequently.
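The batching idea — take a whole span's worth of slots under one lock acquisition, then serve many allocations lock-free — can be sketched with a toy cache. The types here are illustrative stand-ins, not runtime API:

```go
package main

import (
	"fmt"
	"sync"
)

// central models mcentral: a locked source of free slots.
type central struct {
	mu       sync.Mutex
	refills  int // how many times the lock was taken
	nextSlot int
}

// refill hands out a batch of slot numbers under the lock, like moving
// a whole mspan into an mcache.
func (c *central) refill(batch int) []int {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.refills++
	slots := make([]int, batch)
	for i := range slots {
		slots[i] = c.nextSlot
		c.nextSlot++
	}
	return slots
}

// cache models an mcache: allocations come from the local batch,
// with no locking at all on the fast path.
type cache struct {
	src   *central
	slots []int
}

func (l *cache) alloc() int {
	if len(l.slots) == 0 {
		l.slots = l.src.refill(64) // one locked refill serves 64 allocations
	}
	s := l.slots[0]
	l.slots = l.slots[1:]
	return s
}

func main() {
	c := &central{}
	l := &cache{src: c}
	for i := 0; i < 1000; i++ {
		l.alloc()
	}
	fmt.Printf("1000 allocations, %d lock acquisitions\n", c.refills)
}
```

With a batch of 64, a thousand allocations touch the lock only 16 times — the same amortization the runtime gets from moving whole mspans.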
mheap: Global Memory Manager
The mheap is the global memory manager. It tracks all pages in the heap and allocates to mcentral when needed:
mheap:
┌─────────────────────────────────────────┐
│ Lock: &mheap.lock (rarely taken) │
├─────────────────────────────────────────┤
│ pages[addr] = mspan mapping │
│ (which mspan does each page belong to?) │
├─────────────────────────────────────────┤
│ free (set of free mspans) │
│ free[size 1] → [mspan] ─ [mspan] ─ ... │
│ free[size 2] → [mspan] ─ [mspan] ─ ... │
│ free[size 4] → [mspan] ─ [mspan] ─ ... │
│ ... │
├─────────────────────────────────────────┤
│ mspans (all mspans in heap) │
├─────────────────────────────────────────┤
│ arenaHints (where to ask OS for memory) │
└─────────────────────────────────────────┘

When mcentral needs pages:
- Lock mheap
- Find mspan with N free pages (or allocate new from OS)
- Unlock
- Return to mcentral
For large allocations (>32KB), allocation goes directly to mheap, bypassing mcentral.
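The large-object path shows up in benchmark numbers: requests above 32KB are rounded to whole 8KB pages rather than to a size class, so B/op exceeds the requested size. A sketch (exact rounding can vary by Go version):

```go
package main

import (
	"fmt"
	"testing"
)

var big []byte // escape sink so the allocation is not optimized away

// largeAllocBytes reports the bytes actually allocated per op for a
// 33000-byte request, which the runtime rounds up to whole pages.
func largeAllocBytes() int64 {
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			big = make([]byte, 33000) // >32KB: handled by mheap directly
		}
	})
	return res.AllocedBytesPerOp()
}

func main() {
	fmt.Printf("33000-byte request → %d bytes allocated per op\n",
		largeAllocBytes())
}
```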
Allocation Request Path: Complete Example
func example() {
// Example: allocate []int32 with 7 elements
s := make([]int32, 7)
}
// Runtime allocation sequence:
// 1. size = 7 * 4 = 28 bytes
// 2. Round up to size class: 32 bytes
// 3. Check P's mcache[sizeclass for 32B]
// 4. If mspan has free slots:
// - Mark slot as used
// - Return pointer
// - DONE (no locks, ~10-50ns)
//
// 5. If mcache mspan is empty:
// - Lock mcentral[sizeclass 32]
// - Pop mspan from nonempty list
// - Move to mcache
// - Unlock
// - Allocate from new mspan
// - Return pointer
// - (~100-500ns)
//
// 6. If mcentral has no mspans:
// - Lock mheap
// - Allocate pages from mheap
// - Create new mspan
// - Unlock
// - Give to mcentral
// - Try again from step 5
// - (~1-10μs)
//
// 7. If mheap has no free pages:
// - Call OS (mmap on Linux)
// - Add pages to mheap
// - Try again from step 6
//    - (~10-100μs)

Memory Returned to OS: Scavenging
Go doesn't hold onto freed memory forever. The background scavenger runs periodically and returns unused memory to the OS (via madvise(MADV_DONTNEED) on Linux):
// Forcing memory return to OS
import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func forceScavenge() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapReleased: %v bytes (returned to OS)\n", m.HeapReleased)

	// Force a garbage collection followed by an immediate scavenge
	debug.FreeOSMemory()

	runtime.ReadMemStats(&m)
	fmt.Printf("HeapReleased: %v bytes (after forced scavenge)\n", m.HeapReleased)
}

The scavenger normally runs automatically in the background; debug.FreeOSMemory forces a GC followed by an immediate scavenge. Note that HeapSys rarely shrinks — the virtual mapping is kept — so released memory shows up in HeapReleased instead.
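To see scavenging actually reclaim something, allocate and drop a large buffer first. A self-contained sketch — the 64MB figure is arbitrary, and exact amounts returned are platform- and version-dependent:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// releasedDelta allocates and drops a 64MB buffer, forces a GC plus
// scavenge, and reports how much additional heap memory was returned
// to the OS.
func releasedDelta() uint64 {
	buf := make([]byte, 64<<20)
	for i := range buf {
		buf[i] = byte(i) // touch every page so it is actually backed
	}
	var before runtime.MemStats
	runtime.ReadMemStats(&before)

	buf = nil            // drop the only reference
	debug.FreeOSMemory() // GC + immediate scavenge

	var after runtime.MemStats
	runtime.ReadMemStats(&after)
	return after.HeapReleased - before.HeapReleased
}

func main() {
	fmt.Printf("returned to OS: %d MB\n", releasedDelta()>>20)
}
```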
Arena Hints: Contiguous Heap
Go requests memory from the OS in large contiguous blocks called arenas (64MB on most 64-bit platforms); a list of arena hints records where to ask for the next one. This provides better spatial locality:
// Linux: mmap is called with a hint address (not MAP_FIXED), and the
// runtime checks whether the OS honored the hint.
// Requesting contiguous regions like:
// [base: base+64MB) ─ [base+64MB: base+128MB) ─ ...
//
// Benefits:
// 1. Better TLB efficiency (fewer, larger mappings)
// 2. Fast address-to-metadata lookups with a known layout
// 3. Less virtual address space fragmentation

Size-Class Aware Optimization: Practical Code
Understanding size classes allows you to optimize allocations:
// Example 1: Combining allocations
// Bad:
a := new(int32) // 4 → 8 bytes
b := new(int32) // 4 → 8 bytes
c := new(int32) // 4 → 8 bytes
// Total allocated: 24 bytes
// Good:
type Point struct {
x, y, z int32
}
p := new(Point) // 12 → 16 bytes
// Total allocated: 16 bytes
// Example 2: Avoiding fragmentation
// Bad: Variable-sized allocations cause fragmentation
sizes := []int{7, 15, 33, 65, 129} // Fits in classes: 8, 16, 48, 80, 144
// Lots of wasted space across classes
// Good: Use fixed-size chunks
const chunkSize = 32
chunks := make([]byte, chunkSize) // Efficient, predictable
// Example 3: Pointer vs non-pointer impact
// GC must scan any type containing pointers
// So minimize them in hot data structures
type BadLayout struct {
data *[]byte // Requires scanning
metadata int64
}
type GoodLayout struct {
	data []byte // One less indirection; the slice header's pointer is still scanned
	metadata int64
}

Reading MemStats
The runtime.MemStats struct reveals allocator behavior:
func analyzeMemory() {
var m runtime.MemStats
runtime.ReadMemStats(&m)
fmt.Printf("HeapAlloc: %v B\n", m.HeapAlloc) // Current allocs
fmt.Printf("HeapSys: %v B\n", m.HeapSys) // Total from OS
fmt.Printf("HeapIdle: %v B\n", m.HeapIdle) // Free in heap
fmt.Printf("HeapInuse: %v B\n", m.HeapInuse) // In use
fmt.Printf("HeapReleased: %v B\n", m.HeapReleased)// Returned to OS
fmt.Printf("StackInuse: %v B\n", m.StackInuse) // Goroutine stacks
fmt.Printf("MSpanInuse: %v B\n", m.MSpanInuse) // mspan metadata
fmt.Printf("MCacheInuse: %v B\n", m.MCacheInuse) // mcache metadata
fmt.Printf("\nAllocation stats:\n")
fmt.Printf("Mallocs: %v\n", m.Mallocs) // Total mallocs
fmt.Printf("Frees: %v\n", m.Frees) // Total frees
fmt.Printf("LiveObjects: %v\n", m.Mallocs - m.Frees)
}

Benchmarking Allocation Performance
func BenchmarkAllocationSizes(b *testing.B) {
tests := []struct {
name string
size int
}{
{"tiny_8", 8},
{"small_32", 32},
{"small_256", 256},
{"large_1K", 1024},
{"large_10K", 10 * 1024},
{"large_1M", 1024 * 1024},
}
for _, tt := range tests {
b.Run(tt.name, func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
s := make([]byte, tt.size)
_ = s[0] // Use s so the allocation isn't optimized away
}
})
}
}
// Sample output (on modern hardware):
// BenchmarkAllocationSizes/tiny_8-8 30000000 35.2 ns/op 8 B/op 1 allocs/op
// BenchmarkAllocationSizes/small_32-8 20000000 52.1 ns/op 32 B/op 1 allocs/op
// BenchmarkAllocationSizes/small_256-8 10000000 103.4 ns/op 256 B/op 1 allocs/op
// BenchmarkAllocationSizes/large_1K-8 5000000 241.7 ns/op 1024 B/op 1 allocs/op
// BenchmarkAllocationSizes/large_10K-8 1000000 1145.3 ns/op 10240 B/op 1 allocs/op
// BenchmarkAllocationSizes/large_1M-8 10000 105843.0 ns/op 1048576 B/op 1 allocs/op

Notice:
- Tiny allocations are fastest (~30ns)
- Small allocations scale linearly until ~10KB
- Very large allocations cost ~100+μs, dominated by page management and zeroing (and sometimes fresh memory from the OS)
Allocation Patterns That Scale Well
// Pattern 1: Object pool (reuse allocations)
type objPool struct {
available []*largeObject
mu sync.Mutex
}
func (p *objPool) Get() *largeObject {
p.mu.Lock()
if len(p.available) > 0 {
obj := p.available[len(p.available)-1]
p.available = p.available[:len(p.available)-1]
p.mu.Unlock()
return obj
}
p.mu.Unlock()
return &largeObject{} // Allocate new
}
func (p *objPool) Put(obj *largeObject) {
p.mu.Lock()
p.available = append(p.available, obj)
p.mu.Unlock()
}
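An aside on Pattern 1: the standard library's sync.Pool implements the same reuse idea with GC cooperation — pooled objects can be reclaimed during collections, so the pool never pins memory forever. A self-contained sketch (largeObject here is a stand-in type):

```go
package main

import (
	"fmt"
	"sync"
)

type largeObject struct {
	buf [1 << 16]byte
}

// pool replaces the hand-rolled objPool: New is called only when the
// pool has nothing to hand out, and idle pooled objects may be
// reclaimed by the GC between collections.
var pool = sync.Pool{
	New: func() any { return new(largeObject) },
}

func process() {
	obj := pool.Get().(*largeObject)
	defer pool.Put(obj) // return for reuse instead of becoming garbage
	obj.buf[0] = 1
}

func main() {
	for i := 0; i < 4; i++ {
		process()
	}
	fmt.Println("done")
}
```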
// Pattern 2: Preallocate slices
data := make([]Item, 0, expectedCapacity)
for _, item := range input {
data = append(data, item) // No allocation if cap sufficient
}
// Pattern 3: Avoid pointer fields in high-frequency types
type Event struct {
// GOOD: Value field
data [256]byte
// BAD: Pointer field (requires GC scanning)
// ptr *[256]byte
}

Summary
Go's memory allocator earns its speed through a few key design choices:
- Lock-free fast path: Most allocations hit the mcache (10-50ns)
- Size classes: Reduce fragmentation and bound per-object waste (roughly 12.5% for the larger classes)
- Hierarchical architecture: Only takes locks when necessary
- Tiny allocator: Packs small objects, saving memory
- Per-P caching: No contention between goroutines on different Ps
Understanding these internals lets you:
- Reduce allocation overhead in hot paths
- Minimize memory waste from rounding
- Design data structures that play well with size classes
- Know when to use object pools vs. allocation
- Interpret MemStats and identify allocation bottlenecks
The allocator's design ensures that in well-written Go code, allocation is never the bottleneck — but in poorly-written code, it can be. Profile your code and allocate wisely.