Garbage Collector Deep Dive - Concurrent Tri-Color Mark-and-Sweep
Deep technical exploration of Go's concurrent garbage collector, tri-color marking, write barriers, STW pauses, GC pacer, mark assist, and strategies for reducing GC pressure and latency.
Garbage Collector Deep Dive
The Go garbage collector is one of the most sophisticated pieces of the runtime. It manages the challenging problem of reclaiming memory while keeping pause times low and allowing the program to run concurrently. Understanding how it works is key to writing low-latency Go applications.
GC Algorithm Overview: Concurrent Tri-Color Mark-and-Sweep
Go uses a concurrent, tri-color mark-and-sweep garbage collector. This is not a generational or copying collector — it's a mark-and-sweep design similar to classical algorithms, but with clever concurrency mechanics.
The algorithm has five phases:
Time ──────────────────────────────────────────────────────────────────►
┌────────────────────────────────────────────────────────────┐
│ Phase 0: Sweep Termination │
│ - Finish any sweeping from previous GC cycle │
└────────────┬───────────────────────────────────────────────┘
│ (brief, ~microseconds)
▼
┌────────────────────────────────────────────────────────────┐
│ Phase 1: Mark Setup (STW) │
│ - Stop the world (no goroutines running) │
│ - Enable write barrier │
│ - Scan globals and stacks (all root pointers) │
└────────────┬───────────────────────────────────────────────┘
│ (STW pause ~0.1-1ms)
▼
┌────────────────────────────────────────────────────────────┐
│ Phase 2: Marking (Concurrent) │
│ - Application goroutines run │
│ - Mark goroutines scan object graph │
│ - Write barrier tracks mutations │
│ - Most GC time is spent here (scales with live heap) │
└────────────┬───────────────────────────────────────────────┘
│ (program runs while marking)
▼
┌────────────────────────────────────────────────────────────┐
│ Phase 3: Mark Termination (STW) │
│ - Stop the world again │
│ - Flush remaining mark work (write-barrier buffers) │
│ - Disable write barrier │
│ - Determine what's garbage (unmarked objects) │
└────────────┬───────────────────────────────────────────────┘
│ (STW pause ~0.5-2ms)
▼
┌────────────────────────────────────────────────────────────┐
│ Phase 4: Sweep (Concurrent) │
│ - Application runs │
│ - Background goroutine sweeps: marks spans as free │
│ - Freed memory returns to allocator │
└────────────────────────────────────────────────────────────┘
(overlaps with next cycle's marking)
Goroutine Timeline:

 Phase 1      Phase 2       Phase 3     Phase 4
  (STW)     (Concurrent)     (STW)    (Concurrent)
    |            |             |           |
    ▼            ▼             ▼           ▼
 ═══════   ────────────    ═══════   ────────────
  Pause    Run (with        Pause    Run
           barriers)                 (sweep)

The Tri-Color Invariant: White, Grey, Black
The marking process uses three colors to track object status:
┌──────────────────────────────────────────────────────────┐
│ Heap Memory │
├──────────────────────────────────────────────────────────┤
│ │
│ ◯ White = untouched; might be dead │
│ ◐ Grey = reachable, but children not yet scanned │
│ ◑ Black = reachable, and children fully scanned │
│ │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │◯Obj1 │ │◐Obj2 │ │◑Obj3 │ │◯Obj4 │ │
│ │white │ │grey │ │black │ │white │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ │
└──────────────────────────────────────────────────────────┘
Invariant: Black objects can never directly point to white objects.
(Enforced by write barrier)
If B (black) assigns a pointer to W (white):
B.field = W
Write barrier intercepts:
1. Mark W as grey (if not already)
2. Add W to mark stack
3. Later, W will be scanned
This prevents W from being mistakenly collected while B points to it.

Marking Process
Start of Marking Phase:
All objects start white
Objects referenced by roots are marked grey
No objects are black yet
┌──────────────────────────┐
│ Global variables point to│
│ objects, mark them grey │
└─────────┬────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Scan grey object: │
│ - Follow all pointers in object │
│ - Mark targets as grey (if white) │
│ - Mark current object as black │
└─────────┬──────────────────────────────┘
│
├─ Repeat: scan next grey object
│
└─ Until no grey objects remain
│
▼
┌──────────────────┐
│ All white objects│
│ = garbage │
│ = collect them │
└──────────────────┘

Example walkthrough:
Initial:
Root: A ──► B ──► C
             └──► D
E (unreachable)

After marking completes:
Reachable:
A (black) ──► B (black) ──► C (black)
               └──► D (black)
Unreachable:
E (white) ◄─ Garbage! Sweep will free it.

Write Barrier: Protecting Concurrent Safety
When a goroutine modifies a pointer during marking, the write barrier ensures the GC doesn't miss objects. Go uses a hybrid barrier (since Go 1.8):
Dijkstra Insertion Barrier + Yuasa Deletion Barrier = Hybrid
Dijkstra (insertion):
Any new edge that points from black to white
must mark target as grey
Yuasa (deletion):
Any edge deleted that pointed from grey to white
must preserve the white object (by marking as grey)
Hybrid: on every pointer write, shade the old value of the slot; additionally shade the new value if the writing goroutine's stack has not yet been scanned. This lets each stack be scanned exactly once, with no re-scan at mark termination.

Write Barrier Example
// Pseudocode: what the compiler-inserted barrier conceptually does.
// markGrey and writeBarrier.enabled are illustrative names; the real
// runtime uses buffered write barriers and also shades the old value
// of the slot (the Yuasa half of the hybrid barrier).
type Node struct {
	value int
	next  *Node
}

func (n *Node) setNext(next *Node) {
	// Compiler-inserted barrier code (conceptually):
	if writeBarrier.enabled && next != nil {
		// If 'next' is white, mark it grey so the GC cannot
		// miss it, even if 'n' has already been scanned (black).
		markGrey(next)
	}
	n.next = next
}

// Result: even though the GC might have already scanned 'n',
// 'next' is now grey and will be scanned before mark termination.

The write barrier is not free: every pointer write during marking executes extra barrier code. This is why heavy pointer mutation during GC can hurt throughput and latency.
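To make the mechanics concrete, here is a self-contained toy model of tri-color marking with a shade-on-write barrier. Object, shade, setChild, and mark are illustrative names for this sketch, not the runtime's real internals:

```go
package main

import "fmt"

type color int

const (
	white color = iota // not yet reached; may be garbage
	grey               // reachable, children not yet scanned
	black              // reachable, fully scanned
)

// Object is a toy heap object, not the runtime's real representation.
type Object struct {
	name     string
	children []*Object
	c        color
}

var greyList []*Object // the mark worklist

// shade is the barrier's core action: make sure o cannot be missed.
func shade(o *Object) {
	if o != nil && o.c == white {
		o.c = grey
		greyList = append(greyList, o)
	}
}

// setChild mimics a pointer write with the write barrier enabled.
func setChild(parent, child *Object) {
	shade(child) // barrier shades the new target
	parent.children = append(parent.children, child)
}

// mark runs the tri-color worklist loop until no grey objects remain.
func mark(roots []*Object) {
	for _, r := range roots {
		shade(r)
	}
	for len(greyList) > 0 {
		o := greyList[len(greyList)-1]
		greyList = greyList[:len(greyList)-1]
		for _, c := range o.children {
			shade(c)
		}
		o.c = black // all children queued: fully scanned
	}
}

func main() {
	a := &Object{name: "A"}
	b := &Object{name: "B"}
	e := &Object{name: "E"} // never linked: unreachable
	setChild(a, b)
	mark([]*Object{a})
	fmt.Println(a.c == black, b.c == black, e.c == white) // true true true
}
```

Anything still white when the worklist empties (E here) is garbage by the tri-color invariant.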
STW Pauses: Why They're Necessary
STW = Stop the World: All application goroutines are paused.
Why necessary:

1. Root scanning (Mark Setup):
- Goroutine stacks must be scanned for root pointers
- A stack can only be scanned safely while its goroutine is paused
- There is no cheap way to coordinate 100,000 running goroutines one at a time

2. Final flush (Mark Termination):
- Remaining mark work (e.g. write-barrier buffers) must be drained
- Thanks to the hybrid barrier, stacks need no full re-scan
- A brief pause guarantees no marking work is left outstanding
Mark Setup STW (Phase 1):
┌───────────────────────────────┐
│ GC: Stop all goroutines │
├───────────────────────────────┤
│ Scan stacks of all G's: │
│ for each g in allgs { │
│ scan(g.stack) │
│ } │
└───────────────────────────────┘
Typical: 0.1 - 1.0 ms
Mark Termination STW (Phase 3):
┌──────────────────────────────────┐
│ GC: Stop all goroutines again │
├──────────────────────────────────┤
│ Flush remaining mark work │
│ (write-barrier buffers) │
│ Disable write barrier │
│ Set phase = sweep │
└──────────────────────────────────┘
Typical: 0.5 - 2.0 ms

Even in programs running 50-100K goroutines, STW pauses typically stay under ~2ms.
GC Pacer: Timing GC Triggers
The pacer decides when to trigger the next GC. It aims to keep garbage collection work steady throughout execution, avoiding bursty behavior.
Default policy: GC triggers when heap size doubles (controlled by GOGC).
GOGC=100 (default):
trigger = heapMarked * 2
GOGC=50:
trigger = heapMarked * 1.5 (GC runs more often)
GOGC=200:
trigger = heapMarked * 3 (GC runs less often)
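The trigger arithmetic above can be written as a tiny helper (simplified; the real pacer also accounts for stacks, globals, and GOMEMLIMIT):

```go
package main

import "fmt"

// gcTrigger returns the (simplified) heap size at which the next GC
// fires: the live heap after the last cycle, grown by GOGC percent.
func gcTrigger(heapMarked uint64, gogc uint64) uint64 {
	return heapMarked * (100 + gogc) / 100
}

func main() {
	const mb = 1 << 20
	fmt.Println(gcTrigger(20*mb, 100) / mb) // 40
	fmt.Println(gcTrigger(20*mb, 50) / mb)  // 30
	fmt.Println(gcTrigger(20*mb, 200) / mb) // 60
}
```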
Timeline with GOGC=100 (each trigger is double the live heap
left by the previous cycle):

live heap after last GC:  10MB   20MB   40MB
next GC trigger:          20MB   40MB   80MB

The pacer estimates GC cost and spreads work across available CPU:
// Simplified, illustrative pacer logic (not the real runtime code)
type gcPacer struct {
	heapMarked uint64 // live heap at the end of the last GC
	lastGCTime int64  // timestamp of the last GC
}

func (p *gcPacer) shouldTriggerGC(heapNow uint64, gogc uint64) bool {
	trigger := p.heapMarked * (100 + gogc) / 100
	return heapNow >= trigger
}

Starting with Go 1.19, GOMEMLIMIT provides a soft memory limit: the GC becomes more aggressive as the heap approaches the limit.
// Set a 512MB memory limit
debug.SetMemoryLimit(512 * 1024 * 1024)
// GC will trigger more frequently as heap
// approaches 512MB, regardless of GOGC

Mark Assist: Helping During Allocation
If a goroutine allocates a lot during marking phase, it gets roped into helping with marking:
Application Goroutine Timeline:
Normal Path:
alloc() ── [no GC] ── return
(~10ns)
During Mark Phase (no assist needed):
alloc() ── [check] ── return
(~40ns, minimal overhead)
During Mark Phase (assist triggered):
alloc() ── [MARK ASSIST] ── scan objects ── return
(~1000ns+, significant overhead!)
(forced to do marking work on the GC's behalf, potentially for milliseconds)

Mark assist can cause latency spikes:
// High allocation rate during marking
func heavyAllocDuringGC() {
for i := 0; i < 1e7; i++ {
x := make([]byte, 1024)
_ = x
}
}
// Executed during marking phase:
// - First few allocations: fast (~100ns)
// - Then: MARK ASSIST triggers
// - Subsequent allocations: slow (~1000ns+)
// - Cause: forced to scan several MB of heap on the GC's behalf

You can observe mark assist in CPU profiles as runtime.gcAssistAlloc.
STW Duration Measurement
Measure actual GC pauses:
package main
import "runtime"

func main() {
	// GODEBUG must be set in the environment *before* the program
	// starts; setting it with os.Setenv at runtime has no effect.
	// Run with: GODEBUG=gctrace=1 go run main.go

	// Allocate heavily so the GC has work to do.
	for i := 0; i < 100; i++ {
		s := make([]byte, 1024*1024) // 1MB
		_ = s
	}
	runtime.GC() // force a final GC cycle
}

// Each cycle prints a line like (fields vary slightly by Go version):
// gc 1 @0.012s 2%: 0.026+0.39+0.10 ms clock, 0.21+0.88/0.74/0+0.84 ms cpu,
//     4->4->0 MB, 5 MB goal, 8 P
//
// The three "clock" numbers are:
//   sweep-termination STW + concurrent mark/scan + mark-termination STW

Or use runtime.ReadMemStats:
func measureGCPause() {
	// Do some allocating work, then force a cycle.
	for i := 0; i < 1000; i++ {
		_ = make([]byte, 64*1024)
	}
	runtime.GC()

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// PauseNs is a 256-entry circular buffer; this indexes the
	// most recent pause.
	last := m.PauseNs[(m.NumGC+255)%256]
	fmt.Printf("last GC pause: %d ns\n", last)
}

GC Tuning Parameters
GOGC
Controls heap growth factor:
// Aggressive GC (less memory, more pause time)
debug.SetGCPercent(25) // GC when heap = 1.25x
// Lazy GC (more memory, less pause time)
debug.SetGCPercent(200) // GC when heap = 3x
// Disable GC (not recommended!)
debug.SetGCPercent(-1) // No automatic GC

GOMEMLIMIT (Go 1.19+)
Sets soft memory ceiling:
// Ensure process doesn't exceed 512MB
debug.SetMemoryLimit(512 * 1024 * 1024)
// GC becomes more aggressive as memory approaches limit
// Often a better fit than GOGC alone for containerized workloads

Manual GC Trigger
// Force GC cycle
runtime.GC()
// Useful for:
// 1. Predictable pause times (run during off-peak)
// 2. Testing GC behavior
// 3. Critical sections (run GC before latency-sensitive code)

Stack Scanning: A Bottleneck
All goroutine stacks must be scanned for root pointers during Mark Setup (STW).
Mark Setup:
┌─────────────────────────────────┐
│ for each goroutine g { │
│ scan(g.stack) │
│ } │
└─────────────────────────────────┘
Cost: O(number of goroutines * avg stack size)
Examples:
- 1,000 goroutines × 2KB avg = 2MB scan = ~0.1ms STW
- 100,000 goroutines × 4KB avg = 400MB scan = ~5ms STW
- 1M goroutines × 8KB avg = 8GB scan = ~50ms STW
Many long-lived goroutines with deep call stacks = more stack memory to scan = longer pauses

Optimization: Reduce goroutine count where possible, or keep stacks shallow.
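One standard way to bound goroutine count is a fixed worker pool. This sketch fans 100 jobs through 4 workers; process and the sizes are arbitrary placeholders:

```go
package main

import (
	"fmt"
	"sync"
)

// process stands in for real per-item work.
func process(n int) int { return n * n }

func main() {
	jobs := make(chan int)
	results := make(chan int)

	// A fixed pool of 4 workers instead of one goroutine per item:
	// a bounded goroutine count means bounded stack-scanning work.
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range jobs {
				results <- process(n)
			}
		}()
	}
	go func() { wg.Wait(); close(results) }()

	// Feed 100 jobs, then close the channel to stop the workers.
	go func() {
		for i := 1; i <= 100; i++ {
			jobs <- i
		}
		close(jobs)
	}()

	sum := 0
	for r := range results {
		sum += r
	}
	fmt.Println(sum) // 338350 = 1² + 2² + … + 100²
}
```

The same four stacks are reused no matter how many jobs flow through, instead of 100 short-lived goroutines each adding a stack to scan.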
Pointer Detection: What GC Scans
The GC only scans types that contain pointers. Consider:
// Type with pointers → must scan
type WithPointers struct {
ptr *int
str string // string has internal pointer!
m map[string]int // map has internal pointers
}
// Type without pointers → fast allocation, no scan
type WithoutPointers struct {
b [1024]byte // just bytes
x int64
y float64
}
// GC implication:
// var a [1000]WithoutPointers // no scanning needed at all
// var b [1000]*WithPointers // the array and every target get scanned
// In maps:
// map[string]int ← value is int (no scan)
// map[string]string ← value is string (contains ptr, scan)
// map[int]*string ← value is ptr (scan)

You can approximate this check with reflection:
import "reflect"

// hasPointers reports whether values of type t contain pointers the GC must scan.
func hasPointers(t reflect.Type) bool {
	switch t.Kind() {
	case reflect.Ptr, reflect.UnsafePointer, reflect.Map, reflect.Chan, reflect.Func, reflect.Slice, reflect.String, reflect.Interface:
		return true
	case reflect.Array:
		return hasPointers(t.Elem())
	case reflect.Struct:
		for i := 0; i < t.NumField(); i++ {
			if hasPointers(t.Field(i).Type) {
				return true
			}
		}
	}
	return false
}

Optimization: When possible, use non-pointer types in hot data structures:
// Bad: map values need scanning
eventMap := make(map[string]*Event)
// Better: values are integers (no scan)
eventIndexMap := make(map[string]int)
events := make([]*Event, 0)

Green Tea GC (Experimental, Go 1.25+)
This experimental collector (enabled with GOEXPERIMENT=greenteagc) aims to reduce marking and mark-assist overhead by improving the spatial locality of mark work and distributing it better across CPUs.
Performance Implications: Benchmarking GC Impact
Benchmark: GC Pause Times vs GOGC
func BenchmarkGCPauseVsGOGC(b *testing.B) {
tests := []int{25, 50, 100, 200}
for _, gogc := range tests {
b.Run(fmt.Sprintf("GOGC=%d", gogc), func(b *testing.B) {
debug.SetGCPercent(gogc)
var totalPause time.Duration
var lastGC uint32
b.ResetTimer()
for i := 0; i < b.N; i++ {
// Allocate heavily
s := make([]byte, 1024)
_ = s
// Measure pause times
var m runtime.MemStats
runtime.ReadMemStats(&m)
if m.NumGC > lastGC {
lastGC = m.NumGC
pause := time.Duration(m.PauseNs[(m.NumGC-1)%256])
totalPause += pause
}
}
b.ReportMetric(float64(totalPause.Milliseconds()), "total_pause_ms")
})
}
}
// Sample results:
// GOGC=25: more frequent GC, smaller heaps, more pauses (5-10ms total)
// GOGC=100: balanced (20-30ms total pause across test)
// GOGC=200: less frequent GC, larger heaps, fewer pauses (15-20ms total)

Benchmark: Mark Assist Latency
func BenchmarkMarkAssistLatency(b *testing.B) {
// Warm up to trigger GC
allocate := func(n int) {
for i := 0; i < n; i++ {
_ = make([]byte, 1024)
}
}
b.Run("before_gc", func(b *testing.B) {
runtime.GC()
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = make([]byte, 1024)
}
})
b.Run("during_gc_marking", func(b *testing.B) {
// Best effort: start background allocation so a GC cycle
// is likely to be marking while we measure
go func() {
allocate(10000)
}()
time.Sleep(10 * time.Millisecond) // Let marking start
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = make([]byte, 1024) // Measure latency
}
})
}
// Sample results:
// before_gc: ~100ns per allocation
// during_gc_marking: ~500ns per allocation (mark assist overhead)

Strategies for Low-Latency GC
1. Reduce Allocation Rate
// Bad: allocates a fresh buffer every iteration
for _, item := range items {
	buf := make([]byte, 4096)
	process(buf, item)
}

// Good: reuse one buffer across iterations
buf := make([]byte, 4096)
for _, item := range items {
	clearBuffer(buf)
	process(buf, item)
}

2. Use Object Pools
var pool = sync.Pool{
New: func() interface{} {
return make([]byte, 4096)
},
}
// In hot path:
buf := pool.Get().([]byte)
defer pool.Put(buf)
// ... use buf ...

3. Tune GOGC for Workload
// Latency-sensitive workload
debug.SetGCPercent(200) // Less frequent GC
// Memory-sensitive workload
debug.SetGCPercent(50) // More frequent GC, smaller heap
// Containerized workload (Go 1.19+)
debug.SetMemoryLimit(containerMemory) // Automatic tuning

4. Avoid High Allocation During Mark Phase
Identify when marking is happening:
GODEBUG=gctrace=1 ./program 2>&1 | grep "gc "

Schedule heavy allocations before or after GC cycles.
5. Reduce Pointer Density
// Bad: lots of pointers
type Event struct {
ID *string
Name *string
Value *int64
Data *[]byte
}
// Good: fewer pointers
type Event struct {
ID, Name string
Value int64
Data []byte
}

6. Goroutine Count Awareness
More goroutines = longer STW (stack scanning):
// Profile goroutine count
runtime.NumGoroutine() // Returns count
// If approaching 100K:
// - Consider worker pool pattern
// - Combine goroutines where possible
// - Use channels to multiplex work

Summary
Go's garbage collector is a marvel of concurrent algorithm design:
- Concurrent marking: 95%+ of GC time doesn't pause application
- Write barrier: Enables concurrent safety without STW during marking
- Tri-color marking: Efficient, proven algorithm for reachability analysis
- GC pacer: Distributes GC work evenly throughout execution
- Mark assist: Prevents unbounded latency spikes from heavy allocators
To write fast Go code:
- Understand GC pause times vary with workload
- Profile with gctrace to see actual behavior
- Minimize allocations in latency-sensitive paths
- Use object pools for frequently allocated types
- Keep goroutine counts reasonable
- Tune GOGC or GOMEMLIMIT for your workload
- Remember: GC is not free, but it's predictable
The GC won't make your program slow if you respect its constraints. Write allocation-efficient code, and the GC will be your friend.