
Garbage Collector Deep Dive - Concurrent Tri-Color Mark-and-Sweep

Deep technical exploration of Go's concurrent garbage collector, tri-color marking, write barriers, STW pauses, GC pacer, mark assist, and strategies for reducing GC pressure and latency.

Garbage Collector Deep Dive

The Go garbage collector is one of the most sophisticated pieces of the runtime. It manages the challenging problem of reclaiming memory while keeping pause times low and allowing the program to run concurrently. Understanding how it works is key to writing low-latency Go applications.

GC Algorithm Overview: Concurrent Tri-Color Mark-and-Sweep

Go uses a concurrent, tri-color mark-and-sweep garbage collector. This is not a generational or copying collector — it's a mark-and-sweep design similar to classical algorithms, but with clever concurrency mechanics.

The algorithm has five phases:

Time ──────────────────────────────────────────────────────────────────►

     ┌────────────────────────────────────────────────────────────┐
     │ Phase 0: Sweep Termination                                │
     │ - Finish any sweeping from previous GC cycle              │
     └────────────┬───────────────────────────────────────────────┘
                  │ (brief, ~microseconds)

     ┌────────────────────────────────────────────────────────────┐
     │ Phase 1: Mark Setup (STW)                                 │
     │ - Stop the world (no goroutines running)                  │
     │ - Enable write barrier                                    │
     │ - Scan globals and stacks (all root pointers)             │
     └────────────┬───────────────────────────────────────────────┘
                  │ (STW pause ~0.1-1ms)

     ┌────────────────────────────────────────────────────────────┐
     │ Phase 2: Marking (Concurrent)                             │
     │ - Application goroutines run                              │
     │ - Mark goroutines scan object graph                       │
     │ - Write barrier tracks mutations                          │
     │ - Most time spent here (100-1000ms typical)               │
     └────────────┬───────────────────────────────────────────────┘
                  │ (program runs while marking)

     ┌────────────────────────────────────────────────────────────┐
     │ Phase 3: Mark Termination (STW)                            │
     │ - Stop the world again                                     │
     │ - Drain any remaining mark work                            │
     │ - Disable write barrier                                    │
     │ - Unmarked (white) objects are now garbage                 │
     └────────────┬───────────────────────────────────────────────┘
                  │ (STW pause ~0.5-2ms)

     ┌────────────────────────────────────────────────────────────┐
     │ Phase 4: Sweep (Concurrent)                               │
     │ - Application runs                                        │
     │ - Background goroutine sweeps: marks spans as free        │
     │ - Freed memory returns to allocator                       │
     └────────────────────────────────────────────────────────────┘
         (overlaps with next cycle's marking)

Goroutine Timeline:

     Phase 1      Phase 2         Phase 3      Phase 4
     (STW)        (Concurrent)    (STW)        (Concurrent)
       │             │              │             │
       ▼             ▼              ▼             ▼
     ═════       ════════════     ═════       ════════════
     Pause       Run (with        Pause       Run (sweep in
                 barriers)                    background)

The Tri-Color Invariant: White, Grey, Black

The marking process uses three colors to track object status:

┌──────────────────────────────────────────────────────────┐
│                       Heap Memory                        │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ◯ White = untouched; might be dead                      │
│  ◐ Grey  = reachable, but children not yet scanned       │
│  ● Black = reachable, and children fully scanned         │
│                                                          │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐                      │
│ │◯Obj1 │ │◐Obj2 │ │●Obj3 │ │◯Obj4 │                      │
│ │white │ │grey  │ │black │ │white │                      │
│ └──────┘ └──────┘ └──────┘ └──────┘                      │
│                                                          │
└──────────────────────────────────────────────────────────┘

Invariant: Black objects can never directly point to white objects.
            (Enforced by write barrier)

If B (black) assigns a pointer to W (white):
    B.field = W

Write barrier intercepts:
    1. Mark W as grey (if not already)
    2. Add W to mark stack
    3. Later, W will be scanned

This prevents W from being mistakenly collected while B points to it.

Marking Process

Start of Marking Phase:

All objects start white
Roots (globals, goroutine stacks) are marked grey
No objects are black yet

┌──────────────────────────┐
│ Roots (globals, stacks)  │
│ point to objects:        │
│ mark those objects grey  │
└─────────┬────────────────┘
          │
          ▼
┌──────────────────────────────────────┐
│ Scan a grey object:                  │
│ - Follow all pointers in the object  │
│ - Mark targets grey (if white)       │
│ - Mark the object itself black       │
└─────────┬────────────────────────────┘
          │
          ├─ Repeat: scan next grey object
          │
          └─ Until no grey objects remain
                     │
                     ▼
          ┌──────────────────┐
          │ All white objects│
          │ = garbage        │
          │ = collect them   │
          └──────────────────┘

Example walkthrough:

Initial:
    Root ──► A ──► B ──► C
                   │
                   └──► D
    E (unreachable)

After marking completes:
    Reachable:
    A (black) ──► B (black) ──► C (black)
                  │
                  └──► D (black)
    Unreachable:
    E (white) ◄─ Garbage! Sweep will free it.
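The walkthrough above can be expressed as a toy marker. This is an illustrative sketch only (the obj type, colors, and mark/markDemo functions are invented for the example; the real runtime marks spans via per-P work buffers, not Go structs):

```go
package main

import "fmt"

type color int

const (
	white color = iota // not yet seen: candidate garbage
	grey               // reachable, children not yet scanned
	black              // reachable, children fully scanned
)

type obj struct {
	name     string
	color    color
	children []*obj
}

// mark runs tri-color marking from roots: grey the roots, then
// repeatedly take a grey object, grey its white children, and
// blacken it, until no grey objects remain.
func mark(roots []*obj) {
	greyStack := append([]*obj(nil), roots...)
	for _, r := range roots {
		r.color = grey
	}
	for len(greyStack) > 0 {
		o := greyStack[len(greyStack)-1]
		greyStack = greyStack[:len(greyStack)-1]
		for _, c := range o.children {
			if c.color == white {
				c.color = grey
				greyStack = append(greyStack, c)
			}
		}
		o.color = black
	}
}

// markDemo builds the A→B→{C,D}, E-unreachable graph from the
// walkthrough and reports which objects ended up as garbage.
func markDemo() map[string]bool {
	c := &obj{name: "C"}
	d := &obj{name: "D"}
	b := &obj{name: "B", children: []*obj{c, d}}
	a := &obj{name: "A", children: []*obj{b}}
	e := &obj{name: "E"} // unreachable

	mark([]*obj{a})

	garbage := map[string]bool{}
	for _, o := range []*obj{a, b, c, d, e} {
		garbage[o.name] = o.color == white
	}
	return garbage
}

func main() {
	for name, isGarbage := range markDemo() {
		fmt.Printf("%s: garbage=%v\n", name, isGarbage)
	}
}
```

Running it shows A-D survive (black) while E stays white and would be swept.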

Write Barrier: Protecting Concurrent Safety

When a goroutine modifies a pointer during marking, the write barrier ensures the GC doesn't miss objects. Go uses a hybrid barrier (since Go 1.8):

Dijkstra Insertion Barrier + Yuasa Deletion Barrier = Hybrid

Dijkstra (insertion):
    When a new pointer is installed, shade the target grey,
    so a black object can never hide a white one behind a new edge

Yuasa (deletion):
    When a pointer is overwritten, shade the old target grey,
    so objects reachable at the start of marking are preserved

Hybrid: shade the pointer being overwritten, and also shade the new
pointer if the current goroutine's stack has not been scanned yet.
This combination is what lets Go avoid re-scanning goroutine stacks
at mark termination.

Write Barrier Example

// Conceptual sketch - not the actual runtime code.
type Node struct {
    value int
    next  *Node
}

// During GC marking (concurrent), the write barrier is active.
// The compiler rewrites every pointer store roughly like this:

func (n *Node) setNext(next *Node) {
    if writeBarrier.enabled {
        // Hybrid barrier: shade the old value being overwritten,
        // and the new value too if this stack is still unscanned,
        // so neither object can be missed by the marker.
        shade(n.next)
        shade(next)
    }
    n.next = next
}

// Result: even if the GC has already scanned 'n' (black),
// 'next' is now grey and will be scanned before marking ends.

The write barrier is not free: every pointer write during marking must execute barrier code. This is why aggressive mutations during GC can hurt latency.

STW Pauses: Why They're Necessary

STW = Stop the World: All application goroutines are paused.

Why necessary:

  1. Root scanning (Mark Setup):

    • The write barrier must be enabled on every P at the same instant
    • Globals are scanned and goroutine stacks queued for scanning
    • A goroutine's stack can only be scanned while it is stopped
  2. Final flush (Mark Termination):

    • Concurrent marking leaves residual work in per-P mark buffers
    • That work must be drained with no mutators running
    • The write barrier is then disabled atomically
Mark Setup STW (Phase 1):
┌───────────────────────────────┐
│ GC: Stop all goroutines       │
├───────────────────────────────┤
│ Scan stacks of all G's:       │
│  for each g in allgs {        │
│    scan(g.stack)              │
│  }                            │
└───────────────────────────────┘
   Typical: 0.1 - 1.0 ms

Mark Termination STW (Phase 3):
┌──────────────────────────────────┐
│ GC: Stop all goroutines again    │
├──────────────────────────────────┤
│ Drain remaining mark work        │
│ Disable write barrier            │
│ Set phase = sweep                │
└──────────────────────────────────┘
   Typical: 0.5 - 2.0 ms

Modern systems might run 50-100K goroutines, yet STW pauses typically stay under 2ms despite that scale.

GC Pacer: Timing GC Triggers

The pacer decides when to trigger the next GC. It aims to keep garbage collection work steady throughout execution, avoiding bursty behavior.

Default policy: GC triggers when heap size doubles (controlled by GOGC).

GOGC=100 (default):
    trigger = heapMarked * 2

GOGC=50:
    trigger = heapMarked * 1.5  (GC runs more often)

GOGC=200:
    trigger = heapMarked * 3    (GC runs less often)

Timeline with GOGC=100:

    live heap after GC:   10 MB → 20 MB → 40 MB
    next trigger (2x):    20 MB   40 MB   80 MB

    The heap grows until it reaches the trigger, GC runs, and the
    next trigger is recomputed from the heap that survived marking.

The pacer estimates GC cost and spreads work across available CPU:

// Simplified sketch of the pacer's trigger decision
// (the real pacer in runtime/mgcpacer.go is far more involved)
type gcPacer struct {
    heapMarked uint64 // live heap at the end of the last GC
    gcPercent  uint64 // the GOGC value
}

func (p *gcPacer) shouldTriggerGC(heapLive uint64) bool {
    // e.g. GOGC=100: trigger at 2x the last live heap
    trigger := p.heapMarked + p.heapMarked*p.gcPercent/100
    return heapLive >= trigger
}

Starting Go 1.19, GOMEMLIMIT provides a soft memory limit: the GC becomes more aggressive as memory approaches the limit.

// Set a 512MB memory limit
debug.SetMemoryLimit(512 * 1024 * 1024)

// GC will trigger more frequently as heap
// approaches 512MB, regardless of GOGC

Mark Assist: Helping During Allocation

If a goroutine allocates a lot during marking phase, it gets roped into helping with marking:

Application Goroutine Timeline:

Normal Path:
    alloc() ── [no GC] ── return
    (~10ns)

During Mark Phase (no assist needed):
    alloc() ── [check] ── return
    (~40ns, minimal overhead)

During Mark Phase (assist triggered):
    alloc() ── [MARK ASSIST] ── scan objects ── return
    (~1000ns+, significant overhead!)
             (forced to do marking work proportional to the allocation)

Mark assist can cause latency spikes:

// High allocation rate during marking
func heavyAllocDuringGC() {
    for i := 0; i < 1e7; i++ {
        x := make([]byte, 1024)
        _ = x
    }
}

// Executed during marking phase:
// - First few allocations: fast (~100ns)
// - Then: MARK ASSIST triggers
// - Subsequent allocations: slow (~1000ns+)
// - Cause: forced to scan 5-10MB of heap

You can observe mark assist in a CPU profile as time spent in runtime.gcAssistAlloc.
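On recent Go releases, runtime/metrics also reports cumulative assist CPU time directly. The metric name below is an assumption worth verifying against your version's metrics.All() output (it was added in the Go 1.20 era), so the code handles its absence:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// markAssistSeconds returns the cumulative CPU time goroutines have
// spent in mark assist, and whether the metric exists on this Go version.
func markAssistSeconds() (float64, bool) {
	sample := []metrics.Sample{
		{Name: "/cpu/classes/gc/mark/assist:cpu-seconds"},
	}
	metrics.Read(sample)
	if sample[0].Value.Kind() != metrics.KindFloat64 {
		return 0, false // unknown metric on this Go version
	}
	return sample[0].Value.Float64(), true
}

func main() {
	if s, ok := markAssistSeconds(); ok {
		fmt.Printf("mark assist CPU: %.6fs\n", s)
	} else {
		fmt.Println("assist metric not available on this Go version")
	}
}
```

Polling this counter before and after a suspect window tells you how much of that window went to assists.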

STW Duration Measurement

Measure actual GC pauses:

package main

import "runtime"

func main() {
    // GODEBUG is read at program startup, so it must be set in the
    // environment before launch - os.Setenv at runtime has no effect:
    //   GODEBUG=gctrace=1 go run main.go

    // Allocate heavily
    for i := 0; i < 100; i++ {
        s := make([]byte, 1024*1024) // 1MB
        _ = s
    }

    runtime.GC() // Force a GC cycle so a trace line is printed
}

// Run with:
// GODEBUG=gctrace=1 go run main.go
//
// Output lines look like:
// gc 1 @0.012s 4%: 0.021+1.5+0.043 ms clock, ...
//
// The three "clock" numbers are:
//   STW sweep termination + concurrent mark + STW mark termination

Or use runtime.ReadMemStats:

func lastGCPause() time.Duration {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    if m.NumGC == 0 {
        return 0 // no GC has run yet
    }
    // PauseNs is a 256-entry ring buffer; the most recent
    // pause lives at index (NumGC+255)%256.
    return time.Duration(m.PauseNs[(m.NumGC+255)%256])
}
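Since Go 1.16, the runtime/metrics package exposes the same data as a full pause-latency histogram, without the 256-entry ring buffer limitation. A small sketch (the helper name is ours; the metric name is checked at runtime):

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/metrics"
)

// gcPauseBuckets forces a GC cycle, then returns the runtime's
// histogram of stop-the-world pause latencies (nil if the metric
// is unavailable on this Go version).
func gcPauseBuckets() *metrics.Float64Histogram {
	runtime.GC() // ensure at least one cycle has recorded pauses

	sample := []metrics.Sample{{Name: "/gc/pauses:seconds"}}
	metrics.Read(sample)
	if sample[0].Value.Kind() != metrics.KindFloat64Histogram {
		return nil
	}
	return sample[0].Value.Float64Histogram()
}

func main() {
	h := gcPauseBuckets()
	if h == nil {
		fmt.Println("/gc/pauses:seconds not supported")
		return
	}
	var total uint64
	for _, c := range h.Counts {
		total += c
	}
	fmt.Printf("recorded %d STW pauses across %d buckets\n", total, len(h.Counts))
}
```

Unlike PauseNs, the histogram never drops old samples, so it suits long-running services.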

GC Tuning Parameters

GOGC

Controls heap growth factor:

// Aggressive GC (less memory, more pause time)
debug.SetGCPercent(25)  // GC when heap = 1.25x

// Lazy GC (more memory, less pause time)
debug.SetGCPercent(200)  // GC when heap = 3x

// Disable GC (not recommended!)
debug.SetGCPercent(-1)  // No automatic GC

GOMEMLIMIT (Go 1.19+)

Sets soft memory ceiling:

// Ensure process doesn't exceed 512MB
debug.SetMemoryLimit(512 * 1024 * 1024)

// GC becomes more aggressive as memory approaches limit
// Better than GOGC for containerized workloads

Manual GC Trigger

// Force GC cycle
runtime.GC()

// Useful for:
// 1. Predictable pause times (run during off-peak)
// 2. Testing GC behavior
// 3. Critical sections (run GC before latency-sensitive code)

Stack Scanning: A Bottleneck

All goroutine stacks must be scanned for root pointers during Mark Setup (STW).

Mark Setup:
┌─────────────────────────────────┐
│ for each goroutine g {          │
│   scan(g.stack)                 │
│ }                               │
└─────────────────────────────────┘

Cost: O(number of goroutines * avg stack size)

Examples:
- 1,000 goroutines × 2KB avg = 2MB scan = ~0.1ms STW
- 100,000 goroutines × 4KB avg = 400MB scan = ~5ms STW
- 1M goroutines × 8KB avg = 8GB scan = ~50ms STW

More goroutines, and deeper stacks on each, mean more root memory to scan and longer pauses.

Optimization: Reduce goroutine count where possible, or keep stacks shallow.
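A rough way to gauge this cost on a live process is to compare goroutine count against total stack memory in use. NumGoroutine and MemStats.StackInuse are real runtime APIs; the helper function and the count of 1000 parked goroutines are just for illustration:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parkAndMeasure starts n goroutines that block on a channel, then
// reports the goroutine count and total stack memory in use - a
// proxy for how much root memory the GC must scan.
func parkAndMeasure(n int) (goroutines int, stackBytes uint64) {
	var wg sync.WaitGroup
	stop := make(chan struct{})
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-stop // parked: stack stays live and must be scanned
		}()
	}

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	goroutines = runtime.NumGoroutine()
	stackBytes = m.StackInuse

	close(stop)
	wg.Wait()
	return goroutines, stackBytes
}

func main() {
	g, s := parkAndMeasure(1000)
	fmt.Printf("goroutines: %d, stack in use: %d KB (~%d bytes each)\n",
		g, s/1024, s/uint64(g))
}
```

Watching these two numbers in production quickly shows whether stack scanning is a plausible contributor to your pause times.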

Pointer Detection: What GC Scans

The GC only scans types that contain pointers. Consider:

// Type with pointers → must scan
type WithPointers struct {
    ptr *int
    str string      // string has internal pointer!
    m   map[string]int  // map has internal pointers
}

// Type without pointers → fast allocation, no scan
type WithoutPointers struct {
    b   [1024]byte  // just bytes
    x   int64
    y   float64
}

// GC implication:
// var good [1000]WithoutPointers  // no pointers: interior never scanned
// var bad  [1000]WithPointers     // every element must be scanned

// In maps:
// map[string]int     ← values are ints (value array not scanned)
// map[string]string  ← values contain pointers (scanned)
// map[int]*string    ← values are pointers (scanned)

A reflect-based check (unsafe is neither needed nor helpful here):

import "reflect"

// hasPointers reports whether values of type t contain
// pointers the GC would have to scan.
func hasPointers(t reflect.Type) bool {
    switch t.Kind() {
    case reflect.Ptr, reflect.UnsafePointer, reflect.Map, reflect.Chan,
        reflect.Slice, reflect.String, reflect.Func, reflect.Interface:
        return true
    case reflect.Array:
        return hasPointers(t.Elem())
    case reflect.Struct:
        for i := 0; i < t.NumField(); i++ {
            if hasPointers(t.Field(i).Type) {
                return true
            }
        }
    }
    return false
}

Optimization: When possible, use non-pointer types in hot data structures:

// Bad: map values need scanning
eventMap := make(map[string]*Event)

// Better: values are integers (no scan)
eventIndexMap := make(map[string]int)
events := make([]*Event, 0)

Green Tea GC (Experimental, Go 1.25+)

Upcoming improvements (enabled with GOEXPERIMENT=greenteagc) aim to reduce marking and mark-assist overhead by improving the spatial locality of mark work and distributing it better across CPUs.

Performance Implications: Benchmarking GC Impact

Benchmark: GC Pause Times vs GOGC

func BenchmarkGCPauseVsGOGC(b *testing.B) {
    tests := []int{25, 50, 100, 200}

    for _, gogc := range tests {
        b.Run(fmt.Sprintf("GOGC=%d", gogc), func(b *testing.B) {
            old := debug.SetGCPercent(gogc)
            defer debug.SetGCPercent(old)

            var totalPause time.Duration
            var lastGC uint32 // MemStats.NumGC is a uint32

            b.ResetTimer()
            for i := 0; i < b.N; i++ {
                // Allocate heavily
                s := make([]byte, 1024)
                _ = s

                // Measure pause times (note: ReadMemStats itself
                // briefly stops the world, which adds overhead)
                var m runtime.MemStats
                runtime.ReadMemStats(&m)
                if m.NumGC > lastGC {
                    lastGC = m.NumGC
                    pause := time.Duration(m.PauseNs[(m.NumGC+255)%256])
                    totalPause += pause
                }
            }
            b.ReportMetric(float64(totalPause.Milliseconds()), "total_pause_ms")
        })
    }
}

// Sample results:
// GOGC=25:  more frequent GC, smaller heaps, more pauses (5-10ms total)
// GOGC=100: balanced (20-30ms total pause across test)
// GOGC=200: less frequent GC, larger heaps, fewer pauses (15-20ms total)

Benchmark: Mark Assist Latency

func BenchmarkMarkAssistLatency(b *testing.B) {
    b.Run("before_gc", func(b *testing.B) {
        runtime.GC() // start clean: no marking in progress
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            _ = make([]byte, 1024)
        }
    })

    b.Run("during_gc_marking", func(b *testing.B) {
        // Background goroutine allocates continuously so GC cycles
        // (and their marking phases) keep occurring while we measure.
        stop := make(chan struct{})
        go func() {
            for {
                select {
                case <-stop:
                    return
                default:
                    _ = make([]byte, 1024)
                }
            }
        }()
        defer close(stop)

        time.Sleep(10 * time.Millisecond) // let marking start

        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            _ = make([]byte, 1024) // may incur mark assist
        }
    })
}

// Sample results:
// before_gc:           ~100ns per allocation
// during_gc_marking:   ~500ns per allocation (mark assist overhead)

Strategies for Low-Latency GC

1. Reduce Allocation Rate

// Bad: allocates inside loop
for _, item := range items {
    buf := make([]byte, 4096)
    process(buf, item)
}

// Good: reuse buffer
buf := make([]byte, 4096)
for _, item := range items {
    clearBuffer(buf)
    process(buf, item)
}

2. Use Object Pools

var pool = sync.Pool{
    New: func() interface{} {
        return make([]byte, 4096)
    },
}

// In hot path:
buf := pool.Get().([]byte)
defer pool.Put(buf)
// Use buf

3. Tune GOGC for Workload

// Latency-sensitive workload
debug.SetGCPercent(200)  // Less frequent GC

// Memory-sensitive workload
debug.SetGCPercent(50)   // More frequent GC, smaller heap

// Containerized workload (Go 1.19+)
debug.SetMemoryLimit(containerMemory)  // Automatic tuning

4. Avoid High Allocation During Mark Phase

Identify when GC cycles (and their mark phases) are happening:

GODEBUG=gctrace=1 ./program 2>&1 | grep "^gc "

Schedule heavy allocations before or after GC cycles.

5. Reduce Pointer Density

// Bad: lots of pointers
type Event struct {
    ID *string
    Name *string
    Value *int64
    Data *[]byte
}

// Good: fewer pointers
type Event struct {
    ID, Name string
    Value int64
    Data []byte
}

6. Goroutine Count Awareness

More goroutines = longer STW (stack scanning):

// Profile goroutine count
runtime.NumGoroutine()  // Returns count

// If approaching 100K:
// - Consider worker pool pattern
// - Combine goroutines where possible
// - Use channels to multiplex work
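A minimal worker-pool sketch of the first suggestion (the function name, worker count, and task workload are all illustrative): a fixed set of workers drains a task channel, keeping the goroutine count, and therefore the stack memory the GC must scan, bounded no matter how many tasks arrive.

```go
package main

import (
	"fmt"
	"sync"
)

// runPool processes tasks 1..n with a fixed number of workers.
// Only workers+3 goroutines ever exist, instead of one per task.
func runPool(n, workers int) int {
	process := func(task int) int { return task * task } // stand-in work

	tasks := make(chan int)
	results := make(chan int)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range tasks {
				results <- process(t)
			}
		}()
	}

	// Feed tasks, then signal workers to exit.
	go func() {
		for i := 1; i <= n; i++ {
			tasks <- i
		}
		close(tasks)
	}()

	// Close results once every worker is done.
	go func() {
		wg.Wait()
		close(results)
	}()

	sum := 0
	for r := range results {
		sum += r
	}
	return sum
}

func main() {
	// 8 workers handle 100 tasks with a bounded goroutine count.
	fmt.Println("sum of squares 1..100:", runPool(100, 8))
	// prints: sum of squares 1..100: 338350
}
```

The channels here also serve as the "multiplex work" suggestion above: many tasks flow through a few long-lived goroutines.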

Summary

Go's garbage collector is a marvel of concurrent algorithm design:

  1. Concurrent marking: 95%+ of GC time doesn't pause application
  2. Write barrier: Enables concurrent safety without STW during marking
  3. Tri-color marking: Efficient, proven algorithm for reachability analysis
  4. GC pacer: Distributes GC work evenly throughout execution
  5. Mark assist: Prevents unbounded latency spikes from heavy allocators

To write fast Go code:

  • Understand GC pause times vary with workload
  • Profile with gctrace to see actual behavior
  • Minimize allocations in latency-sensitive paths
  • Use object pools for frequently allocated types
  • Keep goroutine counts reasonable
  • Tune GOGC or GOMEMLIMIT for your workload
  • Remember: GC is not free, but it's predictable

The GC won't make your program slow if you respect its constraints. Write allocation-efficient code, and the GC will be your friend.
