Go Performance Guide
Compiler & Runtime

GOMAXPROCS and Scheduler Tuning

Optimize Go's scheduler by understanding GOMAXPROCS, work stealing, and preemption for CPU-bound and I/O-bound workloads.

GOMAXPROCS controls the maximum number of OS threads that can execute Go code simultaneously. This single parameter fundamentally affects your application's performance, especially in containerized environments where CPU resources are constrained. Understanding and tuning the scheduler is essential for maximizing throughput and responsiveness.

What GOMAXPROCS Controls

GOMAXPROCS defines the number of logical processors available to the Go runtime:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Get current GOMAXPROCS setting
    maxProcs := runtime.GOMAXPROCS(-1)
    fmt.Printf("GOMAXPROCS: %d\n", maxProcs)

    // Get number of CPU cores
    numCPU := runtime.NumCPU()
    fmt.Printf("NumCPU: %d\n", numCPU)

    // Change GOMAXPROCS at runtime
    oldProcs := runtime.GOMAXPROCS(4)
    fmt.Printf("Previous GOMAXPROCS: %d\n", oldProcs)

    // Now only 4 OS threads can execute Go code
    newProcs := runtime.GOMAXPROCS(-1)
    fmt.Printf("New GOMAXPROCS: %d\n", newProcs)
}

Default behavior: Go automatically sets GOMAXPROCS to the number of CPUs detected on the machine:

# On a 16-core machine
./app
# Output:
# GOMAXPROCS: 16
# NumCPU: 16

Note: GOMAXPROCS is the number of P's (logical processors) in the G-M-P model, not OS threads. Multiple goroutines can run on each P.

When to Adjust GOMAXPROCS

Default is Usually Correct

For most applications on dedicated servers, the default (NumCPU) is optimal:

# Standard deployment (uses all available CPU cores)
go run app.go

# Explicitly set to the core count (same as the default on a 16-core machine)
GOMAXPROCS=16 go run app.go

Containerized Environments

Container runtimes (Docker, Kubernetes) can cap CPU resources through cgroups. Before Go 1.25, the runtime did not detect these limits and sized GOMAXPROCS from the host's core count:

# Docker container limited to 2 CPUs
docker run --cpus=2 myapp

# Inside the container, Go (pre-1.25) still sees all 16 host cores
./app
# Output: GOMAXPROCS: 16 (but only 2 cores available!)

This mismatch causes thread oversubscription and poor performance. Solutions: run Go 1.25+ (which applies the cgroup limit itself), use the automaxprocs library, or set GOMAXPROCS explicitly.

Automaxprocs for Kubernetes

The automaxprocs library automatically detects CPU quotas:

go get go.uber.org/automaxprocs

Integration in your application:

package main

import (
    "fmt"
    "runtime"

    _ "go.uber.org/automaxprocs"
)

func main() {
    // automaxprocs automatically sets GOMAXPROCS to CPU quota
    fmt.Printf("GOMAXPROCS: %d\n", runtime.GOMAXPROCS(-1))

    // Application code
}

Example Kubernetes deployment:

apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      limits:
        cpu: "2"  # 2 CPU cores limit
      requests:
        cpu: "2"
---
# With automaxprocs, app automatically uses GOMAXPROCS=2

Without automaxprocs, explicitly set:

# Detect the CPU quota (cgroup v1 paths) and export GOMAXPROCS before exec
CPU_QUOTA=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
CPU_PERIOD=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
if [ "$CPU_QUOTA" -gt 0 ]; then
    export GOMAXPROCS=$((CPU_QUOTA / CPU_PERIOD))
fi

exec ./app

Go Scheduler Internals: The G-M-P Model

The Go scheduler manages goroutines using three key components:

  • G (Goroutine): A lightweight thread executing Go code. Thousands can exist.
  • M (Machine): An OS thread managed by the Go runtime.
  • P (Processor): A logical processor with a run queue. GOMAXPROCS controls the count.

Architecture overview:

┌──────────────────────────────────────────────────────────────┐
│ Go Runtime                                                   │
│                                                              │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │ P0             │  │ P1             │  │ P2             │  │
│  │ runq: [G1, G2] │  │ runq: [G3, G4] │  │ runq: [G5]     │  │
│  │                │  │                │  │                │  │
│  │ M0 (OS thread) │  │ M1 (OS thread) │  │ M2 (OS thread) │  │
│  └───────┬────────┘  └───────┬────────┘  └───────┬────────┘  │
└──────────┼───────────────────┼───────────────────┼───────────┘
           │                   │                   │
      ┌────▼───────────────────▼───────────────────▼────┐
      │ Operating System (Linux, macOS, Windows)        │
      │ 3 CPU cores available                           │
      └─────────────────────────────────────────────────┘

Key points:

  • Each P has its own run queue of goroutines
  • Each M executes code from its assigned P
  • GOMAXPROCS = number of P's (currently 3 in the diagram)
  • More M's can exist than P's (for blocking I/O)

Work Stealing Algorithm

When one P's run queue is empty, it steals goroutines from other P's:

package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

func worker(id int, wg *sync.WaitGroup) {
    defer wg.Done()
    fmt.Printf("Worker %d started\n", id)
    time.Sleep(10 * time.Millisecond)
}

func main() {
    runtime.GOMAXPROCS(4)

    // All 16 goroutines are initially queued on the spawning P's local run queue
    var wg sync.WaitGroup
    for i := 0; i < 16; i++ {
        wg.Add(1)
        go worker(i, &wg)
    }

    wg.Wait()
    fmt.Println("All workers complete")
}

Go does not expose which P a goroutine runs on, so the imbalance and the rebalancing are best observed with the scheduler trace (illustrative output):

GODEBUG=schedtrace=10 ./app
# SCHED 0ms:  gomaxprocs=4 ... runqueue=0 [13 0 0 0]  # all on the spawning P
# SCHED 10ms: gomaxprocs=4 ... runqueue=0 [3 3 4 2]   # stolen and rebalanced

Work stealing ensures:

  • No P sits idle if others have work
  • Automatic load balancing across available processors
  • Fair distribution of CPU time

Goroutine Scheduling Overhead

Each context switch between goroutines has a cost:

package main

import (
    "sync"
    "testing"
)

var counter int
var mu sync.Mutex

func incrementWithMutex() {
    mu.Lock()
    counter++
    mu.Unlock()
}

func BenchmarkSchedulingOverhead(b *testing.B) {
    // Sequential: a single goroutine, no contention, no parking
    b.Run("Sequential", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            incrementWithMutex()
        }
    })

    // High contention: 16 goroutines fight over one mutex, so goroutines
    // are constantly parked and rescheduled
    b.Run("HighContention", func(b *testing.B) {
        iters := b.N / 16
        if iters == 0 {
            iters = 1
        }
        var wg sync.WaitGroup
        for g := 0; g < 16; g++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for i := 0; i < iters; i++ {
                    incrementWithMutex()
                }
            }()
        }
        wg.Wait()
    })
}

Benchmark results:

go test -bench=Scheduling -benchtime=5s -benchmem
# BenchmarkSchedulingOverhead/Sequential-8        100000000    56.2 ns/op
# BenchmarkSchedulingOverhead/HighContention-8    10000000   523.0 ns/op (9x slower)

Context switching overhead comes from:

  • Cache pollution (warm L1/L2 cache lines evicted by the incoming goroutine)
  • TLB (Translation Lookaside Buffer) misses
  • Register state save/restore
  • Memory barriers for synchronization

Minimize overhead by:

  • Reducing contention on shared resources
  • Using lock-free data structures
  • Batching operations to reduce switching frequency
  • Choosing an appropriate GOMAXPROCS for your workload

Preemption in Go 1.14+

Go 1.14 introduced asynchronous preemption, preventing tight loops from starving other goroutines:

Go 1.13 behavior (no preemption):

package main

import (
    "fmt"
    "runtime"
    "time"
)

func busyWork() {
    // Tight loop with no function calls: no preemption points
    for {
    }
}

func main() {
    runtime.GOMAXPROCS(1) // one P makes the starvation visible

    go busyWork()           // Goroutine 1: spins forever, never yields
    go fmt.Println("Hello") // Goroutine 2: waits for the only P

    time.Sleep(1 * time.Second)
    fmt.Println("done")
}

On Go 1.13 the program hangs: busyWork monopolizes the only P, so Goroutine 2 never executes and main never resumes after its sleep expires.

Go 1.14+ behavior (with preemption):

The same code now works correctly because the scheduler can preempt Goroutine 1:

# Go 1.14+
go run app.go
# Output:
# Hello
# done

Preemption is implemented by sending a signal (SIGURG on Unix) to the goroutine's OS thread, rather than by explicit yield points. For diagnosis, GODEBUG=asyncpreemptoff=1 restores the old cooperative-only behavior.

GODEBUG=schedtrace for Scheduler Diagnostics

Enable scheduler tracing to understand scheduling behavior:

GODEBUG=schedtrace=1000 ./app 2>&1 | head -20

Output format:

SCHED 1000ms: gomaxprocs=4 idleprocs=1 threads=5 spinningthreads=1 idlethreads=2 runqueue=0 [4 0 0 2]
SCHED 2000ms: gomaxprocs=4 idleprocs=0 threads=5 spinningthreads=0 idlethreads=0 runqueue=5 [2 1 2 0]
SCHED 3000ms: gomaxprocs=4 idleprocs=2 threads=8 spinningthreads=0 idlethreads=5 runqueue=0 [0 0 1 1]

Interpreting the output:

Field              Meaning
gomaxprocs=4       4 logical processors (P's) configured
idleprocs=1        1 P currently idle
threads=5          5 OS threads created
spinningthreads=1  1 thread spinning (looking for work)
idlethreads=2      2 threads parked, waiting to be reused
runqueue=0         0 goroutines in the global run queue
[4 0 0 2]          Per-P local run queue lengths (P0: 4, P1: 0, P2: 0, P3: 2)

Analysis example:

# Trace with more detail
GODEBUG=schedtrace=1000,scheddetail=1 ./app 2>&1 | head -30

With scheddetail=1, each trace line is followed by a dump of every P, M, and G:

  • Each P: its status, the M bound to it, and its local run queue length
  • Each M: the P and G it is running, and whether it is spinning or locked
  • Each G: its status (running, runnable, waiting) and the wait reason

Benchmark: GOMAXPROCS Impact on Different Workloads

CPU-Bound Workload

package main

import (
    "fmt"
    "runtime"
    "sync"
    "testing"
    "time"
)

func cpuBound() int64 {
    // CPU-intensive calculation
    result := int64(0)
    for i := 0; i < 100_000_000; i++ {
        result += int64(i % 97)
    }
    return result
}

func BenchmarkCPUBoundDifferentGOMAXPROCS(b *testing.B) {
    for _, procs := range []int{1, 2, 4, 8, 16} {
        runtime.GOMAXPROCS(procs)
        b.Run(fmt.Sprintf("GOMAXPROCS=%d", procs), func(b *testing.B) {
            var wg sync.WaitGroup
            start := time.Now()

            // Spawn one goroutine per P; split b.N across them
            iters := b.N / procs
            if iters == 0 {
                iters = 1
            }
            for i := 0; i < procs; i++ {
                wg.Add(1)
                go func() {
                    defer wg.Done()
                    for j := 0; j < iters; j++ {
                        _ = cpuBound()
                    }
                }()
            }

            wg.Wait()
            elapsed := time.Since(start)
            fmt.Printf("  Time: %v\n", elapsed)
        })
    }
}

Results on a 4-core machine:

go test -bench=BenchmarkCPUBound -benchtime=10s -run=^$ ./...
# BenchmarkCPUBoundDifferentGOMAXPROCS/GOMAXPROCS=1-4    Time: 4.2s
# BenchmarkCPUBoundDifferentGOMAXPROCS/GOMAXPROCS=2-4    Time: 2.1s
# BenchmarkCPUBoundDifferentGOMAXPROCS/GOMAXPROCS=4-4    Time: 1.0s
# BenchmarkCPUBoundDifferentGOMAXPROCS/GOMAXPROCS=8-4    Time: 1.1s (overhead)

Key findings:

  • Near-linear speedup up to the physical core count
  • GOMAXPROCS > NumCPU adds context-switch overhead without any gain
  • CPU-bound work is the case where GOMAXPROCS tuning matters most

I/O-Bound Workload

package main

import (
    "fmt"
    "io"
    "net/http"
    "runtime"
    "sync"
    "testing"
)

func fetchURL(url string) int {
    resp, err := http.Get(url)
    if err != nil {
        return 0
    }
    defer resp.Body.Close()

    bytes, _ := io.ReadAll(resp.Body)
    return len(bytes)
}

func BenchmarkIOBoundDifferentGOMAXPROCS(b *testing.B) {
    urls := []string{
        "https://httpbin.org/delay/1",
        "https://httpbin.org/delay/1",
        "https://httpbin.org/delay/1",
    }

    for _, procs := range []int{1, 2, 4, 8} {
        runtime.GOMAXPROCS(procs)
        b.Run(fmt.Sprintf("GOMAXPROCS=%d", procs), func(b *testing.B) {
            var wg sync.WaitGroup

            for i := 0; i < b.N; i++ {
                wg.Add(1)
                go func() {
                    for _, url := range urls {
                        _ = fetchURL(url)
                    }
                    wg.Done()
                }()
            }

            wg.Wait()
        })
    }
}

Results for I/O-bound work (illustrative; wall-clock time is dominated by the 1-second server delays):

# BenchmarkIOBoundDifferentGOMAXPROCS/GOMAXPROCS=1-4    Time: 15.3s
# BenchmarkIOBoundDifferentGOMAXPROCS/GOMAXPROCS=2-4    Time: 15.2s
# BenchmarkIOBoundDifferentGOMAXPROCS/GOMAXPROCS=4-4    Time: 15.1s
# BenchmarkIOBoundDifferentGOMAXPROCS/GOMAXPROCS=8-4    Time: 15.0s

Key findings:

  • GOMAXPROCS has minimal impact on throughput
  • I/O-bound work barely competes for CPU time
  • While a goroutine waits on the network, its P simply runs another goroutine
  • Thousands of concurrent requests can be managed efficiently this way

Practical GOMAXPROCS Tuning Strategy

Step 1: Identify workload type

// Measure CPU usage percentage
// CPU-bound: 90%+ utilization
// I/O-bound: 20-50% utilization

Step 2: Set GOMAXPROCS appropriately

package main

import (
    "fmt"
    "runtime"
)

func init() {
    // The runtime already honors the GOMAXPROCS environment variable at
    // startup, so manual parsing is unnecessary; just log the effective value.
    fmt.Printf("GOMAXPROCS set to: %d\n", runtime.GOMAXPROCS(-1))
}

func main() {
    // Application code
}

Step 3: Monitor with metrics

# Use GODEBUG to monitor scheduling
GODEBUG=schedtrace=1000 ./app

Recommended settings:

Scenario                        GOMAXPROCS            Rationale
Dedicated server (CPU-bound)    NumCPU (default)      Full utilization
Kubernetes pod (2-core limit)   2                     Match the quota (automaxprocs)
Mixed workload                  NumCPU/2 to NumCPU    Tune based on metrics
Multi-tenant container          Explicit limit        Prevent oversubscription

Complete Example: Tunable Application

package main

import (
    "flag"
    "fmt"
    "os"
    "runtime"
    "sync"
    "time"

    _ "go.uber.org/automaxprocs"
)

func main() {
    // Runtime flags
    gomaxprocs := flag.Int("gomaxprocs", 0, "max processors (0=auto)")
    duration := flag.Duration("duration", 10*time.Second, "run duration")
    workload := flag.String("workload", "mixed", "cpu|io|mixed")
    flag.Parse()

    // Auto-detect or set GOMAXPROCS
    if *gomaxprocs > 0 {
        runtime.GOMAXPROCS(*gomaxprocs)
    }

    fmt.Printf("Running with GOMAXPROCS=%d for %v\n",
        runtime.GOMAXPROCS(-1), *duration)

    start := time.Now()
    var wg sync.WaitGroup

    switch *workload {
    case "cpu":
        for i := 0; i < runtime.GOMAXPROCS(-1); i++ {
            wg.Add(1)
            go cpuWork(&wg, start, *duration)
        }
    case "io":
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go ioWork(&wg, start, *duration)
        }
    case "mixed":
        for i := 0; i < 50; i++ {
            wg.Add(1)
            go cpuWork(&wg, start, *duration)
            wg.Add(1)
            go ioWork(&wg, start, *duration)
        }
    }

    wg.Wait()
    fmt.Println("Complete")
}

func cpuWork(wg *sync.WaitGroup, start time.Time, duration time.Duration) {
    defer wg.Done()
    var sink int64
    for time.Since(start) < duration {
        sum := int64(0)
        for i := 0; i < 1_000_000; i++ {
            sum += int64(i % 13)
        }
        sink += sum
    }
    _ = sink // keep the result live so the inner loop isn't optimized away
}

func ioWork(wg *sync.WaitGroup, start time.Time, duration time.Duration) {
    defer wg.Done()
    for time.Since(start) < duration {
        time.Sleep(10 * time.Millisecond)
    }
}

Usage:

# Auto-detect CPU count
./app -workload cpu

# Use explicit GOMAXPROCS
./app -gomaxprocs 2 -workload cpu

# Mixed workload with auto-detection
./app -workload mixed

Summary

GOMAXPROCS is a critical tuning parameter:

  • Default (NumCPU) works for dedicated servers
  • Automaxprocs library recommended for Kubernetes
  • CPU-bound workloads scale linearly to physical core count
  • I/O-bound workloads are GOMAXPROCS-insensitive
  • Preemption (since Go 1.14) prevents starvation
  • Scheduler tracing reveals bottlenecks via GODEBUG

Master GOMAXPROCS to unlock performance in containerized, resource-constrained, and heterogeneous environments.
