GOMAXPROCS and Scheduler Tuning
Optimize Go's scheduler by understanding GOMAXPROCS, work stealing, and preemption for CPU-bound and I/O-bound workloads.
GOMAXPROCS controls the maximum number of OS threads that can execute Go code simultaneously. This single parameter fundamentally affects your application's performance, especially in containerized environments where CPU resources are constrained. Understanding and tuning the scheduler is essential for maximizing throughput and responsiveness.
What GOMAXPROCS Controls
GOMAXPROCS defines the number of logical processors available to the Go runtime:
package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Get the current GOMAXPROCS setting without changing it
    maxProcs := runtime.GOMAXPROCS(-1)
    fmt.Printf("GOMAXPROCS: %d\n", maxProcs)

    // Get the number of CPU cores
    numCPU := runtime.NumCPU()
    fmt.Printf("NumCPU: %d\n", numCPU)

    // Change GOMAXPROCS at runtime; the previous value is returned
    oldProcs := runtime.GOMAXPROCS(4)
    fmt.Printf("Previous GOMAXPROCS: %d\n", oldProcs)

    // Now at most 4 OS threads can execute Go code simultaneously
    newProcs := runtime.GOMAXPROCS(-1)
    fmt.Printf("New GOMAXPROCS: %d\n", newProcs)
}
Default behavior: Go automatically sets GOMAXPROCS to the number of CPUs detected on the machine:
# On a 16-core machine
./app
# Output:
# GOMAXPROCS: 16
# NumCPU: 16
Note: GOMAXPROCS is the number of P's (logical processors) in the G-M-P model, not the number of OS threads. Many goroutines can be multiplexed onto each P.
When to Adjust GOMAXPROCS
Default is Usually Correct
For most applications on dedicated servers, the default (NumCPU) is optimal:
# Standard deployment (uses all available CPU cores)
go run app.go
# A GOMAXPROCS value of 0 or less is ignored by the runtime,
# so this behaves the same as the default (all cores)
GOMAXPROCS=0 go run app.go
Containerized Environments
Container orchestration systems (Kubernetes, Docker) may limit CPU resources. The Go runtime doesn't automatically detect these limits:
# Docker container limited to 2 CPUs
docker run --cpus=2 myapp
# Inside container, Go still sees 16 cores
./app
# Output: GOMAXPROCS: 16 (but only 2 cores available!)
This mismatch causes thread oversubscription and poor performance. Solution: use the automaxprocs library or set GOMAXPROCS explicitly.
Automaxprocs for Kubernetes
The automaxprocs library automatically detects CPU quotas:
go get go.uber.org/automaxprocs
Integration in your application:
package main

import (
    "fmt"
    "runtime"

    _ "go.uber.org/automaxprocs"
)

func main() {
    // The automaxprocs import automatically sets GOMAXPROCS to the CPU quota
    fmt.Printf("GOMAXPROCS: %d\n", runtime.GOMAXPROCS(-1))
    // Application code
}
Example Kubernetes deployment:
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      limits:
        cpu: "2"    # 2 CPU cores limit
      requests:
        cpu: "2"
---
# With automaxprocs, the app automatically uses GOMAXPROCS=2
Without automaxprocs, set it explicitly (cgroup v1 paths shown):
# Detect the CPU quota (cgroup v1) and set GOMAXPROCS for the child process
CPU_QUOTA=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
CPU_PERIOD=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
export GOMAXPROCS=$((CPU_QUOTA / CPU_PERIOD))
exec ./app
Go Scheduler Internals: The G-M-P Model
The Go scheduler manages goroutines using three key components:
- G (Goroutine): A lightweight thread executing Go code. Thousands can exist.
- M (Machine): An OS thread managed by the Go runtime.
- P (Processor): A logical processor with a run queue. GOMAXPROCS controls the count.
Architecture overview:
┌─────────────────────────────────────────────────────────┐
│                       Go Runtime                         │
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │   P (P0)     │  │   P (P1)     │  │   P (P2)     │   │
│  │ runq: [G1,G2]│  │ runq: [G3,G4]│  │ runq: [G5]   │   │
│  │              │  │              │  │              │   │
│  │ ┌───────────┐│  │ ┌───────────┐│  │ ┌───────────┐│   │
│  │ │    M0     ││  │ │    M1     ││  │ │    M2     ││   │
│  │ │(OS thread)││  │ │(OS thread)││  │ │(OS thread)││   │
│  │ └───────────┘│  │ └───────────┘│  │ └───────────┘│   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
│          │                 │                 │          │
└──────────┼─────────────────┼─────────────────┼──────────┘
           │                 │                 │
     ┌─────▼─────────────────▼─────────────────▼─────┐
     │   Operating System (Linux, macOS, Windows)    │
     │         CPU Cores: 3 cores available          │
     └───────────────────────────────────────────────┘
Key points:
- Each P has its own run queue of goroutines
- Each M executes code from its assigned P
- GOMAXPROCS = number of P's (currently 3 in the diagram)
- More M's can exist than P's (for blocking I/O)
Work Stealing Algorithm
When one P's run queue is empty, it steals goroutines from other P's:
package main

import (
    "fmt"
    "runtime"
    "sync"
)

func worker(id int, wg *sync.WaitGroup) {
    defer wg.Done()
    // Burn some CPU so the scheduler has real work to distribute.
    // (There is no public API to ask which P a goroutine is running on.)
    sum := 0
    for i := 0; i < 10_000_000; i++ {
        sum += i
    }
    fmt.Printf("Worker %d finished (sum=%d)\n", id, sum)
}

func main() {
    runtime.GOMAXPROCS(4)

    // 16 goroutines compete for 4 P's; idle P's steal queued work
    var wg sync.WaitGroup
    for i := 0; i < 16; i++ {
        wg.Add(1)
        go worker(i, &wg)
    }
    wg.Wait()
    fmt.Println("All workers complete")
}
To watch the stealing happen, run with scheduler tracing enabled (trace lines below are illustrative):
GODEBUG=schedtrace=100 ./app
# SCHED 100ms: gomaxprocs=4 idleprocs=0 threads=6 spinningthreads=0 idlethreads=1 runqueue=3 [2 1 0 1]
# SCHED 200ms: gomaxprocs=4 idleprocs=0 threads=6 spinningthreads=0 idlethreads=1 runqueue=0 [1 1 1 0]
The bracketed per-P run-queue lengths even out over time as idle P's steal goroutines from busy ones. Work stealing ensures:
- No P sits idle if others have work
- Automatic load balancing across available processors
- Fair distribution of CPU time
Goroutine Scheduling Overhead
Each context switch between goroutines has a cost:
package main

import (
    "sync"
    "testing"
)

var counter int
var mu sync.Mutex

func incrementWithMutex() {
    mu.Lock()
    counter++
    mu.Unlock()
}

func BenchmarkSchedulingOverhead(b *testing.B) {
    // Sequential: a single goroutine, no contention
    b.Run("Sequential", func(b *testing.B) {
        counter = 0
        for i := 0; i < b.N; i++ {
            incrementWithMutex()
        }
    })

    // Parallel with high contention: 16 goroutines fight over one mutex,
    // forcing frequent goroutine switches
    b.Run("HighContention", func(b *testing.B) {
        counter = 0
        var wg sync.WaitGroup
        for g := 0; g < 16; g++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for i := 0; i < b.N/16+1; i++ {
                    incrementWithMutex()
                }
            }()
        }
        wg.Wait()
    })
}
Benchmark results:
go test -bench=Scheduling -benchtime=5s -benchmem
# BenchmarkSchedulingOverhead/Sequential-8        100000000    56.2 ns/op
# BenchmarkSchedulingOverhead/HighContention-8     10000000   523.0 ns/op  (~9x slower)
Context-switching overhead comes from:
- Cache-line bouncing between cores (contended data invalidates L1/L2 lines)
- TLB (Translation Lookaside Buffer) misses
- Register state save/restore
- Memory barriers for synchronization
Minimize overhead by:
- Reducing contention on shared resources
- Using lock-free or sharded data structures
- Batching operations to reduce switching frequency
- Choosing an appropriate GOMAXPROCS for your workload
Preemption in Go 1.14+
Go 1.14 introduced asynchronous preemption, preventing tight loops from starving other goroutines:
Go 1.13 behavior (no preemption):
package main

import (
    "fmt"
    "runtime"
    "time"
)

func busyWork() {
    // Tight loop with no function calls: before Go 1.14 it contains
    // no preemption points, so it never yields its P
    for {
    }
}

func main() {
    runtime.GOMAXPROCS(1) // a single P makes the starvation visible

    go busyWork()
    time.Sleep(10 * time.Millisecond) // hand the only P to busyWork

    // Go 1.13: never reached -- busyWork monopolizes the P forever.
    // Go 1.14+: the scheduler preempts busyWork and this prints.
    fmt.Println("Hello")
}
In Go 1.13 this program hangs: once busyWork owns the only P, nothing can take it back, so main never resumes after its sleep.
Go 1.14+ behavior (with preemption):
The same code now works correctly because the scheduler can preempt the busy loop:
# Go 1.14+
go run app.go
# Output:
# Hello
# (Program completes)
Preemption is implemented via signals (SIGURG on Unix-like systems) sent to the running OS thread, not via explicit yield points compiled into the loop.
GODEBUG=schedtrace for Scheduler Diagnostics
Enable scheduler tracing to understand scheduling behavior:
GODEBUG=schedtrace=1000 ./app 2>&1 | head -20
Output format:
SCHED 1000ms: gomaxprocs=4 idleprocs=1 threads=5 spinningthreads=1 idlethreads=2 runqueue=0 [4 0 0 2]
SCHED 2000ms: gomaxprocs=4 idleprocs=0 threads=5 spinningthreads=0 idlethreads=0 runqueue=5 [2 1 2 0]
SCHED 3000ms: gomaxprocs=4 idleprocs=2 threads=8 spinningthreads=0 idlethreads=5 runqueue=0 [0 0 1 1]
Interpreting the output:
| Field | Meaning |
|---|---|
| gomaxprocs=4 | 4 logical processors (P's) configured |
| idleprocs=1 | 1 processor currently idle |
| threads=5 | 5 OS threads (M's) created |
| spinningthreads=1 | 1 thread spinning, looking for work |
| idlethreads=2 | 2 threads parked, waiting to be reused |
| runqueue=0 | 0 goroutines in the global run queue |
| [4 0 0 2] | Per-P local run-queue lengths (P0: 4, P1: 0, P2: 0, P3: 2) |
Analysis example:
# Trace with per-G, per-M, and per-P detail
GODEBUG=schedtrace=1000,scheddetail=1 ./app 2>&1 | head -30
This shows:
- The status of every goroutine (running, runnable, waiting, syscall)
- Which M is bound to which P, and which G each M is running
- Goroutines blocked on channels, locks, or syscalls
Benchmark: GOMAXPROCS Impact on Different Workloads
CPU-Bound Workload
package main

import (
    "fmt"
    "runtime"
    "sync"
    "testing"
    "time"
)

func cpuBound() int64 {
    // CPU-intensive calculation
    result := int64(0)
    for i := 0; i < 100_000_000; i++ {
        result += int64(i % 97)
    }
    return result
}

func BenchmarkCPUBoundDifferentGOMAXPROCS(b *testing.B) {
    for _, procs := range []int{1, 2, 4, 8, 16} {
        runtime.GOMAXPROCS(procs)
        b.Run(fmt.Sprintf("GOMAXPROCS=%d", procs), func(b *testing.B) {
            var wg sync.WaitGroup
            start := time.Now()
            // Split the iterations across one worker per P
            iters := b.N/procs + 1
            for i := 0; i < procs; i++ {
                wg.Add(1)
                go func() {
                    defer wg.Done()
                    for j := 0; j < iters; j++ {
                        _ = cpuBound()
                    }
                }()
            }
            wg.Wait()
            fmt.Printf("  Time: %v\n", time.Since(start))
        })
    }
}
Results on a 4-core machine:
go test -bench=BenchmarkCPUBound -benchtime=10s -run=^$ ./...
# BenchmarkCPUBoundDifferentGOMAXPROCS/GOMAXPROCS=1-4 Time: 4.2s
# BenchmarkCPUBoundDifferentGOMAXPROCS/GOMAXPROCS=2-4 Time: 2.1s
# BenchmarkCPUBoundDifferentGOMAXPROCS/GOMAXPROCS=4-4 Time: 1.0s
# BenchmarkCPUBoundDifferentGOMAXPROCS/GOMAXPROCS=8-4   Time: 1.1s  (oversubscription overhead)
Key findings:
- Near-linear improvement up to the physical core count
- GOMAXPROCS > NumCPU adds context-switch overhead with no gain
- CPU-bound work scales with GOMAXPROCS only up to NumCPU
I/O-Bound Workload
package main

import (
    "fmt"
    "io"
    "net/http"
    "runtime"
    "sync"
    "testing"
)

func fetchURL(url string) int {
    resp, err := http.Get(url)
    if err != nil {
        return 0
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    return len(body)
}

func BenchmarkIOBoundDifferentGOMAXPROCS(b *testing.B) {
    urls := []string{
        "https://httpbin.org/delay/1",
        "https://httpbin.org/delay/1",
        "https://httpbin.org/delay/1",
    }
    for _, procs := range []int{1, 2, 4, 8} {
        runtime.GOMAXPROCS(procs)
        b.Run(fmt.Sprintf("GOMAXPROCS=%d", procs), func(b *testing.B) {
            var wg sync.WaitGroup
            for i := 0; i < b.N; i++ {
                wg.Add(1)
                go func() {
                    defer wg.Done()
                    for _, url := range urls {
                        _ = fetchURL(url)
                    }
                }()
            }
            wg.Wait()
        })
    }
}
Results for I/O-bound work:
# BenchmarkIOBoundDifferentGOMAXPROCS/GOMAXPROCS=1-4 Time: 15.3s
# BenchmarkIOBoundDifferentGOMAXPROCS/GOMAXPROCS=2-4 Time: 15.2s
# BenchmarkIOBoundDifferentGOMAXPROCS/GOMAXPROCS=4-4 Time: 15.1s
# BenchmarkIOBoundDifferentGOMAXPROCS/GOMAXPROCS=8-4  Time: 15.0s
Key findings:
- Minimal impact from GOMAXPROCS
- I/O-bound work doesn't compete for CPU time
- While goroutines block on I/O, their P's pick up other runnable goroutines
- Thousands of goroutines can be managed efficiently
Practical GOMAXPROCS Tuning Strategy
Step 1: Identify workload type
// Measure CPU utilization under load:
//   CPU-bound: sustained 90%+ utilization
//   I/O-bound: 20-50% utilization, mostly waiting on the network or disk
Step 2: Set GOMAXPROCS appropriately
package main

import (
    "fmt"
    "os"
    "runtime"
    "strconv"
)

func init() {
    // Note: the Go runtime already honors a positive GOMAXPROCS environment
    // variable at startup; this init just makes the handling explicit.
    if procs := os.Getenv("GOMAXPROCS"); procs != "" {
        if n, err := strconv.Atoi(procs); err == nil && n > 0 {
            runtime.GOMAXPROCS(n)
        }
    }
    fmt.Printf("GOMAXPROCS set to: %d\n", runtime.GOMAXPROCS(-1))
}

func main() {
    // Application code
}
Step 3: Monitor with metrics
# Use GODEBUG to monitor scheduling
GODEBUG=schedtrace=1000 ./app
Recommended settings:
| Scenario | GOMAXPROCS | Rationale |
|---|---|---|
| Dedicated server (CPU-bound) | NumCPU | Full utilization |
| Kubernetes pod (2 core limit) | 2 | Use automaxprocs |
| Mixed workload | NumCPU / 2 to NumCPU | Tune based on metrics |
| Multi-tenant container | Explicit limit | Prevent overload |
Complete Example: Tunable Application
package main

import (
    "flag"
    "fmt"
    "runtime"
    "sync"
    "time"

    _ "go.uber.org/automaxprocs"
)

func main() {
    // Runtime flags
    gomaxprocs := flag.Int("gomaxprocs", 0, "max processors (0 = auto-detect)")
    duration := flag.Duration("duration", 10*time.Second, "run duration")
    workload := flag.String("workload", "mixed", "cpu|io|mixed")
    flag.Parse()

    // automaxprocs has already applied any container quota;
    // an explicit flag overrides it
    if *gomaxprocs > 0 {
        runtime.GOMAXPROCS(*gomaxprocs)
    }
    fmt.Printf("Running with GOMAXPROCS=%d for %v\n",
        runtime.GOMAXPROCS(-1), *duration)

    start := time.Now()
    var wg sync.WaitGroup

    switch *workload {
    case "cpu":
        for i := 0; i < runtime.GOMAXPROCS(-1); i++ {
            wg.Add(1)
            go cpuWork(&wg, start, *duration)
        }
    case "io":
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go ioWork(&wg, start, *duration)
        }
    case "mixed":
        for i := 0; i < 50; i++ {
            wg.Add(1)
            go cpuWork(&wg, start, *duration)
            wg.Add(1)
            go ioWork(&wg, start, *duration)
        }
    }
    wg.Wait()
    fmt.Println("Complete")
}

func cpuWork(wg *sync.WaitGroup, start time.Time, duration time.Duration) {
    defer wg.Done()
    for time.Since(start) < duration {
        sum := int64(0)
        for i := 0; i < 1_000_000; i++ {
            sum += int64(i % 13)
        }
        _ = sum
    }
}

func ioWork(wg *sync.WaitGroup, start time.Time, duration time.Duration) {
    defer wg.Done()
    for time.Since(start) < duration {
        time.Sleep(10 * time.Millisecond)
    }
}
Usage:
# Auto-detect CPU count
./app -workload cpu
# Use explicit GOMAXPROCS
./app -gomaxprocs 2 -workload cpu
# Mixed workload with auto-detection
./app -workload mixed
Summary
GOMAXPROCS is a critical tuning parameter:
- Default (NumCPU) works for dedicated servers
- Automaxprocs library recommended for Kubernetes
- CPU-bound workloads scale nearly linearly, but only up to the physical core count
- I/O-bound workloads are GOMAXPROCS-insensitive
- Preemption (since Go 1.14) prevents starvation
- Scheduler tracing reveals bottlenecks via GODEBUG
Master GOMAXPROCS to unlock performance in containerized, resource-constrained, and heterogeneous environments.