Go Scheduler Internals - The GMP Model
Deep dive into Go's scheduler architecture, the GMP model (Goroutine, Machine, Processor), work stealing, preemption, and how understanding these internals leads to better concurrent code.
Go Scheduler Internals: The GMP Model
The Go scheduler is one of the most elegant pieces of engineering in the language runtime. Understanding how goroutines are actually executed on OS threads reveals why Go's concurrency model is so effective and helps you write faster, more predictable concurrent code.
The GMP Model: Understanding the Three Players
Go's scheduler uses a model called GMP, which consists of three key entities:
- G (Goroutine): A lightweight user-space thread managed by the Go runtime
- M (Machine): An OS-level thread managed by the kernel
- P (Processor): A logical processor that holds context needed to execute a goroutine
The relationship between these three is fundamental to how Go scheduling works:
┌──────────────────────────────────────────────────────────────┐
│ Go Runtime Scheduler │
├──────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ P │ │ P │ │ P │ ← Logical Processors │
│ │ (mcache)│ │ (mcache)│ │ (mcache)│ (count = GOMAXPROCS)│
│ │ ┌───┐ │ │ ┌───┐ │ │ ┌───┐ │ │
│ │ │LRQ│ │ │ │LRQ│ │ │ │LRQ│ │ ← Local Run Queue │
│ │ └───┘ │ │ └───┘ │ │ └───┘ │ (256 slots each) │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ ┌────▼──────────────────────────▼────┐ │
│ │ Global Run Queue (GRQ) │ │
│ │ (mutex-protected, unbounded) │ │
│ └───────────────────────────────────┘ │
│ ▲ │
│ │ │
│ ┌────┴───┐ ┌────────┐ ┌────────┐ │
│ │ M │ │ M │ │ M │ ← OS Threads │
│ │ (bound)│ │(blocked)│ │ (idle)│ │
│ └────────┘ └────────┘ └────────┘ │
│ │
│                                                              │
└──────────────────────────────────────────────────────────────┘

The key insight: the number of P's equals GOMAXPROCS, which is the true parallelism limit. You can have thousands of goroutines, but only GOMAXPROCS of them can execute simultaneously.
The Data Structures: Inside G, M, and P
The Goroutine Structure (runtime.g)
Each goroutine has its own stack and metadata:
type g struct {
// Stack boundaries
stack stack
stackguard0 uintptr
stackguard1 uintptr
// Scheduling state
sched gobuf
status uint32 // _Gidle, _Grunnable, _Grunning, etc.
goid uint64 // unique goroutine ID
// Memory and pointers
param unsafe.Pointer // parameter passed to g
// Panic/recover
_panic *_panic
_defer *_defer
// Miscellaneous
m *m // current m (if executing)
waitreason waitReason // why goroutine is waiting
waitlink *g // linked list of waiting goroutines
}
type gobuf struct {
sp uintptr // stack pointer
pc uintptr // program counter (where to resume)
g *g // associated goroutine
ctxt unsafe.Pointer
ret uintptr // return value
lr uintptr // link register (architecture-specific)
bp uintptr // base pointer (for unwinding)
}

The Machine Structure (runtime.m)
An M represents a real OS thread:
type m struct {
// Goroutine being executed
curg *g // currently executing goroutine
p *p // associated processor
spinning bool // is this m spinning looking for work?
// When blocked on syscall
syscallsp uintptr
syscallpc uintptr
syscallg *g // goroutine that made syscall
// Platform-specific
id int64 // unique thread ID
// Memory
mallocing int32 // non-zero while this M is allocating (guards against recursion)
// Lock
locks int32 // count of locks held
}

The Processor Structure (runtime.p)
The P is the scheduling unit that holds per-processor state:
type p struct {
id int32
// Scheduling
runq [256]guintptr // local run queue (FIFO)
runqhead uint32
runqtail uint32
// M tracking
m *m // currently bound m
// Memory
mcache *mcache // per-P memory allocator cache (no locking needed)
// Work stealing hint
runnext guintptr // next runnable goroutine (prioritized)
// Blocking
timers []*timer
timerModifiedEarliest atomic.Int64
}

Goroutine States: The State Machine
A goroutine transitions through several states during its lifetime:
┌──────────┐
│ _Gidle │ ← newly allocated, not yet scheduled
└────┬─────┘
│ runtime.newproc()
▼
┌──────────┐
│_Grunnable│ ← ready to run, waiting in queue
└────┬─────┘
│ scheduler picks it up
▼
┌──────────┐
│ _Grunning│ ← executing on M
└────┬─────┘
│ (various events)
├─────────────────┬──────────────────┬──────────────────┐
│ │ │ │
▼ ▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ _Gwaiting │ │ _Gsyscall │ │_Gpreempted│ │ _Gscan* │
│(chan, GC, │ │ (syscall) │ │ (async │ │ (GC scan) │
│ select, │ │ │ │ preempt) │ │ │
│ sleep) │ │ │ │ │ │ │
└────┬──────┘ └────┬──────┘ └────┬──────┘ └────┬──────┘
│ event done │ returns │ resumed │ done
│ │ │ │
└──────────────────┴──────────────────┴─────────────────┘
│
▼
┌──────────┐
│ _Grunnable│ ← (back to queue)
└────┬─────┘
│ (eventually)
▼
┌──────────┐
│ _Gdead │ ← finished
└──────────┘

The Scheduling Loop: How Goroutines Actually Run
The core of the scheduler is the scheduling loop. Here's how it works:
┌─────────────────────────────────────────┐
│ func schedule() (runs on every M) │
└──────────────┬──────────────────────────┘
│
▼
┌─────────────────────────┐
│ Run GC worker or trace │
│ tasks, if pending │
└────────┬────────────────┘
│
▼
┌─────────────────────────┐
│ findRunnable() ← find │
│ next G │
└────────┬────────────────┘
│
├──→ Check P's local run queue (LRQ)
│
├──→ Check global run queue (GRQ); every 61st tick it is checked first, for fairness
│
├──→ Poll network (check ready network connections)
│
├──→ Try work stealing from other P's LRQ
│
└──→ If nothing, park M
│
▼
┌─────────────────────────┐
│ execute(G) │
│ - switch to G's stack │
│ - restore G's registers │
│ - run G's code │
└────────┬────────────────┘
│ (G yields or blocks)
│
▼
┌─────────────────────────┐
│ Back to schedule() │
└─────────────────────────┘

Local Run Queue (LRQ) vs Global Run Queue (GRQ)
The two-level queue structure is crucial for performance:
Local Run Queue (LRQ)
- Per-processor, 256-slot FIFO queue
- Lock-free (owned by single M+P pair)
- Ultra-fast, no synchronization needed
- When full, half of entries are moved to GRQ
Global Run Queue (GRQ)
- Shared by all processors
- Mutex-protected
- Unbounded size
- Used for fairness and load balancing
// Simplified scheduling loop (pseudocode; the real code lives in
// runtime/proc.go and is far more involved)
func schedule() {
	mp := getg().m
	pp := mp.p.ptr()
	gp, inheritTime := findRunnable(pp) // blocks until work is found
	execute(gp, inheritTime)            // does not return
}

func findRunnable(pp *p) (*g, bool) {
	// Step 0: every 61st tick, check the global queue first (fairness)
	if pp.schedtick%61 == 0 && sched.runqsize > 0 {
		lock(&sched.lock)
		gp := globrunqget(pp, 1)
		unlock(&sched.lock)
		if gp != nil {
			return gp, false
		}
	}
	// Step 1: check the local run queue (lock-free fast path)
	if gp, inheritTime := runqget(pp); gp != nil {
		return gp, inheritTime
	}
	// Step 2: check the global run queue
	if sched.runqsize != 0 {
		lock(&sched.lock)
		gp := globrunqget(pp, 0)
		unlock(&sched.lock)
		if gp != nil {
			return gp, false
		}
	}
	// Step 3: poll the network for goroutines made runnable by I/O
	if netpollinited() && sched.lastpoll != 0 {
		if list := netpoll(0); !list.empty() {
			return list.pop(), false
		}
	}
	// Step 4: work stealing: try to take half of another P's queue
	for i := 0; i < 4; i++ { // up to four sweeps over all P's
		for _, p2 := range allp {
			if p2 == pp {
				continue
			}
			if gp := runqsteal(pp, p2, true); gp != nil {
				return gp, false
			}
		}
	}
	// Step 5: no work anywhere: park this M; when it is woken,
	// the real code loops back to the top
	stopm()
	return nil, false
}

Work Stealing: The Load Balancing Mechanism
When a P's local run queue is empty, it doesn't block. Instead:
- GRQ check: Look at the global run queue
- Network poll: Check for ready network connections
- Steal attempt: Try to steal half of another P's run queue
- Park: If there is truly no work, park the M and put the P in the idle pool
This prevents a P with a full queue from hogging work while another P is idle:
P1's LRQ: [G1, G2, G3, G4] P2's LRQ: [empty]
▲ ▲
│ │
└─────── steal ───────────────┘
[G3, G4]
Result:
P1's LRQ: [G1, G2] P2's LRQ: [G3, G4]

GOMAXPROCS: The Parallelism Limit
runtime.GOMAXPROCS(n) sets the number of P's, which directly controls how many goroutines can run in parallel:
package main
import (
"fmt"
"runtime"
"sync"
"time"
)
func main() {
start := time.Now()
// Single P (serialized execution)
runtime.GOMAXPROCS(1)
parallelWork(1)
fmt.Printf("1P took: %v\n", time.Since(start))
start = time.Now()
// 4 P's (parallel execution)
runtime.GOMAXPROCS(4)
parallelWork(4)
fmt.Printf("4P took: %v\n", time.Since(start))
}
func parallelWork(goroutines int) {
	var wg sync.WaitGroup
	for i := 0; i < goroutines; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// CPU-bound work
			sum := 0
			for j := 0; j < 1e8; j++ {
				sum += j
			}
			_ = sum
		}()
	}
	wg.Wait()
}

Note: Increasing GOMAXPROCS above the number of physical cores provides no speedup for CPU-bound work. For I/O-bound work, more goroutines can still raise throughput, because a P is free to run other goroutines while any one of them is blocked on I/O.
Syscall Handoff: Keeping Processors Busy
When a goroutine enters a blocking syscall:
Before syscall:          During syscall:              After syscall:

  M1                       M1 ── G1                     G1 is placed back on
  │                        (both blocked                a run queue (_Grunnable)
  P1                        in the kernel)
  │                                                     P1 stays bound to M2,
  G1                       M2 ── P1 ── G2               which keeps running
                           (handoff: P1 moves           other goroutines
                            to M2, which runs G2)

Here's the sequence:
- Goroutine calls a blocking syscall
- M that was running it tries to execute the syscall
- Runtime detects it's a true blocking call
- P is unbound from M and assigned to a different M (or new M is created)
- That M picks another goroutine from P's run queue
- Original M blocks in the OS (kernel)
- When syscall returns, the original G is placed back in a run queue
// Simplified syscall handling (pseudocode)
func entersyscall(pc, sp uintptr) {
	mp := getg().m
	// Save state so the goroutine can resume after the syscall
	mp.syscallpc = pc
	mp.syscallsp = sp
	// Release the P so another M can use it while this M blocks.
	// (In the real runtime the P is only handed off if the syscall
	// actually blocks for a while; sysmon retakes it.)
	handoffp(mp.p.ptr())
	// This M is now free to block in the kernel
}

func exitsyscall() {
	mp := getg().m
	// Fast path: try to reacquire the P we had before the syscall
	if pp := mp.oldp.ptr(); pp != nil && pp.m == 0 {
		acquirep(pp)
		return
	}
	// Slow path: no P available. Put this G on the global run
	// queue and park the M until a P frees up.
	globrunqput(getg())
	stopm()
}

Network Poller Integration: Non-Blocking I/O
Go doesn't dedicate one OS thread to each goroutine. Instead, a goroutine that blocks on network I/O is parked by the runtime while its M remains available to run other goroutines. This is accomplished with the netpoller:
Network Poll Cycle:
┌────────────────────────────────┐
│ Network event arrives (socket) │
│ (e.g., data available to read) │
└────────────┬───────────────────┘
│
▼
┌─────────────────────────────────────┐
│ OS kernel wakes netpoller │
│ (via epoll / kqueue / IOCP) │
└────────────┬───────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Runtime.netpoll() called by │
│ scheduler (during findRunnable()) │
└────────────┬───────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Retrieve ready goroutines from │
│ netpoller's list │
└────────────┬───────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Move goroutines back to LRQ or │
│ GRQ for execution │
└────────────┬───────────────────────┘
│
▼
┌──────────────────────────────────┐
│ Goroutine runs, continues │
│ reading from socket │
└──────────────────────────────────┘

This is why Go can handle millions of concurrent connections with relatively few OS threads.
Preemption: Yielding the Processor
Cooperative Preemption (Pre-1.14)
The compiler inserts a check in every function prologue:
; at the start of every function (stack-growth prologue)
CMP SP, [g + stackguard0]   ; has the stack pointer crossed the guard?
JBE morestack               ; grow the stack, or preempt

The check compares the stack pointer against a per-goroutine guard value. To request preemption, the runtime poisons stackguard0 (sets it to stackPreempt) so the next function call trips the check. However, a goroutine in a tight loop that makes no function calls never executes the check, so it could not be preempted:
// This won't yield in Go < 1.14
func tightLoop() {
for i := 0; i < 1e9; i++ {
// do work
}
}

Asynchronous Preemption (Go 1.14+)
The runtime now sends SIGURG signals to OS threads running goroutines that have run too long:
// In sysmon's retake loop: a goroutine that has been running for
// more than forcePreemptNS (10ms) gets a preemption request
if pd.schedwhen+forcePreemptNS <= now {
	preemptone(pp) // sets gp.preempt and sends SIGURG to the M
}

// In the SIGURG handler, the runtime injects a call to asyncPreempt
// at the interrupted instruction (if it is a safe point); asyncPreempt
// saves all registers, parks the goroutine, and re-enters the scheduler.

This ensures fairness: no goroutine can monopolize a P.
Fairness: Preventing Starvation
Every 61st schedule operation (counted by schedtick), the scheduler checks the global run queue before the local queue:
func findRunnable() *g {
if pp.schedtick%61 == 0 && sched.runqsize > 0 {
lock(&sched.lock)
gp := globrunqget(pp, 1)
unlock(&sched.lock)
if gp != nil {
return gp
}
}
// ... rest of logic
}

This ensures that goroutines waiting in the GRQ don't starve while newly created goroutines on the LRQ keep running.
LockOSThread: Binding to a Machine
Sometimes you need a goroutine to always run on the same OS thread. This is necessary for:
- OpenGL contexts (thread-local state)
- Thread-local storage
- Certain C FFI calls with thread assumptions
package main
import (
"fmt"
"runtime"
"sync"
)
func main() {
var wg sync.WaitGroup
for i := 0; i < 4; i++ {
wg.Add(1)
go func(id int) {
defer wg.Done()
runtime.LockOSThread()
defer runtime.UnlockOSThread()
fmt.Printf("Goroutine %d on thread %d\n", id, getThreadID())
}(i)
}
wg.Wait()
}
func getThreadID() uint64 {
	// Platform-specific; on Linux: uint64(syscall.Gettid()).
	// Stubbed out here for portability.
	return 0
}

Note: LockOSThread() reduces scheduler flexibility: the locked M cannot run any other goroutine while the lock is held. Use it sparingly.
Debugging Scheduler Behavior
GODEBUG=schedtrace
Enable scheduler tracing to see what the scheduler is doing:
GODEBUG=schedtrace=1000 ./program

Output is printed every 1000ms:
SCHED 0ms: gomaxprocs=4 idleprocs=2 threads=8 spinningthreads=0 needspinning=0 idlethreads=1 runqueue=0 [0 0 15 0]

- gomaxprocs=4: 4 processors (P's)
- idleprocs=2: 2 idle processors (not running any goroutine)
- threads=8: 8 OS threads (M's) created
- spinningthreads=0: 0 threads actively looking for work
- idlethreads=1: 1 parked OS thread
- runqueue=0: global run queue is empty
- [0 0 15 0]: local run queue length per processor
GODEBUG=scheddetail
For even more detail:
GODEBUG=scheddetail=1,schedtrace=1000 ./program

This additionally dumps per-G, per-M, and per-P state at each trace interval.
Performance Implications: Tuning for Speed
Context Switch Cost
Each goroutine context switch (an M switching from one G to another) costs on the order of 100-200 nanoseconds. This includes:
- Saving the outgoing goroutine's registers (SP, PC, BP)
- Loading the incoming goroutine's state
- No kernel transition and no TLB flush: the switch happens entirely in user space, which is why it is far cheaper than an OS thread switch
Goroutine Creation Cost
Creating and destroying a goroutine costs a few microseconds at most (much of the benchmark below is sync.WaitGroup synchronization):
// Benchmark: goroutine creation
func BenchmarkGoroutineCreation(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
}()
wg.Wait()
}
}
// Output: ~3µs per goroutine creation + destruction

OS Thread Creation Cost
For comparison, creating an OS thread costs 10-50 microseconds, roughly an order of magnitude more than creating a goroutine.
Scheduling Contention
Too many goroutines can cause lock contention on the global run queue:
// Bad: ten million tiny goroutines
func badPattern() {
	for i := 0; i < 1e7; i++ {
		go quickTask(i) // scheduling overhead dwarfs the work
	}
}

// Better: a fixed worker pool fed through a channel
type Task struct{ id int }

func worker(tasks <-chan Task) {
	for t := range tasks {
		quickTask(t.id)
	}
}

func goodPattern() {
	const workers = 100 // size to the workload; often near NumCPU
	tasks := make(chan Task, 1024)
	for i := 0; i < workers; i++ {
		go worker(tasks)
	}
	for i := 0; i < 1e7; i++ {
		tasks <- Task{id: i}
	}
	close(tasks)
}

Recommendations for Writing Faster Code
- Match GOMAXPROCS to core count: For CPU-bound work, keep GOMAXPROCS near your CPU core count
- Batch work: Avoid creating millions of short-lived goroutines; batch into worker pools
- Minimize allocation during hot paths: The scheduler is fast, but allocator contention still matters
- Use buffered channels: Avoid blocking on unbuffered channel operations in tight loops
- Be mindful of lock contention: The scheduler has fast paths, but the GRQ is locked
- Profile before optimizing: Use pprof to identify actual scheduling bottlenecks
Summary
The GMP scheduler is a masterpiece of systems engineering. By understanding:
- How G, M, and P interact
- The two-level run queue architecture
- Work stealing and fairness mechanisms
- Syscall handoff and network polling
- Preemption and context switch costs
You can write Go code that plays well with the scheduler, resulting in faster, more scalable applications. The key is understanding that the scheduler is not magic — it's a well-designed state machine that responds predictably to the patterns in your code.