
Go Scheduler Internals - The GMP Model

Deep dive into Go's scheduler architecture, the GMP model (Goroutine, Machine, Processor), work stealing, preemption, and how understanding these internals leads to better concurrent code.


The Go scheduler is one of the most elegant pieces of engineering in the language runtime. Understanding how goroutines are actually executed on OS threads reveals why Go's concurrency model is so effective and helps you write faster, more predictable concurrent code.

The GMP Model: Understanding the Three Players

Go's scheduler uses a model called GMP, which consists of three key entities:

  • G (Goroutine): A lightweight user-space thread managed by the Go runtime
  • M (Machine): An OS-level thread managed by the kernel
  • P (Processor): A logical processor that holds context needed to execute a goroutine

The relationship between these three is fundamental to how Go scheduling works:

┌──────────────────────────────────────────────────────────────┐
│                     Go Runtime Scheduler                      │
├──────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐                       │
│  │    P    │  │    P    │  │    P    │  ← Logical Processors │
│  │ (mcache)│  │ (mcache)│  │ (mcache)│    (count = GOMAXPROCS)│
│  │  ┌───┐  │  │  ┌───┐  │  │  ┌───┐  │                       │
│  │  │LRQ│  │  │  │LRQ│  │  │  │LRQ│  │  ← Local Run Queue    │
│  │  └───┘  │  │  └───┘  │  │  └───┘  │    (256 slots each)   │
│  └────┬────┘  └────┬────┘  └────┬────┘                       │
│       │            │            │                             │
│  ┌────▼──────────────────────────▼────┐                      │
│  │         Global Run Queue (GRQ)      │                      │
│  │      (mutex-protected, unbounded)   │                      │
│  └───────────────────────────────────┘                       │
│       ▲                                                        │
│       │                                                        │
│  ┌────┴───┐  ┌────────┐  ┌────────┐                          │
│  │   M    │  │   M    │  │   M    │  ← OS Threads           │
│  │ (bound)│  │(blocked)│  │  (idle)│                          │
│  └────────┘  └────────┘  └────────┘                          │
│                                                                │
└──────────────────────────────────────────────────────────────┘

The key insight: The number of P's equals GOMAXPROCS, which is the true parallelism limit. You can have thousands of goroutines, but only GOMAXPROCS can execute simultaneously.

The Data Structures: Inside G, M, and P

The Goroutine Structure (runtime.g)

Each goroutine has its own stack and metadata:

type g struct {
    // Stack boundaries
    stack       stack
    stackguard0 uintptr
    stackguard1 uintptr

    // Scheduling state
    sched       gobuf
    status      uint32      // _Gidle, _Grunnable, _Grunning, etc.
    goid        uint64      // unique goroutine ID

    // Memory and pointers
    param       unsafe.Pointer  // parameter passed to g

    // Panic/recover
    _panic      *_panic
    _defer      *_defer

    // Miscellaneous
    m           *m          // current m (if executing)
    waitreason  waitReason  // why goroutine is waiting
    waitlink    *g          // linked list of waiting goroutines
}

type gobuf struct {
    sp   uintptr  // stack pointer
    pc   uintptr  // program counter (where to resume)
    g    *g       // associated goroutine
    ctxt unsafe.Pointer
    ret  uintptr  // return value
    lr   uintptr  // link register (architecture-specific)
    bp   uintptr  // base pointer (for unwinding)
}

The Machine Structure (runtime.m)

An M represents a real OS thread:

type m struct {
    // Goroutine being executed
    curg        *g      // currently executing goroutine
    p           *p      // associated processor
    spinning    bool    // is this m spinning looking for work?

    // When blocked on syscall
    syscallsp   uintptr
    syscallpc   uintptr
    syscallg    *g      // goroutine that made syscall

    // Platform-specific
    id          int64   // unique thread ID

    // Memory
    mcache      *mcache // memory allocator cache (shared with the bound P)

    // Lock
    locks       int32   // count of locks held
}

The Processor Structure (runtime.p)

The P is the scheduling unit that holds per-processor state:

type p struct {
    id          int32

    // Scheduling
    runq        [256]guintptr    // local run queue (FIFO)
    runqhead    uint32
    runqtail    uint32

    // M tracking
    m           *m      // currently bound m

    // Memory
    mcache      *mcache // per-P memory allocator cache (faster than per-M)

    // Next-run slot: a G placed here runs before anything in runq
    runnext     guintptr // bypasses the queue for locality (e.g. a G woken by a channel send)

    // Blocking
    timers      []*timer
    timerModifiedEarliest atomic.Int64
}

Goroutine States: The State Machine

A goroutine transitions through several states during its lifetime:

┌──────────┐
│ _Gidle   │ ← newly allocated, not yet scheduled
└────┬─────┘
     │ runtime.newproc()

┌──────────┐
│_Grunnable│ ← ready to run, waiting in queue
└────┬─────┘
     │ scheduler picks it up

┌──────────┐
│ _Grunning│ ← executing on M
└────┬─────┘
     │ (various events)
     ├─────────────────┬──────────────────┬──────────────────┐
     │                 │                  │                  │
     ▼                 ▼                  ▼                  ▼
┌───────────┐     ┌───────────┐     ┌───────────┐     ┌───────────┐
│ _Gwaiting │     │ _Gsyscall │     │_Gpreempted│     │ _Gscan*   │
│(chan, GC, │     │ (blocking │     │  (async   │     │  (stack   │
│ sleep,    │     │  syscall) │     │  preempt) │     │   scan)   │
│ select)   │     │           │     │           │     │           │
└─────┬─────┘     └─────┬─────┘     └─────┬─────┘     └─────┬─────┘
      │ event done      │ returns         │ resume          │ done
      │                 │                 │                 │
      └─────────────────┴─────────────────┴─────────────────┘


                  ┌──────────┐
                  │_Grunnable│ ← (back to queue)
                  └────┬─────┘
                       │ (eventually)

                  ┌──────────┐
                  │ _Gdead   │ ← finished
                  └──────────┘

The Scheduling Loop: How Goroutines Actually Run

The core of the scheduler is the scheduling loop. Here's how it works:

┌─────────────────────────────────────────┐
│      func schedule() (proc.go)           │
└──────────────┬──────────────────────────┘


    ┌─────────────────────────┐
    │ Check for GC work       │
    │ (every 61 schedule ops) │
    └────────┬────────────────┘


    ┌─────────────────────────┐
    │ findRunnable():         │
    │   find the next G       │
    └────────┬────────────────┘

             ├──→ Check P's local run queue (LRQ)

             ├──→ Check global run queue (GRQ)

             ├──→ Poll network (check ready network connections)

             ├──→ Try work stealing from other P's LRQ

             └──→ If nothing, park M


    ┌─────────────────────────┐
    │ execute(G)              │
    │ - switch to G's stack   │
    │ - restore G's registers │
    │ - run G's code          │
    └────────┬────────────────┘
             │ (G yields or blocks)


    ┌─────────────────────────┐
    │ Back to schedule()       │
    └─────────────────────────┘

Local Run Queue (LRQ) vs Global Run Queue (GRQ)

The two-level queue structure is crucial for performance:

Local Run Queue (LRQ)

  • Per-processor, 256-slot FIFO queue
  • Lock-free (owned by single M+P pair)
  • Ultra-fast, no synchronization needed
  • When full, half of entries are moved to GRQ

Global Run Queue (GRQ)

  • Shared by all processors
  • Mutex-protected
  • Unbounded size
  • Used for fairness and load balancing

// Simplified scheduling loop (names follow runtime/proc.go, heavily abridged)
func schedule() {
    gp := findRunnable() // blocks until a goroutine is available

    execute(gp) // run it; schedule() is re-entered when gp stops
}

func findRunnable() *g {
    pp := getg().m.p.ptr()

    // Step 1: every 61st tick, check the global queue first (fairness)
    if pp.schedtick%61 == 0 && sched.runqsize > 0 {
        lock(&sched.lock)
        gp := globrunqget(pp, 1)
        unlock(&sched.lock)
        if gp != nil {
            return gp
        }
    }

    // Step 2: local run queue (the lock-free fast path)
    if gp, _ := runqget(pp); gp != nil {
        return gp
    }

    // Step 3: poll the network for goroutines with ready I/O
    if netpollinited() && sched.lastpoll != 0 {
        if list := netpoll(0); !list.empty() {
            return list.pop()
        }
    }

    // Step 4: work stealing -- take half of another P's local queue
    for i := 0; i < 4; i++ { // several sweeps over all P's
        for _, p2 := range allp {
            if p2 == pp {
                continue
            }
            if gp := runqsteal(pp, p2, true); gp != nil {
                return gp
            }
        }
    }

    return nil // no work anywhere: park this M
}

Work Stealing: The Load Balancing Mechanism

When a P's local run queue is empty, its M doesn't block. Instead, it escalates:

  1. GRQ check: look at the global run queue
  2. Network poll: check for goroutines with ready network I/O
  3. Steal attempt: try to steal half of another P's local run queue
  4. Park: if there is truly no work, park the M and put the P in the idle pool

This prevents a P with a full queue from hogging work while another P is idle:

P1's LRQ: [G1, G2, G3, G4]    P2's LRQ: [empty]
           ▲                             ▲
           │                             │
           └─────── steal ───────────────┘
                    [G3, G4]

Result:
P1's LRQ: [G1, G2]           P2's LRQ: [G3, G4]

GOMAXPROCS: The Parallelism Limit

runtime.GOMAXPROCS(n) sets the number of P's, which directly controls how many goroutines can run in parallel:

package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

func main() {
    start := time.Now()

    // Single P (serialized execution)
    runtime.GOMAXPROCS(1)
    parallelWork(1)
    fmt.Printf("1P took: %v\n", time.Since(start))

    start = time.Now()
    // 4 P's (parallel execution)
    runtime.GOMAXPROCS(4)
    parallelWork(4)
    fmt.Printf("4P took: %v\n", time.Since(start))
}

func parallelWork(n int) {
    var wg sync.WaitGroup

    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // CPU-bound work
            sum := 0
            for j := 0; j < 1e8; j++ {
                sum += j
            }
            _ = sum // keep the result live
        }()
    }

    wg.Wait()
}

Note: Increasing GOMAXPROCS above the number of physical cores provides no speedup for CPU-bound work. For I/O-bound work, more goroutines can yield higher throughput because one P can schedule other goroutines while the current one blocks on I/O.

Syscall Handoff: Keeping Processors Busy

When a goroutine enters a blocking syscall:

Before syscall:         During syscall:                After syscall:

  M1 ─ P1 ─ G1          M1 ─ G1  (blocked in the       M1 parks or is reused;
                        kernel; no P attached)         G1 goes back to a run
                                                       queue as _Grunnable
                        M2 ─ P1 ─ G2  (P1 handed
                        off so other work runs)

Here's the sequence:

  1. Goroutine calls a blocking syscall
  2. M that was running it tries to execute the syscall
  3. Runtime detects it's a true blocking call
  4. P is unbound from M and assigned to a different M (or new M is created)
  5. That M picks another goroutine from P's run queue
  6. Original M blocks in the OS (kernel)
  7. When syscall returns, the original G is placed back in a run queue

// Simplified syscall entry/exit (the real logic lives in
// runtime/proc.go: reentersyscall / exitsyscall)
func entersyscall(pc, sp uintptr) {
    mp := getg().m

    // Save resume state so tracebacks and the GC can see it
    mp.syscallpc = pc
    mp.syscallsp = sp

    // Release the P. For calls known to block, handoffp() gives it to
    // another M immediately; otherwise sysmon retakes it if the
    // syscall runs long.
    handoffp(mp.p.ptr())

    // This M is now free to block in the kernel
}

func exitsyscall() {
    // Fast path: try to reacquire the P this M had before the syscall
    if exitsyscallfast() {
        return // keep running the current goroutine on this M
    }

    // Slow path: no P available. The goroutine goes to the global
    // run queue and this M parks until a P frees up.
    mcall(exitsyscall0)
}

Network Poller Integration: Non-Blocking I/O

Go doesn't have one thread per goroutine. Instead, goroutines block on I/O while M threads remain available. This is accomplished with the netpoller:

Network Poll Cycle:

┌────────────────────────────────┐
│ Network event arrives (socket) │
│ (e.g., data available to read) │
└────────────┬───────────────────┘


┌─────────────────────────────────────┐
│ OS kernel wakes netpoller           │
│ (via epoll/kqueue/IOCP)             │
└────────────┬───────────────────────┘


┌─────────────────────────────────────┐
│ Runtime.netpoll() called by          │
│ scheduler (during findRunnable())    │
└────────────┬───────────────────────┘


┌─────────────────────────────────────┐
│ Retrieve ready goroutines from      │
│ netpoller's list                    │
└────────────┬───────────────────────┘


┌─────────────────────────────────────┐
│ Move goroutines back to LRQ or      │
│ GRQ for execution                   │
└────────────┬───────────────────────┘


┌──────────────────────────────────┐
│ Goroutine runs, continues        │
│ reading from socket              │
└──────────────────────────────────┘

This is why Go can handle millions of concurrent connections with relatively few OS threads.

Preemption: Yielding the Processor

Cooperative Preemption (Pre-1.14)

In function prologues, the runtime inserted a check:

; amd64 function prologue (simplified): the stack-growth check
MOVQ (TLS), CX          ; load the current g
CMPQ SP, 16(CX)         ; compare SP against g.stackguard0
JLS  runtime.morestack  ; grow the stack -- or honor a preemption request

This checked if the stack pointer crossed a guard, triggering preemption. However, if a goroutine did tight loops without function calls, it wouldn't preempt:

// This won't yield in Go < 1.14
func tightLoop() {
    for i := 0; i < 1e9; i++ {
        // do work
    }
}
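The pre-1.14 workaround was to insert explicit yield points with runtime.Gosched(). A sketch (tightLoopPolite is our name, not a runtime API):

```go
package main

import (
	"fmt"
	"runtime"
)

// tightLoopPolite inserts explicit yield points so the goroutine can be
// rescheduled even under cooperative-only (pre-1.14) preemption.
func tightLoopPolite() int {
	sum := 0
	for i := 0; i < 1e6; i++ {
		sum += i
		if i%10000 == 0 {
			runtime.Gosched() // hand the P back to the scheduler
		}
	}
	return sum
}

func main() {
	fmt.Println(tightLoopPolite()) // 499999500000
}
```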

Asynchronous Preemption (Go 1.14+)

The runtime now sends SIGURG signals to OS threads running goroutines that have run too long:

// In sysmon's retake() loop: if a goroutine has held its P for more
// than forcePreemptNS (10ms), request preemption
if pd.schedwhen+forcePreemptNS <= now {
    preemptone(pp) // marks the G and signals its M
}

// preemptone (simplified): set both the cooperative flag and the
// stack-check trap, then send SIGURG for asynchronous preemption
func preemptone(pp *p) bool {
    gp := pp.m.ptr().curg
    gp.preempt = true
    gp.stackguard0 = stackPreempt // make the next stack check fail
    preemptM(pp.m.ptr())          // deliver SIGURG -> asyncPreempt
    return true
}

This ensures fairness: no goroutine can monopolize a P.

Fairness: Preventing Starvation

Every 61st scheduling tick (tracked by schedtick), the scheduler checks the global run queue before the local queue:

func findRunnable() *g {
    if pp.schedtick%61 == 0 && sched.runqsize > 0 {
        lock(&sched.lock)
        gp := globrunqget(pp, 1)
        unlock(&sched.lock)
        if gp != nil {
            return gp
        }
    }
    // ... rest of logic
}

This ensures that if many goroutines are being created and added to the GRQ, older goroutines won't starve waiting in the GRQ while newer ones added to the LRQ keep running.

LockOSThread: Binding to a Machine

Sometimes you need a goroutine to always run on the same OS thread. This is necessary for:

  • OpenGL contexts (thread-local state)
  • Thread-local storage
  • Certain C FFI calls with thread assumptions

package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    var wg sync.WaitGroup

    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()

            runtime.LockOSThread()
            defer runtime.UnlockOSThread()

            fmt.Printf("Goroutine %d on thread %d\n", id, getThreadID())

        }(i)
    }

    wg.Wait()
}

func getThreadID() uint64 {
    // Platform-specific way to get thread ID
    // On Linux: syscall.Syscall(syscall.SYS_GETTID, 0, 0, 0)
    return 0
}

Note: LockOSThread() reduces scheduler flexibility. Use sparingly.
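On Linux you can confirm the binding with gettid(2), exposed as syscall.Gettid. A Linux-only sketch (sameThread is our name):

```go
//go:build linux

package main

import (
	"fmt"
	"runtime"
	"syscall"
)

// sameThread locks the calling goroutine to its M, yields, and checks
// that it is still on the same kernel thread (gettid) afterwards.
func sameThread() bool {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	first := syscall.Gettid()
	runtime.Gosched() // yield the P; a locked G must return to this same M
	return syscall.Gettid() == first
}

func main() {
	fmt.Println("same OS thread:", sameThread()) // always true while locked
}
```

Without the LockOSThread/UnlockOSThread pair, the runtime is free to migrate the goroutine to a different M between the two Gettid calls.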

Debugging Scheduler Behavior

GODEBUG=schedtrace

Enable scheduler tracing to see what the scheduler is doing:

GODEBUG=schedtrace=1000 ./program

Output shows every 1000ms:

SCHED 0ms: gomaxprocs=4 idleprocs=2 threads=8 spinningthreads=0 needspinning=0 idlethreads=1 runqueue=0 [0 0 15 0]

  • gomaxprocs=4: 4 processors (P's)
  • idleprocs=2: 2 idle processors (not running any goroutine)
  • threads=8: 8 OS threads created
  • spinningthreads=0: 0 threads actively looking for work
  • runqueue=0: global run queue (GRQ) length
  • [0 0 15 0]: local run queue (LRQ) length per processor

GODEBUG=scheddetail

For even more detail:

GODEBUG=scheddetail=1,schedtrace=1000 ./program

Outputs every goroutine state change.

Performance Implications: Tuning for Speed

Context Switch Cost

Each context switch (an M switching from one G to another) costs on the order of 200 nanoseconds. This includes:

  • Saving the outgoing goroutine's registers (sp, pc, bp)
  • Loading the incoming goroutine's state
  • No kernel transition and no TLB flush: the switch stays entirely in user space, which is why it is orders of magnitude cheaper than an OS thread switch

Goroutine Creation Cost

Creating a new goroutine costs roughly 2-3 microseconds, dominated by allocating its initial stack:

// Benchmark: goroutine creation (save as sched_test.go)
package sched

import (
    "sync"
    "testing"
)

func BenchmarkGoroutineCreation(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        var wg sync.WaitGroup
        wg.Add(1)
        go func() {
            defer wg.Done()
        }()
        wg.Wait()
    }
}

// Output: ~3µs per goroutine creation + destruction

OS Thread Creation Cost

For comparison, creating an OS thread costs 10-50 microseconds — 5-25x slower than creating a goroutine.

Scheduling Contention

Too many goroutines can cause lock contention on the global run queue:

// Bad: tens of millions of short-lived goroutines
func badPattern() {
    for i := 0; i < 1e7; i++ {
        go quickTask(i)  // per-task scheduling and stack-allocation overhead
    }
}

// Better: a fixed worker pool consuming from a channel
type Task struct{ ID int }

func goodPattern() {
    const workers = 10000
    taskChan := make(chan Task, 100)

    for i := 0; i < workers; i++ {
        go worker(taskChan)
    }

    for i := 0; i < 1e7; i++ {
        taskChan <- Task{ID: i}
    }
    close(taskChan) // let workers drain and exit
}
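The worker-pool sketch above leaves quickTask, worker, and shutdown undefined. A self-contained, runnable version at small scale (runPool and its parameters are ours):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runPool processes nTasks integers with a fixed pool of workers, so the
// goroutine count stays bounded regardless of how many tasks arrive.
func runPool(workers, nTasks int) int64 {
	var sum int64
	var wg sync.WaitGroup
	tasks := make(chan int, 100) // buffered: the producer rarely blocks

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range tasks { // exits when the channel is closed
				atomic.AddInt64(&sum, int64(t))
			}
		}()
	}

	for i := 0; i < nTasks; i++ {
		tasks <- i
	}
	close(tasks) // signal workers that no more tasks are coming
	wg.Wait()
	return sum
}

func main() {
	fmt.Println(runPool(8, 1000)) // sum of 0..999 = 499500
}
```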

Recommendations for Writing Faster Code

  1. Match GOMAXPROCS to core count: For CPU-bound work, keep GOMAXPROCS near your CPU core count
  2. Batch work: Avoid creating millions of short-lived goroutines; batch into worker pools
  3. Minimize allocation during hot paths: The scheduler is fast, but allocator contention still matters
  4. Use buffered channels: Avoid blocking on unbuffered channel operations in tight loops
  5. Be mindful of lock contention: The scheduler has fast paths, but the GRQ is locked
  6. Profile before optimizing: Use pprof to identify actual scheduling bottlenecks

Summary

The GMP scheduler is a masterpiece of systems engineering. By understanding:

  • How G, M, and P interact
  • The two-level run queue architecture
  • Work stealing and fairness mechanisms
  • Syscall handoff and network polling
  • Preemption and context switch costs

You can write Go code that plays well with the scheduler, resulting in faster, more scalable applications. The key is understanding that the scheduler is not magic — it's a well-designed state machine that responds predictably to the patterns in your code.
