Go Performance Guide
Profiling & Benchmarking

Execution Tracing

Use Go's execution tracer to visualize goroutine scheduling, GC events, and I/O operations

While pprof shows where CPU time goes, execution tracing reveals when and why things happen. It captures goroutine scheduling, garbage collection events, network I/O, and synchronization points—essential for understanding concurrent performance issues.

Runtime/trace Fundamentals

The runtime/trace package provides low-level tracing capabilities:

package main

import (
    "fmt"
    "os"
    "runtime/trace"
)

func main() {
    // Start tracing
    f, err := os.Create("trace.out")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    if err := trace.Start(f); err != nil {
        panic(err)
    }
    defer trace.Stop()

    // Application runs here with tracing active
    doWork()

    fmt.Println("Trace written to trace.out")
}

func doWork() {
    // Simulate some concurrent work
    done := make(chan struct{})
    for i := 0; i < 100; i++ {
        go func() {
            defer func() { done <- struct{}{} }()
            // Simulate work
            sum := 0
            for j := 0; j < 1000000; j++ {
                sum += j
            }
        }()
    }

    // Wait for completion
    for i := 0; i < 100; i++ {
        <-done
    }
}

Run and analyze:

go run main.go
go tool trace trace.out

This opens a web-based interactive viewer in your browser.

Integration with net/http/pprof

For HTTP services, add tracing endpoints:

package main

import (
    "net/http"
    _ "net/http/pprof"
)

func main() {
    // Tracing is available at /debug/pprof/trace
    // Capture 5-second trace:
    // curl http://localhost:6060/debug/pprof/trace?seconds=5 > trace.out
    // go tool trace trace.out

    http.HandleFunc("/api/data", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("OK"))
    })

    http.ListenAndServe(":6060", nil)
}

Capture traces from running services:

curl http://localhost:6060/debug/pprof/trace?seconds=10 > trace.out
go tool trace trace.out

Understanding the Trace Viewer

The trace viewer UI displays several critical views:

Goroutine Timeline

Shows each goroutine's execution over time:

  • Green: Running on CPU
  • Orange: Blocked (network, locks, channels)
  • Gray: Runnable but waiting for CPU

Network Blocking Profile

Displays I/O operations and their duration:

package main

import (
    "io"
    "net"
    "os"
    "runtime/trace"
)

func main() {
    f, _ := os.Create("trace.out")
    defer f.Close()
    trace.Start(f)
    defer trace.Stop()

    // Network I/O appears in the trace
    conn, err := net.Dial("tcp", "example.com:80")
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Send a minimal HTTP/1.0 request so the server responds and
    // closes the connection; otherwise the read below blocks forever
    conn.Write([]byte("GET / HTTP/1.0\r\nHost: example.com\r\n\r\n"))
    io.Copy(io.Discard, conn)
}

In the trace viewer, you'll see network blocking times annotated.

GC Events

The trace captures every garbage collection:

package main

import (
    "fmt"
    "os"
    "runtime"
    "runtime/trace"
)

func main() {
    f, _ := os.Create("trace.out")
    defer f.Close()
    trace.Start(f)
    defer trace.Stop()

    // Trigger allocations to see GC in trace
    for i := 0; i < 10; i++ {
        data := make([]byte, 10000000)
        _ = data
    }

    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    fmt.Printf("GC runs: %d\n", m.NumGC)
}

Trace shows:

  • Mark phase duration
  • Sweep phase duration
  • GC pause times
  • Goroutines blocked by GC

Programmatic Task and Region Tracking

Use tasks and regions to annotate your trace:

package main

import (
    "context"
    "fmt"
    "os"
    "runtime/trace"
)

func processRequest(ctx context.Context, id int) {
    // Create a task for this logical unit of work. NewTask returns a
    // derived context that links regions created from it to the task.
    ctx, task := trace.NewTask(ctx, fmt.Sprintf("request-%d", id))
    defer task.End()

    // Create regions within the task
    trace.WithRegion(ctx, "parseInput", func() {
        // Simulate input parsing
        sum := 0
        for i := 0; i < 100000; i++ {
            sum += i
        }
    })

    trace.WithRegion(ctx, "processData", func() {
        // Simulate processing
        sum := 0
        for i := 0; i < 1000000; i++ {
            sum += i
        }
    })

    trace.WithRegion(ctx, "formatOutput", func() {
        // Simulate formatting
        _ = fmt.Sprintf("Result: %d", 42)
    })
}

func main() {
    f, _ := os.Create("trace.out")
    defer f.Close()
    trace.Start(f)
    defer trace.Stop()

    ctx := context.Background()

    // Process multiple requests to see concurrent execution
    done := make(chan struct{})
    for i := 0; i < 50; i++ {
        go func(id int) {
            defer func() { done <- struct{}{} }()
            processRequest(ctx, id)
        }(i)
    }

    // Wait for all requests to finish. A bare select {} would deadlock
    // once the goroutines complete, and the deferred trace.Stop would
    // never run, leaving the trace file truncated.
    for i := 0; i < 50; i++ {
        <-done
    }
}

In the trace viewer's front page, open the "User-defined tasks" and "User-defined regions" views to see per-task latency and region timings; in the "View trace" timeline, your custom regions appear inline on each goroutine.

Analyzing Goroutine Scheduling

Identify scheduling issues and imbalance:

package main

import (
    "os"
    "runtime/trace"
    "sync"
)

func main() {
    f, _ := os.Create("trace.out")
    defer f.Close()
    trace.Start(f)
    defer trace.Stop()

    var wg sync.WaitGroup
    ch := make(chan int, 1) // Tiny buffer forces senders and receivers to contend

    for i := 0; i < 100; i++ {
        wg.Add(2)

        go func(id int) {
            defer wg.Done()
            // Sender goroutines
            for j := 0; j < 10000; j++ {
                ch <- j
            }
        }(i)

        go func(id int) {
            defer wg.Done()
            // Receiver goroutines
            for j := 0; j < 10000; j++ {
                <-ch
            }
        }(i)
    }

    wg.Wait()
}

The trace reveals:

  • Which goroutines block waiting for channel operations
  • CPU utilization across cores
  • Synchronization bottlenecks

Detecting Goroutine Contention

Find lock contention patterns:

package main

import (
    "context"
    "os"
    "runtime/trace"
    "sync"
    "time"
)

func main() {
    f, _ := os.Create("trace.out")
    defer f.Close()
    trace.Start(f)
    defer trace.Stop()

    var mu sync.Mutex
    var wg sync.WaitGroup

    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            _, task := trace.NewTask(context.Background(), "criticalSection")
            defer task.End()

            mu.Lock()
            defer mu.Unlock()

            // Critical section
            time.Sleep(1 * time.Millisecond)
        }(i)
    }

    wg.Wait()
}

In the trace:

  • Orange sections show lock blocking
  • Concentration of orange indicates high contention
  • Look for periods where most goroutines are blocked

GC Impact Analysis

Understand how garbage collection affects performance:

package main

import (
    "os"
    "runtime"
    "runtime/trace"
)

func allocateMemory(iterations int) {
    for i := 0; i < iterations; i++ {
        // Small allocations trigger frequent GC
        data := make([]byte, 1024)
        _ = data
    }
}

func main() {
    f, _ := os.Create("trace.out")
    defer f.Close()
    trace.Start(f)
    defer trace.Stop()

    // Show GC impact on concurrent work
    done := make(chan struct{})

    go func() {
        defer func() { done <- struct{}{} }()
        allocateMemory(100000)
    }()

    go func() {
        defer func() { done <- struct{}{} }()
        // Count operations while GC runs
        count := 0
        for i := 0; i < 1000000000; i++ {
            count++
        }
    }()

    <-done
    <-done

    var m runtime.MemStats
    runtime.ReadMemStats(&m)
}

The trace shows:

  • GC pause frequency and duration
  • Impact on other goroutines during GC
  • Relationship between allocation rate and GC frequency

Practical Example: Tracing a Request Pipeline

package main

import (
    "context"
    "fmt"
    "net/http"
    "os"
    "runtime/trace"
    "sync"
    "time"
)

func processRequest(ctx context.Context, id int) string {
    ctx, task := trace.NewTask(ctx, fmt.Sprintf("request-%d", id))
    defer task.End()

    var result string

    // Stage 1: Fetch
    trace.WithRegion(ctx, "fetch", func() {
        time.Sleep(10 * time.Millisecond)
        result = fmt.Sprintf("data-%d", id)
    })

    // Stage 2: Transform
    trace.WithRegion(ctx, "transform", func() {
        time.Sleep(5 * time.Millisecond)
        result = result + "-processed"
    })

    // Stage 3: Enrich (parallel sub-tasks)
    trace.WithRegion(ctx, "enrich", func() {
        var wg sync.WaitGroup
        enrichments := make([]string, 3)

        for i := 0; i < 3; i++ {
            wg.Add(1)
            idx := i
            go func() {
                defer wg.Done()
                _, subTask := trace.NewTask(ctx, fmt.Sprintf("enrich-%d", idx))
                defer subTask.End()

                time.Sleep(3 * time.Millisecond)
                enrichments[idx] = fmt.Sprintf("e%d", idx)
            }()
        }

        wg.Wait()
        for _, e := range enrichments {
            result = result + "-" + e
        }
    })

    // Stage 4: Serialize
    trace.WithRegion(ctx, "serialize", func() {
        time.Sleep(2 * time.Millisecond)
        result = fmt.Sprintf(`{"data": "%s"}`, result)
    })

    return result
}

func main() {
    f, _ := os.Create("trace.out")
    defer f.Close()
    trace.Start(f)
    defer trace.Stop()

    ctx := context.Background()

    // Simulate concurrent requests
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            _ = processRequest(ctx, id)
        }(i)
    }

    wg.Wait()

    fmt.Println("Trace complete")
}

Run and analyze:

go run main.go
go tool trace trace.out

In the viewer:

  • Click on request tasks to see execution timeline
  • Identify which stages are bottlenecks
  • See parallelism during enrichment stage
  • Observe scheduling of concurrent requests

Sampling Strategies for Production

Full tracing has overhead. Use sampling for production:

package main

import (
    "math/rand"
    "net/http"
    _ "net/http/pprof"
    "os"
    "runtime/trace"
    "sync/atomic"
    "time"
)

var traceEnabled atomic.Bool

func enableTracing() {
    // CompareAndSwap ensures only one goroutine starts a trace even
    // when several requests sample at the same time; a separate
    // Load-then-Store check would race
    if !traceEnabled.CompareAndSwap(false, true) {
        return
    }

    f, err := os.Create("prod-trace.out")
    if err != nil {
        traceEnabled.Store(false)
        return
    }
    if err := trace.Start(f); err != nil {
        f.Close()
        traceEnabled.Store(false)
        return
    }

    // Stop after 10 seconds
    time.AfterFunc(10*time.Second, func() {
        trace.Stop()
        f.Close()
        traceEnabled.Store(false)
    })
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
    // Sample 1% of requests for tracing
    if rand.Float32() < 0.01 && !traceEnabled.Load() {
        enableTracing()
    }

    w.Write([]byte("OK"))
}

func main() {
    http.HandleFunc("/api", handleRequest)
    http.ListenAndServe(":8080", nil)
}

Correlating Traces with pprof Data

Combine tracing and profiling for a complete picture:

# Capture both simultaneously
go test -bench=BenchmarkWork \
    -trace=trace.out \
    -cpuprofile=cpu.prof \
    -benchmem

# Analyze trace for scheduling issues
go tool trace trace.out

# Analyze pprof for hotspots
go tool pprof cpu.prof

Use trace viewer to identify when hotspots occur, then use pprof to understand why they're expensive.

Trace Viewer Navigation Tips

  • View Options: Toggle goroutines, processors, network blocking
  • Zoom: Select time range to focus on specific events
  • Search: Find specific goroutine IDs or task names
  • Statistics: View execution stats per goroutine
  • Synchronization: See channel, lock, and semaphore events
  • Execution Timeline: Drag to pan, scroll to zoom on time axis

Common Trace Patterns

Runnable Goroutine Spike: Many goroutines queued for CPU

  • Solution: Reduce concurrency or add backpressure

Blocking During GC: Goroutines suspended by garbage collection

  • Solution: Reduce allocation rate or tune GC with GOGC

Channel Contention: Orange blocking on channel operations

  • Solution: Buffer channels or redesign communication pattern

Lock Contention: Goroutines waiting for mutexes

  • Solution: Reduce critical section size or use lock-free algorithms

Key Takeaways

  • Use runtime/trace to capture detailed execution flow
  • Tasks and regions annotate traces with application semantics
  • Trace viewer reveals scheduling inefficiencies invisible to pprof
  • GC impact becomes visible in traces
  • Sampling reduces production overhead
  • Correlate traces with pprof for complete performance picture
  • Goroutine timeline shows when (not just how much) work happens
  • Identify contention on channels, locks, and I/O operations
