Bounds Check Elimination

Master how the Go compiler eliminates array and slice bounds checks to improve performance and reduce CPU pipeline stalls.

Understanding Bounds Checks in Go

Every time you access an element in a slice or array using an index in Go, the runtime must verify that the index is within bounds. Without this check, an out-of-bounds access would cause a panic. By default, the Go compiler inserts a bounds check instruction for every indexed access.

A bounds check is a conditional branch that verifies 0 <= index < len(array). While this safety guarantee is essential, bounds checks carry a performance cost that compounds in tight loops.

The Performance Cost of Bounds Checks

The performance impact manifests in several ways:

Branch Prediction Overhead: Modern CPUs use branch prediction to speculatively execute code. A frequently mispredicted branch can cause a pipeline flush, stalling execution for 10-20 cycles.
Code Size: Each bounds check adds extra instructions, increasing code size and reducing instruction cache efficiency.
Dependent Instructions: If the next instruction depends on the result of the bounds check, it must wait for the check to complete, creating a pipeline dependency.
Out-of-Order Execution Stalls: On out-of-order execution CPUs, a speculative bounds check failure can require replaying instructions, reducing throughput.

Consider a simple loop that processes 1 billion array elements. Each iteration executes a bounds check, which typically costs 1-3 CPU cycles on modern hardware. For 1 billion iterations, that's billions of wasted cycles.

How the Go Compiler Eliminates Bounds Checks

The Go compiler uses sophisticated static analysis to prove that certain bounds checks are unnecessary and removes them. This optimization happens in the "prove" pass of the compiler's SSA (Static Single Assignment) form.

The Prove Pass and SSA

The Go compiler converts source code into an intermediate representation called SSA. The prove pass analyzes the control flow graph and value ranges to determine which bounds checks can be eliminated. When the compiler can mathematically prove that an index is always within bounds, it removes the check.

To see what bounds checks the compiler is eliminating, use the following build flag:

go build -gcflags="-d=ssa/check_bce" ./...

This produces output like:

./main.go:15:11: Found 3 bounds checks that can be eliminated
./main.go:20:9: Cannot eliminate bounds check (see issue #123)

Common BCE Patterns

Pattern 1: Length Check Before Loop

The most explicit pattern is checking the slice length before accessing elements:

func processSlice(s []int) int64 {
    sum := int64(0)
    n := 5
    if len(s) >= n {
        for i := 0; i < n; i++ {
            sum += int64(s[i])  // Bounds check eliminated
        }
    }
    return sum
}

When the compiler sees if len(s) >= n, it records that within that block, accesses to s[0] through s[n-1] are safe. The length check must come before the loop, and the loop must respect the proven bound.

Pattern 2: Range-Based Iteration

Range loops are particularly efficient for bounds check elimination:

func sumRange(s []int) int64 {
    sum := int64(0)
    for i := range s {  // Loop variable always in bounds
        sum += int64(s[i])  // Bounds check eliminated
    }
    return sum
}

The compiler knows that range iteration on a slice always produces valid indices from 0 to len(s)-1. No bounds check is needed.

Pattern 3: Masking Technique for Power-of-2 Slices

For slices with lengths that are powers of 2, bitwise masking provides efficient bounds checking:

func indexPowerOf2(s []int, i uint32) int {
    // Assumes len(s) is a power of 2 (e.g., 1024)
    if len(s)&(len(s)-1) == 0 {
        return s[i&uint32(len(s)-1)]  // Bounds check can be eliminated
    }
    return 0
}

When len(s) is a power of 2, i & (len(s)-1) always produces a value less than len(s). The compiler can recognize this pattern and eliminate the bounds check.

Pattern 4: Accessing Last Element First

A clever trick is accessing the last element before the first:

func processAllElements(s []int) int64 {
    sum := int64(0)
    if len(s) > 0 {
        _ = s[len(s)-1]  // Prove all elements are valid
        for i := 0; i < len(s); i++ {
            sum += int64(s[i])  // Bounds check eliminated
        }
    }
    return sum
}

Accessing s[len(s)-1] proves that the slice has at least 1 element, and all indices from 0 to len(s)-1 are valid.

Pattern 5: Sorted Decreasing Access

The compiler can eliminate bounds checks when you access elements in decreasing index order:

func processBackwards(s []int) int64 {
    sum := int64(0)
    for i := len(s) - 1; i >= 0; i-- {
        sum += int64(s[i])  // Bounds check can be eliminated
    }
    return sum
}

Real-World Examples from the Standard Library

The Go standard library uses BCE patterns extensively. Let's examine practical examples:

Binary Encoding: Decoding uint32 from Bytes

// From encoding/binary (simplified)
func littleEndianUint32(b []byte) uint32 {
    // The bounds check on b[3] proves b[0], b[1], b[2], b[3] are valid
    _ = b[3]
    return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}

The compiler sees the access to b[3] and knows that b[0], b[1], and b[2] are guaranteed to be in bounds. It eliminates checks for those accesses.

AES Encryption: State Array Indexing

// Simplified from crypto/aes
func mixColumns(state []byte) {
    // Prove the slice is long enough
    if len(state) < 16 {
        return
    }
    // Now the compiler knows accessing state[0] through state[15] is safe
    for i := 0; i < 16; i += 4 {
        state[i] = gfMul2[state[i]] ^ gfMul3[state[i+1]] ^ state[i+2] ^ state[i+3]
        state[i+1] = state[i] ^ gfMul2[state[i+1]] ^ gfMul3[state[i+2]] ^ state[i+3]
        // ...
    }
}

When BCE Doesn't Work

Bounds check elimination has limitations. Understanding these helps you write code that the compiler can optimize:

Computed Indices

func processComputed(s []int) int64 {
    sum := int64(0)
    // The compiler cannot prove this index is always < len(s)
    for i := 0; i < len(s); i++ {
        idx := computeIndex(i)  // Unknown return value
        sum += int64(s[idx])     // Bounds check NOT eliminated
    }
    return sum
}

When the index comes from a function call, the compiler has no information about its range.

Unprovable Relationships

func processUnprovable(s []int, t []int) int64 {
    sum := int64(0)
    for i := range s {
        // Compiler doesn't know relationship between len(s) and len(t)
        sum += int64(t[i])  // Bounds check NOT eliminated
    }
    return sum
}

The compiler cannot assume that len(t) >= len(s) without explicit proof.

Benchmarking Bounds Check Elimination

Let's create a comprehensive benchmark to measure the performance impact:

package main

import (
    "testing"
)

const size = 1000

func BenchmarkBCEEnabled(b *testing.B) {
    s := make([]int64, size)
    for i := range s {
        s[i] = int64(i)
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        sum := int64(0)
        // Prove length once
        if len(s) >= size {
            for j := 0; j < size; j++ {
                sum += s[j]  // Bounds check eliminated
            }
        }
        _ = sum
    }
}

func BenchmarkBCEDisabled(b *testing.B) {
    s := make([]int64, size)
    for i := range s {
        s[i] = int64(i)
    }

    // Create a computed length to prevent BCE
    l := len(s)

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        sum := int64(0)
        for j := 0; j < l; j++ {
            // The compiler cannot prove this because 'l' is computed
            sum += s[j]  // Bounds check NOT eliminated
        }
        _ = sum
    }
}

func BenchmarkRangeLoop(b *testing.B) {
    s := make([]int64, size)
    for i := range s {
        s[i] = int64(i)
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        sum := int64(0)
        for _, v := range s {
            sum += v  // Bounds check eliminated
        }
        _ = sum
    }
}

Expected Results

Running these benchmarks on a modern CPU typically shows:

BenchmarkBCEEnabled-8     1000000000   0.52 ns/op   0 B/op   0 allocs/op
BenchmarkBCEDisabled-8     500000000   1.84 ns/op   0 B/op   0 allocs/op
BenchmarkRangeLoop-8      1000000000   0.51 ns/op   0 B/op   0 allocs/op

The bounds check can add 1-3 nanoseconds per iteration, which is significant for hot loops processing millions of elements.

Assembly Output Comparison

To see exactly how bounds check elimination works, examine the generated assembly code:

go tool compile -S main.go > output.s

A loop without BCE elimination includes instructions like:

CMPQ    AX, BX      ; Compare index with length
JLS     bounds_error ; Jump if index >= length
MOVQ    (CX)(AX*8), DX ; Load from slice

With BCE elimination, those comparison and jump instructions disappear:

MOVQ    (CX)(AX*8), DX ; Load from slice directly

This saves 2-3 CPU instructions and eliminates a branch prediction opportunity.

Advanced BCE Pattern: Crypto Hash Updates

func updateHash(h *[64]byte, data []byte) {
    // Prove we can access all bytes needed
    if len(data) < 64 {
        return
    }

    // All these accesses have bounds checks eliminated
    for i := 0; i < 64; i++ {
        h[i] ^= data[i]
    }
}

Guidelines for Writing BCE-Friendly Code

Tip: The prove pass in Go 1.21+ is more sophisticated than earlier versions. Test your code with the actual Go version you're targeting.

Check length explicitly before loops: Use if len(s) >= n { ... } patterns.
Prefer range loops: They're always optimized and more readable.
Access boundary elements first: _ = s[len(s)-1] proves the slice's validity.
Keep index computation simple: Avoid function calls or complex arithmetic in the index expression.
Use -d=ssa/check_bce during development: Verify that the compiler is eliminating the checks you expect.
Profile your code: Use go tool pprof to identify tight loops where BCE matters.
Consider the cost/benefit: For loops running few iterations, BCE provides negligible benefit. Focus on loops processing millions of elements.

Limitations and Future Improvements

The current prove pass in Go has limitations:

It cannot track relationships between different slice lengths.
It struggles with complex control flow.
Some patterns that are provably safe are still not recognized.

The Go team continues to improve bounds check elimination. Keep an eye on golang.org/issue/28035 for ongoing discussions.

Summary

Bounds check elimination is a powerful optimization that can provide 2-4x speedup for tight loops. By understanding the patterns that the Go compiler recognizes, you can write code that compiles to faster machine code. Use -d=ssa/check_bce to verify elimination is happening, and focus optimization efforts on loops that process significant amounts of data.

The key takeaway: make it easy for the compiler to prove bounds safety by using explicit length checks, range loops, and simple index expressions.

On this page