Bounds Check Elimination
Master how the Go compiler eliminates array and slice bounds checks to improve performance and reduce CPU pipeline stalls.
Understanding Bounds Checks in Go
Every time you access an element in a slice or array using an index in Go, the runtime must verify that the index is within bounds. Without this check, an out-of-bounds access would cause a panic. By default, the Go compiler inserts a bounds check instruction for every indexed access.
A bounds check is a conditional branch that verifies 0 <= index < len(array). While this safety guarantee is essential, bounds checks carry a performance cost that compounds in tight loops.
The Performance Cost of Bounds Checks
The performance impact manifests in several ways:
-
Branch Prediction Overhead: Modern CPUs use branch prediction to speculatively execute code. A frequently mispredicted branch can cause a pipeline flush, stalling execution for 10-20 cycles.
-
Code Size: Each bounds check adds extra instructions, increasing code size and reducing instruction cache efficiency.
-
Dependent Instructions: If the next instruction depends on the result of the bounds check, it must wait for the check to complete, creating a pipeline dependency.
-
Out-of-Order Execution Stalls: On out-of-order execution CPUs, a speculative bounds check failure can require replaying instructions, reducing throughput.
Consider a simple loop that processes 1 billion array elements. Each iteration executes a bounds check, which typically costs 1-3 CPU cycles on modern hardware. For 1 billion iterations, that's billions of wasted cycles.
How the Go Compiler Eliminates Bounds Checks
The Go compiler uses sophisticated static analysis to prove that certain bounds checks are unnecessary and removes them. This optimization happens in the "prove" pass of the compiler's SSA (Static Single Assignment) form.
The Prove Pass and SSA
The Go compiler converts source code into an intermediate representation called SSA. The prove pass analyzes the control flow graph and value ranges to determine which bounds checks can be eliminated. When the compiler can mathematically prove that an index is always within bounds, it removes the check.
To see what bounds checks the compiler is eliminating, use the following build flag:
go build -gcflags="-d=ssa/check_bce" ./...This produces output like:
./main.go:15:11: Found 3 bounds checks that can be eliminated
./main.go:20:9: Cannot eliminate bounds check (see issue #123)Common BCE Patterns
Pattern 1: Length Check Before Loop
The most explicit pattern is checking the slice length before accessing elements:
func processSlice(s []int) int64 {
sum := int64(0)
n := 5
if len(s) >= n {
for i := 0; i < n; i++ {
sum += int64(s[i]) // Bounds check eliminated
}
}
return sum
}When the compiler sees if len(s) >= n, it records that within that block, accesses to s[0] through s[n-1] are safe. The length check must come before the loop, and the loop must respect the proven bound.
Pattern 2: Range-Based Iteration
Range loops are particularly efficient for bounds check elimination:
func sumRange(s []int) int64 {
sum := int64(0)
for i := range s { // Loop variable always in bounds
sum += int64(s[i]) // Bounds check eliminated
}
return sum
}The compiler knows that range iteration on a slice always produces valid indices from 0 to len(s)-1. No bounds check is needed.
Pattern 3: Masking Technique for Power-of-2 Slices
For slices with lengths that are powers of 2, bitwise masking provides efficient bounds checking:
func indexPowerOf2(s []int, i uint32) int {
// Assumes len(s) is a power of 2 (e.g., 1024)
if len(s)&(len(s)-1) == 0 {
return s[i&uint32(len(s)-1)] // Bounds check can be eliminated
}
return 0
}When len(s) is a power of 2, i & (len(s)-1) always produces a value less than len(s). The compiler can recognize this pattern and eliminate the bounds check.
Pattern 4: Accessing Last Element First
A clever trick is accessing the last element before the first:
func processAllElements(s []int) int64 {
sum := int64(0)
if len(s) > 0 {
_ = s[len(s)-1] // Prove all elements are valid
for i := 0; i < len(s); i++ {
sum += int64(s[i]) // Bounds check eliminated
}
}
return sum
}Accessing s[len(s)-1] proves that the slice has at least 1 element, and all indices from 0 to len(s)-1 are valid.
Pattern 5: Sorted Decreasing Access
The compiler can eliminate bounds checks when you access elements in decreasing index order:
func processBackwards(s []int) int64 {
sum := int64(0)
for i := len(s) - 1; i >= 0; i-- {
sum += int64(s[i]) // Bounds check can be eliminated
}
return sum
}Real-World Examples from the Standard Library
The Go standard library uses BCE patterns extensively. Let's examine practical examples:
Binary Encoding: Decoding uint32 from Bytes
// From encoding/binary (simplified)
func littleEndianUint32(b []byte) uint32 {
// The bounds check on b[3] proves b[0], b[1], b[2], b[3] are valid
_ = b[3]
return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}The compiler sees the access to b[3] and knows that b[0], b[1], and b[2] are guaranteed to be in bounds. It eliminates checks for those accesses.
AES Encryption: State Array Indexing
// Simplified from crypto/aes
func mixColumns(state []byte) {
// Prove the slice is long enough
if len(state) < 16 {
return
}
// Now the compiler knows accessing state[0] through state[15] is safe
for i := 0; i < 16; i += 4 {
state[i] = gfMul2[state[i]] ^ gfMul3[state[i+1]] ^ state[i+2] ^ state[i+3]
state[i+1] = state[i] ^ gfMul2[state[i+1]] ^ gfMul3[state[i+2]] ^ state[i+3]
// ...
}
}When BCE Doesn't Work
Bounds check elimination has limitations. Understanding these helps you write code that the compiler can optimize:
Computed Indices
func processComputed(s []int) int64 {
sum := int64(0)
// The compiler cannot prove this index is always < len(s)
for i := 0; i < len(s); i++ {
idx := computeIndex(i) // Unknown return value
sum += int64(s[idx]) // Bounds check NOT eliminated
}
return sum
}When the index comes from a function call, the compiler has no information about its range.
Unprovable Relationships
func processUnprovable(s []int, t []int) int64 {
sum := int64(0)
for i := range s {
// Compiler doesn't know relationship between len(s) and len(t)
sum += int64(t[i]) // Bounds check NOT eliminated
}
return sum
}The compiler cannot assume that len(t) >= len(s) without explicit proof.
Benchmarking Bounds Check Elimination
Let's create a comprehensive benchmark to measure the performance impact:
package main
import (
"testing"
)
const size = 1000
func BenchmarkBCEEnabled(b *testing.B) {
s := make([]int64, size)
for i := range s {
s[i] = int64(i)
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
sum := int64(0)
// Prove length once
if len(s) >= size {
for j := 0; j < size; j++ {
sum += s[j] // Bounds check eliminated
}
}
_ = sum
}
}
func BenchmarkBCEDisabled(b *testing.B) {
s := make([]int64, size)
for i := range s {
s[i] = int64(i)
}
// Create a computed length to prevent BCE
l := len(s)
b.ResetTimer()
for i := 0; i < b.N; i++ {
sum := int64(0)
for j := 0; j < l; j++ {
// The compiler cannot prove this because 'l' is computed
sum += s[j] // Bounds check NOT eliminated
}
_ = sum
}
}
func BenchmarkRangeLoop(b *testing.B) {
s := make([]int64, size)
for i := range s {
s[i] = int64(i)
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
sum := int64(0)
for _, v := range s {
sum += v // Bounds check eliminated
}
_ = sum
}
}Expected Results
Running these benchmarks on a modern CPU typically shows:
BenchmarkBCEEnabled-8 1000000000 0.52 ns/op 0 B/op 0 allocs/op
BenchmarkBCEDisabled-8 500000000 1.84 ns/op 0 B/op 0 allocs/op
BenchmarkRangeLoop-8 1000000000 0.51 ns/op 0 B/op 0 allocs/opThe bounds check can add 1-3 nanoseconds per iteration, which is significant for hot loops processing millions of elements.
Assembly Output Comparison
To see exactly how bounds check elimination works, examine the generated assembly code:
go tool compile -S main.go > output.sA loop without BCE elimination includes instructions like:
CMPQ AX, BX ; Compare index with length
JLS bounds_error ; Jump if index >= length
MOVQ (CX)(AX*8), DX ; Load from sliceWith BCE elimination, those comparison and jump instructions disappear:
MOVQ (CX)(AX*8), DX ; Load from slice directlyThis saves 2-3 CPU instructions and eliminates a branch prediction opportunity.
Advanced BCE Pattern: Crypto Hash Updates
func updateHash(h *[64]byte, data []byte) {
// Prove we can access all bytes needed
if len(data) < 64 {
return
}
// All these accesses have bounds checks eliminated
for i := 0; i < 64; i++ {
h[i] ^= data[i]
}
}Guidelines for Writing BCE-Friendly Code
Tip: The prove pass in Go 1.21+ is more sophisticated than earlier versions. Test your code with the actual Go version you're targeting.
-
Check length explicitly before loops: Use
if len(s) >= n { ... }patterns. -
Prefer range loops: They're always optimized and more readable.
-
Access boundary elements first:
_ = s[len(s)-1]proves the slice's validity. -
Keep index computation simple: Avoid function calls or complex arithmetic in the index expression.
-
Use
-d=ssa/check_bceduring development: Verify that the compiler is eliminating the checks you expect. -
Profile your code: Use
go tool pprofto identify tight loops where BCE matters. -
Consider the cost/benefit: For loops running few iterations, BCE provides negligible benefit. Focus on loops processing millions of elements.
Limitations and Future Improvements
The current prove pass in Go has limitations:
- It cannot track relationships between different slice lengths.
- It struggles with complex control flow.
- Some patterns that are provably safe are still not recognized.
The Go team continues to improve bounds check elimination. Keep an eye on golang.org/issue/28035 for ongoing discussions.
Summary
Bounds check elimination is a powerful optimization that can provide 2-4x speedup for tight loops. By understanding the patterns that the Go compiler recognizes, you can write code that compiles to faster machine code. Use -d=ssa/check_bce to verify elimination is happening, and focus optimization efforts on loops that process significant amounts of data.
The key takeaway: make it easy for the compiler to prove bounds safety by using explicit length checks, range loops, and simple index expressions.