Pointer Optimization Patterns
Understanding pointer dereferences in loops, nil checks, escape analysis, and how to write pointer-heavy code that the compiler can optimize effectively.
Introduction
Pointers in Go are straightforward conceptually, but their use in performance-critical code requires understanding compiler behavior. The Go compiler performs escape analysis, optimizes dereferencing patterns, and applies transformations that dramatically affect performance. This article explores patterns that help the compiler generate optimal code.
Pointer Dereferences in Loops
One of the most impactful micro-optimizations involves hoisting pointer loads out of loops. Consider this pattern:
type Counter struct {
Value int64
}
func AccumulateViaPointer(counter *Counter, n int) int64 {
sum := int64(0)
for i := 0; i < n; i++ {
sum += counter.Value // Pointer dereference in loop
}
return sum
}
func AccumulateViaLocal(counter *Counter, n int) int64 {
value := counter.Value // Dereference once
sum := int64(0)
for i := 0; i < n; i++ {
sum += value // Use local variable
}
return sum
}

The difference is dramatic. The compiler cannot hoist the pointer dereference out of the loop, because the pointed-to value might change between iterations (for example, if another goroutine modifies it). When you dereference into a local variable before the loop, the compiler can keep that value in a register:
Loop with pointer dereference:
MOV RAX, [RBX + offset] // Load from memory each iteration
ADD RCX, RAX
JMP loop

Loop with local variable:
// value already held in RAX (no memory access in the loop)
ADD RCX, RAX
JMP loop

Benchmark:
func BenchmarkAccumulateViaPointer(b *testing.B) {
counter := &Counter{Value: 42}
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = AccumulateViaPointer(counter, 10000)
}
}
func BenchmarkAccumulateViaLocal(b *testing.B) {
counter := &Counter{Value: 42}
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = AccumulateViaLocal(counter, 10000)
}
}

Results: AccumulateViaLocal is 5-10x faster on typical CPUs because:
- The value stays in a fast register
- Per-iteration memory loads are eliminated
- The loop generates no memory traffic, so it doesn't depend on cache behavior
This is especially pronounced with larger loop bodies where the compiler can better schedule instructions.
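The hoist is only valid because nothing mutates Value during the loop. If another goroutine does update the counter, the load must stay inside the loop and go through sync/atomic. A minimal sketch contrasting the two cases (AtomicCounter, AccumulateLive, and AccumulateSnapshot are illustrative names, not from the article):

```go
package main

import "sync/atomic"

// AtomicCounter's Value may be updated concurrently via atomic stores.
type AtomicCounter struct {
	Value int64
}

// AccumulateLive must observe concurrent updates, so the atomic load
// stays inside the loop: the compiler cannot (and must not) hoist it.
func AccumulateLive(c *AtomicCounter, n int) int64 {
	sum := int64(0)
	for i := 0; i < n; i++ {
		sum += atomic.LoadInt64(&c.Value) // re-read every iteration
	}
	return sum
}

// AccumulateSnapshot is the hoisted form: one load, then register math.
// Only valid when the value is stable for the duration of the loop.
func AccumulateSnapshot(c *AtomicCounter, n int) int64 {
	v := atomic.LoadInt64(&c.Value)
	sum := int64(0)
	for i := 0; i < n; i++ {
		sum += v
	}
	return sum
}

func main() {}
```

Choosing between the two is a semantic decision, not a performance one: hoist only when a stale snapshot is acceptable.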
Nil Pointer Checks in Loops
The Go compiler inserts nil checks on array access through pointers. These checks happen on every iteration:
func SumArrayViaPointer(arr *[1000]int, n int) int64 {
sum := int64(0)
for i := 0; i < n; i++ {
sum += int64(arr[i]) // Nil check inserted by compiler
}
return sum
}
// Generated assembly contains:
// TESTB AL, AL ; nil check
// for each array access

You can see the impact with benchmarks:
func BenchmarkArrayViaPointer(b *testing.B) {
arr := &[1000]int{}
for i := range arr {
arr[i] = int(i)
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
sum := int64(0)
for j := 0; j < 1000; j++ {
sum += int64(arr[j])
}
_ = sum
}
}
func BenchmarkArrayViaSlice(b *testing.B) {
arr := &[1000]int{}
for i := range arr {
arr[i] = int(i)
}
s := arr[:]
b.ResetTimer()
for i := 0; i < b.N; i++ {
sum := int64(0)
for j := 0; j < 1000; j++ {
sum += int64(s[j])
}
_ = sum
}
}

Results: The slice version is faster because a slice access needs only a bounds check against the slice's length; no nil check is repeated for each element access.
Workaround 1: Dereference Before Loop
func SumArrayOptimized(arr *[1000]int, n int) int64 {
_ = *arr // Nil check here, hoisted
sum := int64(0)
for i := 0; i < n; i++ {
sum += int64(arr[i]) // No nil check in loop
}
return sum
}

By dereferencing once before the loop, you move the nil check outside, performing it once instead of on every iteration.
Workaround 2: Convert to Slice
func SumArrayAsSlice(arr *[1000]int, n int) int64 {
s := arr[:] // Convert to slice
sum := int64(0)
for i := 0; i < n; i++ {
sum += int64(s[i])
}
return sum
}

The compiler is smarter about slice bounds, eliminating many checks.
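A related trick works for bounds checks: a single access at the highest index before the loop lets the compiler prove all later accesses are in range. A sketch of this bounds-check-elimination hint (SumFirstN is an illustrative name, not from the article):

```go
package main

// SumFirstN sums the first n elements of s. The one access at s[n-1]
// validates the range up front, so the compiler can prove every s[i]
// in the loop is in bounds and drop the per-iteration checks.
func SumFirstN(s []int, n int) int64 {
	if n <= 0 || n > len(s) {
		return 0
	}
	_ = s[n-1] // bounds-check hint, analogous to _ = *arr for nil checks
	sum := int64(0)
	for i := 0; i < n; i++ {
		sum += int64(s[i])
	}
	return sum
}

func main() {}
```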
Struct Field Access Through Pointers in Loops
Accessing struct fields through pointers in loops causes repeated dereferences:
type Counter struct {
Value int64
Name string
}
func AccumulateStruct(c *Counter, n int) int64 {
sum := int64(0)
for i := 0; i < n; i++ {
sum += c.Value // Pointer dereference for each field access
}
return sum
}
func AccumulateStructOptimized(c *Counter, n int) int64 {
value := c.Value // Dereference once
sum := int64(0)
for i := 0; i < n; i++ {
sum += value
}
return sum
}

Generated assembly for the unoptimized version:
loop:
MOV RAX, [RBX] ; Load Value field from memory (c held in RBX; Value is at offset 0)
ADD RDX, RAX
JMP loop

Generated assembly for the optimized version:
MOV RAX, [RBX] ; Load Value field once, before the loop
loop:
ADD RDX, RAX
JMP loop

Benchmark:
func BenchmarkStructFieldPointer(b *testing.B) {
c := &Counter{Value: 42}
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = AccumulateStruct(c, 10000)
}
}
func BenchmarkStructFieldOptimized(b *testing.B) {
c := &Counter{Value: 42}
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = AccumulateStructOptimized(c, 10000)
}
}

Results: the optimized version is 5-8x faster.
Escape Analysis and Pointer Parameters
When you pass a value by pointer, the compiler must work out whether the pointer can escape to the heap. This affects optimization:
// Function with pointer parameter
func ProcessPointer(p *Config) {
fmt.Println(p.Name)
}
// Function with value parameter
func ProcessValue(c Config) {
fmt.Println(c.Name)
}

When you call ProcessPointer(&config), the compiler must perform escape analysis:
- Does ProcessPointer store the pointer somewhere?
- Does it pass the pointer to another function?
- Could it leak outside the function?
If the compiler can't prove the pointer doesn't escape, it allocates config on the heap.
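You can watch these decisions directly: building with go build -gcflags=-m makes the compiler print its escape-analysis verdicts. A sketch (the exact diagnostic wording varies by Go version, and the line numbers shown are placeholders):

```go
package main

type Config struct {
	Name string
}

var global *Config

// escapes returns a pointer to its local, so escape analysis must
// heap-allocate it. `go build -gcflags=-m` reports something like:
//   ./main.go:NN: &Config{...} escapes to heap
func escapes() *Config {
	c := &Config{Name: "kept"}
	return c
}

// staysLocal uses its value only within the function, so the
// compiler reports it as not escaping and keeps it on the stack.
func staysLocal() int {
	c := Config{Name: "temp"}
	return len(c.Name)
}

func main() {
	global = escapes()
	_ = staysLocal()
}
```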
// This allocation is forced by escape analysis
var globalConfig *Config
func SetConfig(c *Config) {
globalConfig = c // Pointer escapes
}
func main() {
config := Config{Name: "test"}
SetConfig(&config) // config must be heap-allocated
}

Decision: Value vs Pointer Parameters
Use value parameters when:
- Type is small (≤ 4 machine words / 32 bytes on 64-bit)
- Function doesn't modify the value
- No escape to heap
Use pointer parameters when:
- Type is large (> 32 bytes)
- Function modifies the value and the caller must see the changes
- Explicit intent to share ownership
type SmallConfig struct {
A int64
B int64
C int64
D int64
}
type LargeConfig struct {
Settings [1000]string
Data [10000]int64
}
// Small: pass by value
func ProcessSmall(c SmallConfig) {
// Efficient: value stays on stack
}
// Large: pass by pointer
func ProcessLarge(c *LargeConfig) {
// Efficient: only pointer (8 bytes) passed
}

Named Return Values vs Anonymous Returns
Named return values can sometimes enable compiler optimizations:
// Anonymous return
func AnonymousReturn() (int, error) {
result := 0
err := doWork()
if err != nil {
return 0, err
}
result = 42
return result, nil
}
// Named return
func NamedReturn() (result int, err error) {
err = doWork()
if err != nil {
return // Uses named return values
}
result = 42
return
}

Named returns can enable minor optimizations in some cases, but the difference is subtle. The main benefits are clarity and allowing a deferred function to modify the return values.
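The defer interaction is worth seeing concretely: only a named result can still be changed after the function body has executed its return statement. A minimal sketch (WrapError is an illustrative name):

```go
package main

import "errors"

// WrapError's deferred function runs after `return` has assigned err,
// and can rewrite the named result before the caller sees it.
func WrapError() (err error) {
	defer func() {
		if err != nil {
			err = errors.New("wrapped: " + err.Error())
		}
	}()
	return errors.New("original failure")
}

func main() {}
```

With an anonymous error return, the deferred closure would have no way to reach the value already handed to the caller.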
Local Variables and Register Allocation
The compiler tries to keep frequently-used variables in registers. Intermediate calculations should use local variables:
// Sub-optimal: intermediate calculations done repeatedly
func Calculate(p *Point, iterations int) float64 {
sum := 0.0
for i := 0; i < iterations; i++ {
sum += math.Sqrt(float64(p.X*p.X + p.Y*p.Y))
}
return sum
}
// Better: cache intermediate values
func CalculateOptimized(p *Point, iterations int) float64 {
x := float64(p.X)
y := float64(p.Y)
sum := 0.0
for i := 0; i < iterations; i++ {
sum += math.Sqrt(x*x + y*y)
}
return sum
}

The optimized version:
- Performs pointer dereferences outside the loop
- Caches the conversion to float64
- Lets the compiler keep values in registers
Pulling Allocations Out of Hot Paths
When functions allocate memory, pull allocations outside frequently-called code:
// Slow: allocates on every call
func ProcessData(data []int) {
buf := make([]byte, 1024)
process(buf)
}
// Fast: allocate once, reuse across calls
var buf = make([]byte, 1024)
func ProcessDataReused(data []int) {
clear(buf) // reset the shared buffer before reuse
process(buf)
}

Benchmark:
func BenchmarkAllocInHotPath(b *testing.B) {
for i := 0; i < b.N; i++ {
buf := make([]byte, 1024)
_ = buf
}
}
func BenchmarkAllocOutsideHotPath(b *testing.B) {
buf := make([]byte, 1024)
b.ResetTimer()
for i := 0; i < b.N; i++ {
clear(buf)
_ = buf
}
}

Results: allocating outside the hot path is orders of magnitude faster, because each make call pays for allocation (and later garbage collection), while clear only zeroes memory that already exists.
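One caveat: a single package-level buffer is not safe if the function can run on multiple goroutines. For concurrent hot paths, sync.Pool provides reuse without sharing; a sketch (ProcessWithPool is an illustrative name):

```go
package main

import "sync"

// bufPool hands out reusable 1 KiB buffers. Storing *[]byte rather
// than []byte avoids an extra allocation when the slice header is
// boxed into an interface on Put.
var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, 1024)
		return &b
	},
}

// ProcessWithPool borrows a buffer, uses it, and returns it for reuse.
// Get allocates via New only when the pool is empty, so steady-state
// calls skip the allocator entirely.
func ProcessWithPool(data []int) int {
	bp := bufPool.Get().(*[]byte)
	defer bufPool.Put(bp) // hand the buffer to the next caller
	buf := *bp

	// Illustrative work: fold the input into the buffer.
	for i, v := range data {
		buf[i%len(buf)] ^= byte(v)
	}
	return len(data)
}

func main() {}
```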
String Concatenation Pattern
String concatenation in loops should use strings.Builder:
// Slow: creates new string each iteration
func BuildSlow(items []string) string {
result := ""
for _, item := range items {
result += item + ","
}
return result
}
// Fast: amortized allocation via strings.Builder
func BuildFast(items []string) string {
var b strings.Builder
for i, item := range items {
if i > 0 {
b.WriteString(",")
}
b.WriteString(item)
}
return b.String()
}

Benchmark:
func BenchmarkStringConcatLoop(b *testing.B) {
items := []string{"apple", "banana", "cherry"}
b.ResetTimer()
for i := 0; i < b.N; i++ {
result := ""
for _, item := range items {
result += item + ","
}
}
}
func BenchmarkStringBuilder(b *testing.B) {
items := []string{"apple", "banana", "cherry"}
b.ResetTimer()
for i := 0; i < b.N; i++ {
var sb strings.Builder
for _, item := range items {
sb.WriteString(item)
sb.WriteString(",")
}
_ = sb.String()
}
}

Results: Builder is 100x+ faster for large inputs. Naive += copies the entire accumulated string on every iteration (quadratic work overall), while Builder grows its buffer geometrically and appends in place.
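When the final length is predictable, Builder's Grow method removes even the intermediate growth steps: one up-front allocation, then pure appends. A sketch (JoinComma is an illustrative name; for plain joining, the standard library's strings.Join takes the same approach internally):

```go
package main

import "strings"

// JoinComma pre-sizes the builder so the write loop never reallocates:
// total is computed exactly, Grow reserves it once, and each
// WriteString appends into the reserved space.
func JoinComma(items []string) string {
	total := 0
	for _, s := range items {
		total += len(s) + 1 // item plus its trailing comma
	}
	var b strings.Builder
	b.Grow(total)
	for _, s := range items {
		b.WriteString(s)
		b.WriteString(",")
	}
	return b.String()
}

func main() {}
```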
Memory Access Patterns and Cache
Beyond the compiler, CPU cache behavior affects performance:
// Sequential memory access (cache-friendly)
func SumSequential(data []int) int {
sum := 0
for i := 0; i < len(data); i++ {
sum += data[i] // Predictable, sequential
}
return sum
}
// Random access (cache-hostile)
func SumRandom(data []int, indices []int) int {
sum := 0
for _, idx := range indices {
sum += data[idx] // Unpredictable, random
}
return sum
}

The sequential version is typically 10-100x faster depending on data size, because:
- CPU prefetcher predicts the pattern
- Cache hits are nearly 100%
- Memory bandwidth is efficiently used
The random version suffers cache misses, memory stalls, and no prefetching benefit.
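To reproduce the effect, you only need a shuffled index permutation: both access patterns then visit exactly the same elements and return the same sum, so a benchmark isolates access order alone. A sketch using math/rand's Shuffle (BuildRandomIndices and SumByIndex are illustrative names):

```go
package main

import "math/rand"

// BuildRandomIndices returns a random permutation of 0..n-1. Feeding
// it to an indexed sum turns a sequential walk into cache-hostile
// access while leaving the set of visited elements unchanged.
func BuildRandomIndices(n int) []int {
	idx := make([]int, n)
	for i := range idx {
		idx[i] = i
	}
	rand.Shuffle(n, func(i, j int) { idx[i], idx[j] = idx[j], idx[i] })
	return idx
}

// SumByIndex mirrors the article's SumRandom: it visits data in the
// order given by indices.
func SumByIndex(data []int, indices []int) int {
	sum := 0
	for _, i := range indices {
		sum += data[i]
	}
	return sum
}

func main() {}
```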
Practical Optimization Pattern: Hot Path Extraction
Identify your hot path and optimize aggressively:
// Typical pattern: 99% of time in this loop
func ProcessMillionsOfItems(items []*Item) int64 {
sum := int64(0)
for _, item := range items {
sum += processItem(item)
}
return sum
}
// Optimization: hoist pointer dereferences
func ProcessItemsOptimized(items []*Item) int64 {
sum := int64(0)
for i := 0; i < len(items); i++ {
item := items[i]
// Cache-friendly, pointer dereferenced once per iteration
sum += item.Value
}
return sum
}

Profile first to confirm where time is actually spent.
Summary and Recommendations
- Hoist pointer dereferences: Load values into local variables before loops. Saves 5-10x.
- Move nil checks: Use _ = *ptr or convert to a slice to move nil checks out of loops.
- Cache struct fields: Load struct fields into local variables before hot loops.
- Choose value vs pointer wisely: Under 32 bytes (usually): pass by value. Over 32 bytes, or when the callee must mutate the value: use a pointer.
- Pull allocations out: Allocate once outside hot paths, reuse inside.
- Use Builder for strings: Always use strings.Builder for string concatenation in loops.
- Consider memory layout: Sequential access is 10-100x faster than random access.
- Profile before optimizing: These patterns are high-impact, but profile to confirm they're in your hot path.
- Trust escape analysis: Modern Go is good at optimizing, but help it by:
  - Using small value types when possible
  - Making escape obvious (or impossible) to the compiler
  - Avoiding unnecessary allocations
- Measure with benchmarks: Small optimizations add up. Use go test -bench to verify improvements.
The key principle: The compiler is your partner. Understand what the compiler can and can't optimize, and write code that makes its job easier. Simple changes like hoisting pointer loads can deliver 5-10x performance improvements with no change to algorithm or data structure.