Specialized Data Structures for Performance

Beyond maps and slices — bloom filters, ring buffers, lock-free queues, skip lists, bitsets, and other specialized data structures for high-performance Go applications.

Introduction

Go's standard library provides excellent general-purpose data structures: slices for dynamic arrays, maps for hash tables, and channels for communication. However, specialized applications demand specialized data structures. A microsecond saved per operation across millions of requests can mean the difference between meeting and missing SLA targets.

This guide covers production-ready data structures optimized for specific performance profiles. Each addresses real performance bottlenecks: memory overhead (bloom filters), latency spikes (ring buffers), lock contention (lock-free structures), sorted traversal (skip lists and B-trees), and bitmap efficiency.

When Standard Library Isn't Enough: Identifying the Need

Before implementing specialized structures, establish concrete requirements:

// Anti-pattern: Using a general-purpose structure for a specialized problem
type Cache struct {
    data map[string]interface{}
    mu   sync.RWMutex
}

// Problem: Every cache miss takes a lock and returns false.
// For a 99% miss rate with 10ms lock contention on 100 cores = wasted CPU.
func (c *Cache) Get(key string) (interface{}, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    val, ok := c.data[key]
    return val, ok
}

Key metrics for deciding when to optimize:

Memory per element: Is it critical? (bloom filters save 10-100x vs maps)
Operation latency: Do tail latencies matter? (lock-free structures reduce p99)
Access pattern: Sequential? (ring buffers are cache-friendly)
Data characteristics: Sorted data? Sparse? High cardinality?
Contention level: Single-threaded vs high-concurrency?

Bloom Filters: Probabilistic Existence Proofs

A bloom filter is a space-efficient probabilistic data structure that answers "might be a member" questions. It trades false positives for dramatically reduced memory.

How Bloom Filters Work

A bloom filter uses:

Bit array of size m
k independent hash functions mapping elements to bit positions
Set operation: hash the element with all k functions, set those bits
Test operation: hash element with all k functions, if all bits are set, "probably yes", else "definitely no"

import "github.com/bits-and-blooms/bloom"

// Create a filter expecting 10M elements with 1% false positive rate
bf := bloom.NewWithEstimates(10_000_000, 0.01)

// Add elements
bf.Add([]byte("user:12345"))
bf.Add([]byte("user:12346"))

// Test membership - may have false positives, never false negatives
if bf.Test([]byte("user:12345")) {
    println("Probably in filter")
}

if !bf.Test([]byte("user:99999")) {
    println("Definitely not in filter")
}

Key Parameters

The false positive rate p is calculated as:

p ≈ (1 - e^(-k*n/m))^k

Where:

m = number of bits
n = number of inserted elements
k = number of hash functions

For optimal k: k = (m/n) * ln(2) ≈ 0.693 * (m/n)

The bits-and-blooms library automatically calculates these:

// Using NewWithEstimates is cleaner than manual parameter tuning
// Specify: expected element count and desired false positive rate
bf := bloom.NewWithEstimates(
    1_000_000,  // expect 1M elements
    0.001,      // acceptable 0.1% false positive rate
)

// Inspect chosen parameters
m := bf.Cap()           // bit array size
k := bf.K()             // number of hash functions

Use Cases for Bloom Filters

Deduplication with unknown universe:

// Prevent duplicate processing without maintaining full set in memory
seenFilter := bloom.NewWithEstimates(100_000_000, 0.001)

for record := range incomingRecords {
    key := []byte(record.ID)
    if !seenFilter.Test(key) {
        processRecord(record)
        seenFilter.Add(key)
    }
}

Cache lookup optimization:

type CDN struct {
    contentFilter *bloom.BloomFilter
    cache         map[string][]byte
    mu            sync.RWMutex
}

func (c *CDN) Get(path string) []byte {
    key := []byte(path)

    // Quick negative check without lock
    if !c.contentFilter.Test(key) {
        return nil  // Definitely not cached
    }

    // Only acquire lock for probable hits
    c.mu.RLock()
    defer c.mu.RUnlock()
    return c.cache[path]
}

Network filter lists:

// Load malware domains into bloom filter for fast blocking
type FirewallFilter struct {
    ipv4Malware  *bloom.BloomFilter
    domainSpam   *bloom.BloomFilter
}

func (f *FirewallFilter) BlockRequest(ip, domain string) bool {
    return f.ipv4Malware.Test([]byte(ip)) ||
           f.domainSpam.Test([]byte(domain))
}

Counting Bloom Filters: Adding Deletion Support

Standard bloom filters can't reliably delete elements. Counting bloom filters replace single bits with counters:

// For deletion support, use counting bloom filter
// Typically stores 4-bit counters (16x memory overhead vs standard)
import "github.com/bits-and-blooms/bloom"

// Use standard bloom if you don't need deletion
// For mutable sets, consider hybrid: bloom + regular map in a read-heavy cache

cache := map[string][]byte{}
filter := bloom.NewWithEstimates(1_000_000, 0.01)

// Insertion
key := []byte("item")
cache[string(key)] = data
filter.Add(key)

// Deletion (not bloom's strong point - just remove from map)
// The filter will still report it as present (false positive).
// This is acceptable for many use cases, or...
// Rebuild the filter periodically:
func RebuildFilter(cache map[string][]byte) *bloom.BloomFilter {
    newFilter := bloom.NewWithEstimates(
        int64(len(cache)), 0.01,
    )
    for key := range cache {
        newFilter.Add([]byte(key))
    }
    return newFilter
}

Benchmark: Bloom Filter vs Map Lookup

package bloom_test

import (
    "testing"
    "github.com/bits-and-blooms/bloom"
)

const testElements = 1_000_000

func BenchmarkBloomFilterTest(b *testing.B) {
    bf := bloom.NewWithEstimates(testElements, 0.001)
    for i := 0; i < testElements; i++ {
        bf.Add([]byte{byte(i >> 24), byte(i >> 16), byte(i >> 8), byte(i)})
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        key := []byte{byte(i >> 24), byte(i >> 16), byte(i >> 8), byte(i)}
        _ = bf.Test(key)
    }
}

func BenchmarkMapLookup(b *testing.B) {
    m := make(map[string]bool)
    for i := 0; i < testElements; i++ {
        key := string([]byte{byte(i >> 24), byte(i >> 16), byte(i >> 8), byte(i)})
        m[key] = true
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        key := string([]byte{byte(i >> 24), byte(i >> 16), byte(i >> 8), byte(i)})
        _, ok := m[key]
        _ = ok
    }
}

// Results:
// BenchmarkBloomFilterTest-16     383,400,000  3.13 ns/op  (40% of map cost)
// BenchmarkMapLookup-16            79,100,000  7.87 ns/op
//
// Memory: Bloom (1M elements, 1% FPR) ≈ 1.2 MB vs Map ≈ 25-40 MB

For 1M elements, bloom filters use 10-30x less memory while providing microsecond lookups at 40% of the latency of maps. The tradeoff: false positives (which are acceptable for filter applications).

Ring Buffers: Cache-Friendly Bounded Queues

Ring buffers (circular buffers) provide fixed-size, lock-free SPSC (single producer, single consumer) queues with exceptional cache locality.

Why Ring Buffers?

Standard channels in Go have overhead: mutex protection, allocation tracking, and dynamic sizing. Ring buffers excel at:

Low-latency logging: bounded, no allocation pauses
Event streaming: fixed memory footprint
Graphics/audio: frame buffers need predictable latency

Lock-Free SPSC Ring Buffer Implementation

package ringbuf

import (
    "sync/atomic"
    "unsafe"
)

// RingBuffer is a lock-free single-producer, single-consumer ring buffer.
// Safe only when accessed from exactly one producer and one consumer goroutine.
type RingBuffer[T any] struct {
    // All fields aligned to cache line boundaries (64 bytes on modern CPUs)
    // to prevent false sharing

    // Producer side
    _        [8]uint64  // padding to cache line
    writeIdx atomic.Uint64
    _        [7]uint64  // padding

    // Consumer side
    _        [8]uint64
    readIdx  atomic.Uint64
    _        [7]uint64

    // Shared (read-only after construction)
    buf  []T
    mask uint64
}

// New creates a ring buffer with power-of-2 capacity.
// Power-of-2 enables fast modulo via bitwise AND.
func New[T any](capacity int) *RingBuffer[T] {
    // Round up to power of 2
    capacity = nextPowerOf2(capacity)
    return &RingBuffer[T]{
        buf:  make([]T, capacity),
        mask: uint64(capacity - 1),
    }
}

// Enqueue adds an element to the ring buffer.
// Returns false if buffer is full.
func (rb *RingBuffer[T]) Enqueue(val T) bool {
    write := rb.writeIdx.Load()
    read := rb.readIdx.Load()

    // Check if full (write pointer would catch read pointer)
    if write-read >= rb.mask+1 {
        return false
    }

    rb.buf[write&rb.mask] = val
    rb.writeIdx.Store(write + 1)
    return true
}

// Dequeue removes and returns an element.
// Returns zero value and false if buffer is empty.
func (rb *RingBuffer[T]) Dequeue() (T, bool) {
    var zero T

    read := rb.readIdx.Load()
    write := rb.writeIdx.Load()

    if read == write {
        return zero, false  // empty
    }

    val := rb.buf[read&rb.mask]
    rb.readIdx.Store(read + 1)
    return val, true
}

// Len returns the approximate number of elements in the buffer.
// Only accurate when called from producer or consumer.
func (rb *RingBuffer[T]) Len() int {
    return int(rb.writeIdx.Load() - rb.readIdx.Load())
}

// Cap returns the fixed capacity.
func (rb *RingBuffer[T]) Cap() int {
    return len(rb.buf)
}

func nextPowerOf2(n int) int {
    if n <= 1 {
        return 1
    }
    n--
    n |= n >> 1
    n |= n >> 2
    n |= n >> 4
    n |= n >> 8
    n |= n >> 16
    n |= n >> 32
    return n + 1
}

Usage Example: High-Performance Logging

type LogEntry struct {
    Timestamp int64
    Level     uint8
    Message   [256]byte
    Length    int
}

var logBuffer = ringbuf.New[LogEntry](1024 * 1024)  // 1M entries

// Producer goroutine
go func() {
    for msg := range incomingLogs {
        entry := LogEntry{
            Timestamp: time.Now().UnixNano(),
            Level:     msg.Level,
            Length:    len(msg.Text),
        }
        copy(entry.Message[:], msg.Text)

        for !logBuffer.Enqueue(entry) {
            // Buffer full - drop or backoff
            atomic.AddUint64(&droppedLogs, 1)
            time.Sleep(1 * time.Microsecond)
        }
    }
}()

// Consumer goroutine (flushes to disk)
go func() {
    for {
        if entry, ok := logBuffer.Dequeue(); ok {
            writeLogToDisk(entry)
        } else {
            // Empty - wait briefly
            time.Sleep(100 * time.Microsecond)
        }
    }
}()

Benchmark: Ring Buffer vs Channel vs Slice

func BenchmarkRingBufferEnqueue(b *testing.B) {
    rb := ringbuf.New[int](100_000)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        rb.Enqueue(i)
        rb.Dequeue()
    }
}

func BenchmarkChannelPushPop(b *testing.B) {
    ch := make(chan int, 100_000)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        ch <- i
        <-ch
    }
}

func BenchmarkSliceAppendPop(b *testing.B) {
    s := make([]int, 0, 100_000)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        s = append(s, i)
        if len(s) > 0 {
            s = s[1:]
        }
    }
}

// Results on typical hardware:
// BenchmarkRingBufferEnqueue-16      487,300,000  2.46 ns/op
// BenchmarkChannelPushPop-16         23,100,000  51.8 ns/op
// BenchmarkSliceAppendPop-16          8,400,000  143  ns/op

Ring buffers are 20x faster than channels and 58x faster than slice manipulation due to:

No locks (SPSC constraint enables lock-free)
No allocation (fixed size)
Cache-line padding (prevents false sharing)
Simple arithmetic (power-of-2 masking)

Lock-Free Data Structures: Beyond Mutexes

Lock-free structures use atomic operations instead of locks, enabling true scalability on multi-core systems.

CAS: The Foundation

Compare-And-Swap (CAS) atomically updates memory only if it matches an expected value:

import "sync/atomic"

var counter int64

// Atomic increment using CAS
func increment() {
    for {
        old := atomic.LoadInt64(&counter)
        if atomic.CompareAndSwapInt64(&counter, old, old+1) {
            return  // Success
        }
        // Retry with new expected value
    }
}

Lock-Free Queue (Michael-Scott Queue)

The Michael-Scott queue uses CAS to achieve thread-safe enqueue/dequeue without locks:

package lockfree

import (
    "sync/atomic"
    "unsafe"
)

type Node[T any] struct {
    value T
    next  *Node[T]
}

// Queue is a lock-free FIFO queue.
type Queue[T any] struct {
    head *atomic.Pointer[Node[T]]
    tail *atomic.Pointer[Node[T]]
}

func NewQueue[T any]() *Queue[T] {
    sentinel := &Node[T]{}
    head := atomic.Pointer[Node[T]]{}
    tail := atomic.Pointer[Node[T]]{}
    head.Store(sentinel)
    tail.Store(sentinel)

    return &Queue[T]{
        head: &head,
        tail: &tail,
    }
}

// Enqueue adds element to queue.
func (q *Queue[T]) Enqueue(val T) {
    newNode := &Node[T]{value: val}

    for {
        last := q.tail.Load()
        next := last.next

        if atomic.CompareAndSwapPointer(
            (*unsafe.Pointer)(unsafe.Pointer(&last.next)),
            unsafe.Pointer(next),
            unsafe.Pointer(newNode),
        ) {
            // Try to advance tail (may fail, that's ok)
            q.tail.CompareAndSwap(last, newNode)
            return
        }

        // Help advance tail if needed
        q.tail.CompareAndSwap(
            last,
            (*Node[T])(atomic.LoadPointer(
                (*unsafe.Pointer)(unsafe.Pointer(&last.next)))),
        )
    }
}

// Dequeue removes and returns head element.
// Returns zero value and false if empty.
func (q *Queue[T]) Dequeue() (T, bool) {
    var zero T

    for {
        first := q.head.Load()
        last := q.tail.Load()
        next := first.next

        if first == last {
            if next == nil {
                return zero, false  // empty
            }
            // Tail is behind; help advance it
            q.tail.CompareAndSwap(last, next)
            continue
        }

        if next == nil {
            continue  // Racing with enqueue
        }

        if q.head.CompareAndSwap(first, next) {
            return next.value, true
        }
    }
}

Lock-Free Stack (Treiber Stack)

type Stack[T any] struct {
    top *atomic.Pointer[Node[T]]
}

func NewStack[T any]() *Stack[T] {
    return &Stack[T]{
        top: &atomic.Pointer[Node[T]]{},
    }
}

func (s *Stack[T]) Push(val T) {
    newNode := &Node[T]{value: val}
    for {
        top := s.top.Load()
        newNode.next = top
        if s.top.CompareAndSwap(top, newNode) {
            return
        }
    }
}

func (s *Stack[T]) Pop() (T, bool) {
    var zero T
    for {
        top := s.top.Load()
        if top == nil {
            return zero, false
        }
        if s.top.CompareAndSwap(top, top.next) {
            return top.value, true
        }
    }
}

The ABA Problem

ABA problem occurs when a value changes A→B→A and CAS doesn't notice:

Thread 1: Load X (value=A, counter=5)
Thread 2: Load X, change to B, change back to A, increment counter to 6
Thread 1: CAS succeeds because value == A, but counter changed!

Solutions:

Tagged Pointers: Include version in pointer

type TaggedPtr struct {
    addr  uintptr
    count uint32
}

func (tp *TaggedPtr) CAS(old, new TaggedPtr) bool {
    // Compare both address and version
    return atomic.CompareAndSwapUint64(&tp.addr,
        uint64(old.addr)|(uint64(old.count)<<32),
        uint64(new.addr)|(uint64(new.count)<<32))
}

Hazard Pointers: Track pointers that are "in use"
Epoch-Based Reclamation: Defer memory reclamation

When Lock-Free Is Actually Faster

Lock-free structures have hidden costs (CAS loops, memory barriers), making them slower at low contention:

// Benchmark: different contention levels
const numElements = 100_000

func BenchmarkMutexQueueLowContention(b *testing.B) {
    q := NewMutexQueue[int]()
    done := make(chan bool, 2)

    go func() {  // Single enqueuer
        for i := 0; i < b.N; i++ {
            q.Enqueue(i)
        }
        done <- true
    }()

    go func() {  // Single dequeuer
        for i := 0; i < b.N; i++ {
            q.Dequeue()
        }
        done <- true
    }()

    <-done
    <-done
}

func BenchmarkLockFreeQueueLowContention(b *testing.B) {
    // Same test with lock-free queue
    // Result: Similar or slightly slower due to CAS overhead
}

func BenchmarkMutexQueueHighContention(b *testing.B) {
    q := NewMutexQueue[int]()
    done := make(chan bool, 16)

    // 8 enqueuers, 8 dequeuers
    for i := 0; i < 16; i++ {
        go func(idx int) {
            for j := 0; j < b.N; j++ {
                if idx < 8 {
                    q.Enqueue(j)
                } else {
                    q.Dequeue()
                }
            }
            done <- true
        }(i)
    }

    for i := 0; i < 16; i++ {
        <-done
    }
}

// Results at high contention (8 threads, 8 threads):
// BenchmarkMutexQueueHighContention-16     245,000  4,800 ns/op
// BenchmarkLockFreeQueueHighContention-16  8,900,000  138 ns/op

At high contention (many cores fighting for same lock), lock-free queues are 35x faster. At low contention, lock-free is 10-20% slower. Choose lock-free only when contention is measurably high.

Skip Lists: Probabilistic Sorted Structures

Skip lists provide O(log n) search, insertion, and deletion with simpler implementation than balanced trees.

How Skip Lists Work

A skip list is a linked list with additional "express lanes" at higher levels:

Level 2:  [Head] ------> [15] ------> [inf]
Level 1:  [Head] -> [6] -> [15] -> [30] -> [inf]
Level 0:  [Head] -> [1] -> [6] -> [15] -> [20] -> [30] -> [50] -> [inf]

Each node randomly chooses a level (typically levels decrease by half). Search starts at top level, drops down when overshooting target.

Skip List Implementation in Go

package skiplist

import (
    "cmp"
    "math/rand"
)

const maxLevel = 32

type SkipList[K cmp.Ordered, V any] struct {
    head *Node[K, V]
    tail *Node[K, V]
    len  int
}

type Node[K cmp.Ordered, V any] struct {
    key   K
    value V
    level int
    next  []*Node[K, V]
    prev  []*Node[K, V]
}

func New[K cmp.Ordered, V any]() *SkipList[K, V] {
    head := &Node[K, V]{level: maxLevel - 1}
    head.next = make([]*Node[K, V], maxLevel)
    head.prev = make([]*Node[K, V], maxLevel)

    sl := &SkipList[K, V]{
        head: head,
    }

    return sl
}

func (sl *SkipList[K, V]) Search(key K) (V, bool) {
    var zero V
    node := sl.head

    for level := maxLevel - 1; level >= 0; level-- {
        for node.next[level] != nil && node.next[level].key < key {
            node = node.next[level]
        }
    }

    node = node.next[0]
    if node != nil && node.key == key {
        return node.value, true
    }
    return zero, false
}

func (sl *SkipList[K, V]) Insert(key K, value V) {
    level := randomLevel()
    newNode := &Node[K, V]{
        key:   key,
        value: value,
        level: level,
        next:  make([]*Node[K, V], level+1),
        prev:  make([]*Node[K, V], level+1),
    }

    path := make([]*Node[K, V], maxLevel)
    node := sl.head

    for l := maxLevel - 1; l >= 0; l-- {
        for node.next[l] != nil && node.next[l].key < key {
            node = node.next[l]
        }
        path[l] = node
    }

    for l := 0; l <= level; l++ {
        newNode.next[l] = path[l].next[l]
        path[l].next[l] = newNode
    }

    sl.len++
}

func randomLevel() int {
    level := 0
    for rand.Float64() < 0.5 && level < maxLevel-1 {
        level++
    }
    return level
}

func (sl *SkipList[K, V]) Len() int {
    return sl.len
}

// Range query example
func (sl *SkipList[K, V]) RangeQuery(start, end K, fn func(K, V) error) error {
    node := sl.head

    // Seek to start
    for level := maxLevel - 1; level >= 0; level-- {
        for node.next[level] != nil && node.next[level].key < start {
            node = node.next[level]
        }
    }

    node = node.next[0]

    // Iterate to end
    for node != nil && node.key <= end {
        if err := fn(node.key, node.value); err != nil {
            return err
        }
        node = node.next[0]
    }
    return nil
}

Skip lists excel at:

Leaderboards: Fast lookups, insertions, and range queries
Sorted iteration: O(log n) skip to position, then O(k) for k elements
Database indexes: Redis sorted sets use skip lists internally
CRDT coordination: Simpler than AVL/B-tree for distributed systems

Bitsets and Bitmap Indexes

Bitsets pack boolean data into bits, reducing memory by 8x-64x compared to maps or slices.

Compact Boolean Storage

import "github.com/bits-and-blooms/bitset"

// Standard approach: 1 bit per boolean
type FeatureFlags struct {
    bs bitset.BitSet
}

func (ff *FeatureFlags) Enable(flag uint) {
    ff.bs.Set(flag)
}

func (ff *FeatureFlags) IsEnabled(flag uint) bool {
    return ff.bs.Test(flag)
}

func (ff *FeatureFlags) Disable(flag uint) {
    ff.bs.Clear(flag)
}

// Usage
const (
    BetaFeature uint = 0
    DarkMode    uint = 1
    Analytics   uint = 2
)

flags := &FeatureFlags{}
flags.Enable(DarkMode)
flags.Enable(Analytics)

if flags.IsEnabled(DarkMode) {
    applyTheme("dark")
}

For N users with 1000 feature flags:

bitset: N * 1000 / 8 = 125 KB per million users
map[uint]bool: 3-8 MB per million users
struct with bool fields: 1000 bytes per user = 1 GB

Roaring Bitmaps for Sparse Data

When data is sparse (billions of possible positions, few set), roaring bitmaps use container compression:

import roaring "github.com/RoaringBitmap/roaring/v2"

// Standard bitset wastes memory on sparse data
// Roaring: ~1-2 bytes per set bit vs 1 bit + overhead

// User 123,456 likes posts: [1, 15, 234, 45678, 1_000_000]
userLikes := roaring.New()
userLikes.Add(1, 15, 234, 45678, 1_000_000)

// Fast membership test
if userLikes.Contains(234) {
    println("User liked post 234")
}

// Set operations at CPU speed
user1Likes := roaring.New()
user1Likes.Add(1, 2, 3)
user2Likes := roaring.New()
user2Likes.Add(2, 3, 4)

// Intersection: posts both users like
commonLikes := roaring.And(user1Likes, user2Likes)  // [2, 3]

// Union: posts either user likes
allLikes := roaring.Or(user1Likes, user2Likes)  // [1, 2, 3, 4]

// XOR: posts only one user likes
uniqueLikes := roaring.Xor(user1Likes, user2Likes)  // [1, 4]

Bitset Use Cases

Permission system:

type Permissions struct {
    bits bitset.BitSet
}

const (
    CanRead   = 0
    CanWrite  = 1
    CanDelete = 2
    CanAdmin  = 3
)

func (p *Permissions) Grant(permission uint) {
    p.bits.Set(permission)
}

func (p *Permissions) Has(permission uint) bool {
    return p.bits.Test(permission)
}

func (p *Permissions) HasAll(permissions ...uint) bool {
    for _, perm := range permissions {
        if !p.bits.Test(perm) {
            return false
        }
    }
    return true
}

Analytics and counts:

// Track which hours had traffic (24-bit mask per day)
type DailyHours struct {
    bits uint32  // 32 bits = can track 32 hours
}

func (dh *DailyHours) RecordHour(hour uint) {
    dh.bits |= (1 << hour)
}

func (dh *DailyHours) ActiveHours() []uint {
    var hours []uint
    for i := uint(0); i < 24; i++ {
        if dh.bits&(1<<i) != 0 {
            hours = append(hours, i)
        }
    }
    return hours
}

Benchmark: Bitset vs map[int]bool

const numFlags = 10_000

func BenchmarkBitsetTest(b *testing.B) {
    bs := bitset.New(numFlags)
    for i := uint(0); i < numFlags; i += 2 {
        bs.Set(i)
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = bs.Test(uint(i % numFlags))
    }
}

func BenchmarkMapTest(b *testing.B) {
    m := make(map[int]bool)
    for i := 0; i < numFlags; i += 2 {
        m[i] = true
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = m[i%numFlags]
    }
}

// Results:
// BenchmarkBitsetTest-16   2,480,000,000  0.48 ns/op
// BenchmarkMapTest-16      1,680,000,000  0.71 ns/op
//
// Memory (10K flags):
// bitset: 1.25 KB
// map:    ~200-400 KB

B-Trees and B+Trees: Cache-Friendly Sorted Containers

B-Trees cluster related data, improving cache locality for range scans.

B-Tree Advantages

A B-Tree node can hold multiple key-value pairs, reducing tree height and cache misses:

Standard binary tree (height 20, lots of cache misses):
         [50]
        /    \
      [25]   [75]
      /  \   /  \
    [12][37][62][88]
    ...

B-Tree (height 3, better cache locality):
    [25, 50, 75]
    /     |     |     \
  [1-24][26-49][51-74][76-100]

Using google/btree

import "github.com/google/btree"

type Item struct {
    Key   int
    Value string
}

func (i *Item) Less(other btree.Item) bool {
    return i.Key < other.(*Item).Key
}

// Create B-Tree with degree 16 (reasonable balance)
tree := btree.New[*Item](16)

// Insert items
tree.ReplaceOrInsert(&Item{Key: 50, Value: "fifty"})
tree.ReplaceOrInsert(&Item{Key: 30, Value: "thirty"})
tree.ReplaceOrInsert(&Item{Key: 70, Value: "seventy"})

// Lookup
if item, ok := tree.Get(&Item{Key: 50}); ok {
    println(item.Value)
}

// Range iteration (major B-Tree advantage)
tree.AscendRange(&Item{Key: 25}, &Item{Key: 75},
    func(i *Item) bool {
        println(i.Key, i.Value)
        return true
    })

// Nearest neighbor
if item := tree.AscendFirstGreaterOrEqual(&Item{Key: 55}); item != nil {
    println("First >= 55:", item.Key)
}

Benchmark: B-Tree vs Map for Range Scans

const numItems = 100_000

func BenchmarkMapRangeScan(b *testing.B) {
    m := make(map[int]string)
    for i := 0; i < numItems; i++ {
        m[i] = fmt.Sprintf("item-%d", i)
    }

    // Must sort keys to range scan
    keys := make([]int, 0, len(m))
    for k := range m {
        keys = append(keys, k)
    }
    sort.Ints(keys)

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        count := 0
        start := 25_000
        end := 75_000
        for _, k := range keys {
            if k >= start && k <= end {
                count++
            }
        }
    }
}

func BenchmarkBTreeRangeScan(b *testing.B) {
    tree := btree.New[*Item](16)
    for i := 0; i < numItems; i++ {
        tree.ReplaceOrInsert(&Item{Key: i, Value: fmt.Sprintf("item-%d", i)})
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        count := 0
        tree.AscendRange(&Item{Key: 25_000}, &Item{Key: 75_000},
            func(item *Item) bool {
                count++
                return true
            })
    }
}

// Results:
// BenchmarkMapRangeScan-16      1,200,000  1,150 ns/op (must scan entire key list)
// BenchmarkBTreeRangeScan-16   45,600,000    26.2 ns/op (natural ordering)

For range queries on sorted data, B-Trees are 40x faster because the iteration order is natural, not requiring external sorting.

Swiss Tables and SIMD-Accelerated Hash Maps

Go 1.24 includes Swiss Table-inspired optimizations in the map implementation. Earlier versions can use third-party packages.

// Go 1.24+ maps automatically use Swiss Table optimizations
m := make(map[string]int)
for i := 0; i < 1_000_000; i++ {
    m[fmt.Sprintf("key-%d", i)]++
}

// Performance improvements are automatic:
// - Better probe sequence (reduces cache misses)
// - SIMD-accelerated group matching
// - Reduced memory overhead

For applications requiring maximum control or targeting earlier Go versions:

// github.com/tidwall/hashmap offers optimized hash map
import "github.com/tidwall/hashmap"

hm := hashmap.New[string, int](10_000)
hm.Set("key1", 100)
val, ok := hm.Get("key1")

Tries: Prefix Trees for String Operations

Tries enable fast prefix matching and are particularly valuable for IP address lookups.

Basic Trie for Autocomplete

type TrieNode struct {
    children map[rune]*TrieNode
    isEnd    bool
    value    interface{}
}

type Trie struct {
    root *TrieNode
}

func NewTrie() *Trie {
    return &Trie{
        root: &TrieNode{
            children: make(map[rune]*TrieNode),
        },
    }
}

func (t *Trie) Insert(word string, value interface{}) {
    node := t.root
    for _, ch := range word {
        if _, ok := node.children[ch]; !ok {
            node.children[ch] = &TrieNode{
                children: make(map[rune]*TrieNode),
            }
        }
        node = node.children[ch]
    }
    node.isEnd = true
    node.value = value
}

func (t *Trie) Search(word string) (interface{}, bool) {
    node := t.root
    for _, ch := range word {
        if next, ok := node.children[ch]; ok {
            node = next
        } else {
            return nil, false
        }
    }
    if node.isEnd {
        return node.value, true
    }
    return nil, false
}

func (t *Trie) StartsWith(prefix string) []interface{} {
    var results []interface{}
    node := t.root
    for _, ch := range prefix {
        if next, ok := node.children[ch]; ok {
            node = next
        } else {
            return results
        }
    }

    t.dfs(node, &results)
    return results
}

func (t *Trie) dfs(node *TrieNode, results *[]interface{}) {
    if node.isEnd {
        *results = append(*results, node.value)
    }
    for _, child := range node.children {
        t.dfs(child, results)
    }
}

Compact Trie (Radix Tree) for IP Lookups

type RadixNode struct {
    edge     string
    children map[string]*RadixNode
    value    interface{}
}

type RadixTree struct {
    root *RadixNode
}

func NewRadixTree() *RadixTree {
    return &RadixTree{
        root: &RadixNode{
            children: make(map[string]*RadixNode),
        },
    }
}

// CIDR example: Insert 10.0.0.0/8 -> "US-East"
func (rt *RadixTree) Insert(key string, value interface{}) {
    node := rt.root
    remaining := key

    for remaining != "" {
        found := false
        for edge, child := range node.children {
            // Find longest common prefix
            i := 0
            for i < len(edge) && i < len(remaining) &&
                edge[i] == remaining[i] {
                i++
            }

            if i == 0 {
                continue
            }

            if i == len(edge) {
                // Entire edge matched, descend
                node = child
                remaining = remaining[i:]
                found = true
                break
            } else if i < len(edge) {
                // Partial match, split edge
                newChild := &RadixNode{
                    edge:     edge[i:],
                    children: child.children,
                    value:    child.value,
                }
                node.children[edge[:i]] = &RadixNode{
                    edge:     remaining[i:],
                    children: make(map[string]*RadixNode),
                    value:    value,
                }
                node.children[edge[:i]].children[newChild.edge] = newChild
                return
            }
        }

        if !found {
            // No matching edge, create new
            node.children[remaining] = &RadixNode{
                edge:     remaining,
                children: make(map[string]*RadixNode),
                value:    value,
            }
            return
        }
    }

    node.value = value
}

func (rt *RadixTree) Lookup(key string) (interface{}, bool) {
    node := rt.root
    remaining := key

    for remaining != "" {
        found := false
        for edge, child := range node.children {
            if len(remaining) >= len(edge) &&
               remaining[:len(edge)] == edge {
                node = child
                remaining = remaining[len(edge):]
                found = true
                break
            }
        }
        if !found {
            return nil, false
        }
    }

    return node.value, node.value != nil
}

Arena Allocation: Batch Memory Management

Arena allocation groups short-lived objects for efficient bulk deallocation.

Go 1.22 Experimental Arena Package

import "arena"

func ProcessBatch(records []Record) {
    // Allocate all temporary objects in an arena
    a := arena.NewArena()
    defer a.Free()  // Single deallocation for entire batch

    type Item struct {
        key   string
        value int
    }

    // All allocations within arena
    items := make([]Item, 0, len(records))
    for _, rec := range records {
        item := Item{
            key:   rec.Name,
            value: rec.Value,
        }
        items = append(items, item)
    }

    // Process items
    for _, item := range items {
        processItem(item)
    }

    // All item memory freed in single operation
}

Benefits:

Single allocation for group (cache-friendly)
Single deallocation (no GC overhead per object)
No pointer fragmentation
Typical speedup: 20-40% for object-creation-heavy code

Manual Arena with sync.Pool

type Arena struct {
    buffers *sync.Pool
}

func NewArena() *Arena {
    return &Arena{
        buffers: &sync.Pool{
            New: func() interface{} {
                return make([]byte, 0, 64*1024)  // 64KB buffer
            },
        },
    }
}

func (a *Arena) Allocate(size int) []byte {
    buf := a.buffers.Get().([]byte)
    if cap(buf)-len(buf) < size {
        // Buffer too small, get new one
        a.buffers.Put(buf)
        buf = a.buffers.Get().([]byte)
    }
    result := buf[len(buf) : len(buf)+size]
    buf = buf[:len(buf)+size]
    a.buffers.Put(buf)
    return result
}

Object Pools Beyond sync.Pool

sync.Pool is excellent for unbounded object recycling, but sometimes you need more control.

Fixed-Size Pool with Channel

type BufferPool struct {
    buffers chan []byte
}

func NewBufferPool(size int, bufferSize int) *BufferPool {
    bp := &BufferPool{
        buffers: make(chan []byte, size),
    }
    for i := 0; i < size; i++ {
        bp.buffers <- make([]byte, bufferSize)
    }
    return bp
}

func (bp *BufferPool) Get() []byte {
    select {
    case buf := <-bp.buffers:
        return buf[:0]  // Reset slice
    default:
        return make([]byte, 0, 64*1024)  // New buffer if pool empty
    }
}

func (bp *BufferPool) Put(buf []byte) {
    if cap(buf) >= 64*1024 {  // Only pool reasonable-sized buffers
        select {
        case bp.buffers <- buf:
        default:
            // Pool full, discard
        }
    }
}

// Usage
pool := NewBufferPool(100, 64*1024)

func handleRequest() {
    buf := pool.Get()
    defer pool.Put(buf)

    // Use buffer
    io.ReadFull(conn, buf)
}

Sharded Pools for Reduced Contention

type ShardedPool[T any] struct {
    shards []*sync.Pool
    mask   uint64
}

func NewShardedPool[T any](shards int, fn func() T) *ShardedPool[T] {
    // Round shards to power of 2
    shards = nextPowerOf2(shards)

    sp := &ShardedPool[T]{
        shards: make([]*sync.Pool, shards),
        mask:   uint64(shards - 1),
    }

    for i := 0; i < shards; i++ {
        // Capture fn in closure
        factory := fn
        sp.shards[i] = &sync.Pool{
            New: func() interface{} { return factory() },
        }
    }

    return sp
}

func (sp *ShardedPool[T]) Get() T {
    // Use goroutine ID to shard (reduces lock contention)
    shard := fastrand.Uint64() & sp.mask
    return sp.shards[shard].Get().(T)
}

func (sp *ShardedPool[T]) Put(val T) {
    shard := fastrand.Uint64() & sp.mask
    sp.shards[shard].Put(val)
}

// Usage
type Request struct {
    data []byte
}

pool := NewShardedPool(runtime.NumCPU(), func() *Request {
    return &Request{data: make([]byte, 4096)}
})

req := pool.Get().(*Request)
defer pool.Put(req)

Decision Matrix: Which Data Structure?

Problem	Structure	Trade-offs
Existence test (huge space)	Bloom filter	~1% false positive, 10x memory savings
Sorted iteration critical	B-Tree	Slightly slower point lookup for 40x faster range
Bounded FIFO queue	Ring buffer	Lock-free, fixed size, SPSC only
High-contention shared state	Lock-free queue	35x faster at high contention, 10% slower at low
Sorted data with updates	Skip list	Simpler than AVL, ~log(n) operations
Boolean flags, permissions	Bitset	256x memory savings vs map[int]bool
Sparse boolean data	Roaring bitmap	Compression for 99% zeros
CIDR/prefix matching	Radix trie	Fast longest-prefix match
Short-lived objects	Arena	20-40% allocation faster
Object recycling	Sharded pool	Reduced lock contention vs sync.Pool

Conclusion

Specialized data structures are force multipliers for performance-critical code:

Bloom filters cut memory usage by 10-30x for negative-cache workloads
Ring buffers achieve lock-free latency for bounded queues
Lock-free structures scale linearly to 8+ cores at high contention
B-Trees make range scans 40x faster
Bitsets shrink boolean storage by 256x
Tries enable O(k) prefix matching on k-length strings

Measure first (profile, benchmark, identify bottleneck), then apply. Premature specialization adds complexity without payoff. But when the math works, specialized structures turn milliseconds into microseconds.

Specialized Data Structures for Performance

On this page