Go Performance Guide
Compiler & Runtime

CGO Performance

Understand CGO overhead, when to use C libraries, and strategies for minimizing the performance cost of crossing language boundaries.

What is CGO?

CGO is Go's foreign function interface (FFI) for calling C code and vice versa. It enables integrating high-performance C libraries directly into Go applications.

package main

/*
static int addFromC(int a, int b) {
    return a + b;
}
*/
import "C"

import "fmt"

func main() {
    result := C.addFromC(5, 3)
    fmt.Println(result)  // 8
}

// The reverse direction also works: a Go function marked with an
// //export comment becomes callable from C (the preamble of that file
// may then contain only declarations, not definitions).

While CGO is powerful, it carries a significant performance cost that must be understood when designing your application's architecture.

The Overhead Problem: CGO Call Cost

A CGO call is dramatically slower than a native Go function call.

Typical costs:

  • Pure Go function call: 1-5 nanoseconds
  • CGO call: 100-200 nanoseconds

This is a 20-200x slowdown for the call itself, before any computation happens.
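The Go side of that comparison can be measured without any C toolchain. A minimal sketch (function names are illustrative) drives a benchmark programmatically with testing.Benchmark:

```go
package main

import (
	"fmt"
	"testing"
)

// add is the native baseline; //go:noinline keeps the compiler from
// optimizing the call away entirely.
//go:noinline
func add(a, b int) int { return a + b }

func main() {
	// testing.Benchmark runs a benchmark function outside `go test`,
	// which is enough to measure the native-call side of the comparison.
	res := testing.Benchmark(func(b *testing.B) {
		sink := 0
		for i := 0; i < b.N; i++ {
			sink += add(i, i+1)
		}
		_ = sink
	})
	fmt.Printf("%d ns/op (native Go call)\n", res.NsPerOp())
}
```

On typical hardware this reports a handful of nanoseconds at most, which is the baseline the CGO numbers above are compared against.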

Why is CGO So Slow?

The overhead comes from several sources:

1. Goroutine to OS Thread Transition

Goroutines are lightweight user-space threads multiplexed onto OS threads, but C code expects to run on a conventional OS thread with a stable stack. CGO must therefore lock the goroutine to its thread for the duration of the call:

Goroutine → Lock M (machine) →
Transition to OS thread → Execute C code →
Transition back to Goroutine → Unlock M

This involves context switching and synchronization overhead.

2. Stack Switching

Goroutines use small, growable stacks (the runtime resizes them by copying), while C expects a large fixed-size thread stack. CGO must switch between them:

func callC() {
    // Save goroutine stack state
    // Switch to OS thread's C stack (fixed size)
    // Call C function
    // Switch back to goroutine stack
    // Restore goroutine state
}

3. Signal Mask Manipulation

Go uses signals for runtime functions (GC, goroutine scheduling). Before calling C code, Go must:

  • Save the signal mask
  • Reset to a safe mask
  • Restore after C returns

4. Thread Pinning and Mutex Operations

The calling goroutine is pinned to an OS thread during the CGO call. This involves lock acquisition and release, adding synchronization overhead.
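The pinning step can be tried in isolation: runtime.LockOSThread gives the same goroutine-to-thread binding that the runtime applies around a C call. A pure-Go sketch with no actual C involved:

```go
package main

import (
	"fmt"
	"runtime"
)

// pinnedCall binds the calling goroutine to its current OS thread,
// mirroring what the runtime does for the duration of a CGO call.
func pinnedCall() string {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	// ... a C function would execute here, on this locked thread ...
	return "pinned"
}

func main() {
	fmt.Println(pinnedCall())
}
```

While locked, the thread cannot run any other goroutine, which is part of why heavy CGO traffic can starve the scheduler.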

5. Type Conversion and Memory Marshaling

Arguments and return values must be converted from Go types to C types:

// Each of these causes memory allocation and copying
s := "hello"
cStr := C.CString(s)  // Allocates C memory, copies string
defer C.free(unsafe.Pointer(cStr))

// Passing a Go slice to C requires an unsafe pointer conversion
goSlice := []int64{1, 2, 3}
cArray := (*C.longlong)(unsafe.Pointer(&goSlice[0]))  // no copy, but subject to the cgo pointer rules
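For intuition, here is a pure-Go sketch of what C.CString does: allocate length+1 bytes, copy the string, and NUL-terminate it. The real function allocates C memory with malloc; this illustration uses a Go slice:

```go
package main

import "fmt"

// cString mimics C.CString's allocation-and-copy behavior in Go memory
// (illustration only; the real call returns C malloc'd memory that the
// caller must C.free).
func cString(s string) []byte {
	buf := make([]byte, len(s)+1)
	copy(buf, s)
	// buf[len(s)] is already 0; C APIs rely on this NUL terminator.
	return buf
}

func main() {
	b := cString("hello")
	fmt.Println(len(b), b[len(b)-1]) // 6 0
}
```

The allocation and copy are why string-heavy CGO interfaces pay far more than the bare 125 ns call cost.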

Benchmark: CGO Overhead

Let's measure the actual cost:

package main

/*
#include <stdlib.h>

static int Add(int a, int b) {
    return a + b;
}
*/
import "C"

import (
    "testing"
    "unsafe"
)

// Note: import "C" is not allowed in _test.go files, so in a real project
// the cgo declarations live in a regular .go file in the same package.

func addGo(a, b int) int {
    return a + b
}

func BenchmarkCGOCall(b *testing.B) {
    for i := 0; i < b.N; i++ {
        C.Add(C.int(i), C.int(i+1))
    }
}

func BenchmarkGoCall(b *testing.B) {
    for i := 0; i < b.N; i++ {
        addGo(i, i+1)
    }
}

// More realistic: converting strings
func BenchmarkCGOStringConversion(b *testing.B) {
    input := "hello world"
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        cStr := C.CString(input)
        // Simulate C function call
        _ = cStr
        C.free(unsafe.Pointer(cStr))
    }
}

func BenchmarkGoStringUsage(b *testing.B) {
    input := "hello world"
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        // Native Go string operations
        _ = len(input)
        _ = input[0]
    }
}

Expected Results

BenchmarkCGOCall-8           10000000   125 ns/op    0 B/op   0 allocs/op
BenchmarkGoCall-8            1000000000  0.5 ns/op   0 B/op   0 allocs/op
BenchmarkCGOStringConversion-8  2000000   650 ns/op   16 B/op   1 allocs/op
BenchmarkGoStringUsage-8     1000000000  0.8 ns/op    0 B/op   0 allocs/op

The CGO overhead dominates - pure call cost is 250x, with memory conversion it's 800x!

When CGO is Worth the Cost

Despite the overhead, CGO is valuable in specific scenarios:

1. Leveraging Highly Optimized C Libraries

Some libraries are so efficient that the CGO overhead is negligible compared to the computation:

OpenSSL for Cryptography:

func encryptAES(key, plaintext []byte) ([]byte, error) {
    // OpenSSL call: 1-5 microseconds of C-side work
    // CGO overhead: ~0.125 microseconds, i.e. a few percent of the total
    // Pure Go crypto is correct but can be slower without tuned assembly
    // ... CGO call into OpenSSL goes here ...
    return nil, nil // placeholder
}

When the C function itself takes 1000+ nanoseconds, the ~125 ns CGO overhead is only about 10% of the total execution time.
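That break-even arithmetic is easy to check. A small helper (names are illustrative) computes the overhead share for any C-side work time:

```go
package main

import "fmt"

// overheadFraction returns the share of total call time spent on the CGO
// boundary crossing, given the C work time and per-call overhead in ns.
func overheadFraction(cWorkNs, cgoOverheadNs float64) float64 {
	return cgoOverheadNs / (cWorkNs + cgoOverheadNs)
}

func main() {
	// 1 µs of C work with 125 ns of call overhead:
	fmt.Printf("%.0f%%\n", 100*overheadFraction(1000, 125)) // prints 11%
}
```

Plugging in 10 ns of C work instead gives ~93%, which is the thin-wrapper regime discussed below.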

2. SQLite Database Engine

SQLite is a compact, well-tested database. Using it via CGO is often faster than pure Go implementations:

// Using CGO sqlite3
import "github.com/mattn/go-sqlite3"

db, _ := sql.Open("sqlite3", ":memory:")
db.Exec("INSERT INTO users VALUES (?)", 1)  // ~10 microseconds

// Using pure Go sqlite (modernc.org/sqlite)
db, _ := sql.Open("sqlite", "file:memdb")
db.Exec("INSERT INTO users VALUES (?)", 1)  // ~50 microseconds

The CGO overhead is on the order of a microsecond of the 10 µs total (each Exec makes several CGO calls), and the C library is fundamentally faster.

3. Image Processing and Codecs

libjpeg for JPEG decoding:

// C libjpeg: 10-50 milliseconds for large image
// Pure Go JPEG: 50-200 milliseconds
// CGO overhead: 0.125 milliseconds (0.25% of time)

The computation time dominates, making CGO overhead negligible.

4. Signal Processing and SIMD

C libraries with SIMD optimizations can provide 4-8x speedups on the raw computation:

// C library with AVX-512: 100 ns per operation
// Pure Go: 400 ns per operation
// Per-call CGO overhead: 125 ns
// One operation per call: 225 ns total, only ~1.8x faster than pure Go
// Batched calls amortize the overhead and approach the full 4x speedup

When to Avoid CGO

1. Thin Wrappers Around Simple Operations

// ✗ Bad - CGO overhead exceeds computation
//export GoMin
func GoMin(a, b C.int) C.int {
    if a < b {
        return a
    }
    return b
}

// Calling this 1 billion times:
// 125 ns CGO × 1B = 125 seconds just calling!
// Pure Go: 0.5 ns × 1B = 0.5 seconds

2. Calling CGO in Hot Loops

// ✗ Bad - one CGO call per element
func processMillions(data []int) {
    for _, v := range data {
        result := C.ProcessValue(C.int(v))  // millions of boundary crossings
        _ = result
    }
}

// ✓ Good - one CGO call for the whole slice
func processMillionsBatched(data []int) {
    values := make([]C.int, len(data))
    for i, v := range data {
        values[i] = C.int(v)
    }
    C.ProcessArray((*C.int)(unsafe.Pointer(&values[0])), C.int(len(values)))
}

3. Type Conversions in Critical Paths

// ✗ Bad - allocation for each call
for i := 0; i < 1000000; i++ {
    s := fmt.Sprintf("item_%d", i)
    cStr := C.CString(s)
    C.ProcessString(cStr)  // String allocation + CGO call!
    C.free(unsafe.Pointer(cStr))
}

// ✓ Better - allocate once
cStrs := make([]*C.char, 1000000)
for i := 0; i < 1000000; i++ {
    s := fmt.Sprintf("item_%d", i)
    cStrs[i] = C.CString(s)
}
C.ProcessStringArray((**C.char)(unsafe.Pointer(&cStrs[0])), C.int(len(cStrs)))
for _, cStr := range cStrs {
    C.free(unsafe.Pointer(cStr))
}

Strategy 1: Batching CGO Calls

The key to minimizing CGO overhead is amortizing the cost across multiple operations.

Example: Processing Data in Batches

// ✗ Poor - 1M CGO calls (and 1M string conversions) for 1M elements
for _, item := range items {
    cStr := C.CString(item)
    C.ProcessItem(cStr)
    C.free(unsafe.Pointer(cStr))
}

// ✓ Better - 1 CGO call for 1M elements
batch := make([]*C.char, 0, len(items))
for _, item := range items {
    batch = append(batch, C.CString(item))
}
defer func() {
    for _, p := range batch {
        C.free(unsafe.Pointer(p))
    }
}()

C.ProcessBatch((**C.char)(unsafe.Pointer(&batch[0])), C.int(len(batch)))
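The payoff is visible in a back-of-the-envelope cost model using the overhead figures from earlier (the constants are assumptions for illustration, not measurements):

```go
package main

import "fmt"

const (
	callOverheadNs = 125 // assumed fixed cost per CGO boundary crossing
	workPerItemNs  = 1   // assumed C-side work per element
)

// perItemCalls models one CGO call per element: overhead is paid n times.
func perItemCalls(n int) int { return n * (callOverheadNs + workPerItemNs) }

// oneBatchedCall models a single CGO call processing all n elements:
// overhead is paid once and amortized across the batch.
func oneBatchedCall(n int) int { return callOverheadNs + n*workPerItemNs }

func main() {
	n := 1_000_000
	fmt.Println(perItemCalls(n))   // 126000000 ns: overhead dominates
	fmt.Println(oneBatchedCall(n)) // 1000125 ns: overhead is negligible
}
```

Under these assumptions batching is ~126x cheaper; real gains depend on how much per-item work the C side does.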

Benchmark: Batching vs Per-Item

func BenchmarkUnbatched(b *testing.B) {
    items := make([]string, 1000)
    for i := range items {
        items[i] = fmt.Sprintf("item_%d", i)
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for _, item := range items {
            cStr := C.CString(item)
            C.ProcessString(cStr)
            C.free(unsafe.Pointer(cStr))  // free each string to avoid leaking C memory
        }
    }
}

func BenchmarkBatched(b *testing.B) {
    items := make([]string, 1000)
    for i := range items {
        items[i] = fmt.Sprintf("item_%d", i)
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        batch := make([]*C.char, len(items))
        for j, item := range items {
            batch[j] = C.CString(item)
        }
        C.ProcessBatch((**C.char)(unsafe.Pointer(&batch[0])), C.int(len(items)))
        for _, p := range batch {
            C.free(unsafe.Pointer(p))
        }
    }
}

In a typical run, batching reduces the per-item cost from roughly 2000 ns to 200 ns, a 10x improvement.

Strategy 2: Memory Management in CGO

Proper memory management is critical for both safety and performance.

Allocating Memory

// Go memory
goMem := make([]byte, 1024)

// C memory
cMem := C.malloc(1024)
defer C.free(cMem)

// Passing Go memory to C requires careful handling
// Go slices are managed by GC, C memory is not

The Cgo Pointer Rules

Go enforces strict pointer rules to prevent memory corruption:

  1. Go code may pass a Go pointer to C, as long as the memory it points to contains no Go pointers
  2. C code may not keep a copy of a Go pointer after the call returns
  3. A Go function called from C code may not return a Go pointer

// ✓ Safe - C uses the pointer only for the duration of the call
var goArray [100]int
C.ProcessArray((*C.int)(unsafe.Pointer(&goArray[0])))

// ✗ Unsafe - C retains the Go pointer after the call returns
var goVal C.int
C.StorePointer(unsafe.Pointer(&goVal))  // if C stores this for later use, the runtime's checks may panic or memory may be corrupted
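Since Go 1.21, runtime.Pinner offers a sanctioned way to relax the retention rule: a pinned object keeps a stable address and may be held by C code until it is unpinned. A pure-Go sketch of the pattern (the callback stands in for the C call):

```go
package main

import (
	"fmt"
	"runtime"
)

// pinWhileCalling pins a Go buffer for the duration of f, the way you
// would around a C call that needs the address to stay valid.
func pinWhileCalling(f func()) {
	var p runtime.Pinner
	buf := make([]byte, 64)
	p.Pin(&buf[0])
	defer p.Unpin()
	f() // stand-in for the C call that may hold &buf[0] until Unpin
}

func main() {
	pinWhileCalling(func() { fmt.Println("C call with pinned Go memory") })
}
```

Pinning has its own cost and keeps the object alive, so it suits short, bounded windows rather than long-lived C-side references.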

Avoiding Unnecessary Allocations

// ✗ Allocates and frees a new C string on every call
func process(s string) error {
    cStr := C.CString(s)
    defer C.free(unsafe.Pointer(cStr))
    return checkError(C.DoSomething(cStr))
}

// ✓ Better - reuse one C buffer across many calls
const bufSize = 4096

func processMany(strs []string) error {
    buf := (*C.char)(C.malloc(bufSize))
    defer C.free(unsafe.Pointer(buf))
    dst := unsafe.Slice((*byte)(unsafe.Pointer(buf)), bufSize)
    for _, s := range strs {
        if len(s)+1 > bufSize {
            return errTooLong // hypothetical sentinel error
        }
        copy(dst, s)
        dst[len(s)] = 0 // NUL-terminate for C
        if err := checkError(C.DoSomething(buf)); err != nil {
            return err
        }
    }
    return nil
}

Strategy 3: Build Configuration and Cross-Compilation

CGO comes with significant build complexity.

Build Time Overhead

# Pure Go - fast
go build
# 0.5 seconds

# CGO - slow
go build  # Requires C compiler
# 5-10 seconds

Cross-Compilation

# Pure Go
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build

# CGO - cannot cross-compile without extra tooling
GOOS=linux GOARCH=amd64 go build  # fails unless CC points at a Linux-targeting C cross-compiler

Conditional CGO

//go:build cgo
// +build cgo

package main

import "C"

// Only compiled when cgo is enabled. The // +build line is the legacy
// form of the constraint, needed only for Go versions before 1.17.
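A matching fallback file keeps the package building when CGO is disabled; a sketch with hypothetical names:

```go
//go:build !cgo

package main

// Pure Go fallback compiled when CGO is disabled (CGO_ENABLED=0),
// so the package still builds without a C toolchain.

func addNumbers(a, b int) int {
    return a + b
}
```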

Alternative 1: Pure Go Implementations

Many C libraries have pure Go ports:

SQLite Example

// Using CGO (mattn/go-sqlite3)
import _ "github.com/mattn/go-sqlite3"
db, _ := sql.Open("sqlite3", "file.db")

// Using pure Go (modernc.org/sqlite)
import _ "modernc.org/sqlite"
db, _ := sql.Open("sqlite", "file.db")

Performance comparison:

  • CGO sqlite3: 10 µs per INSERT
  • Pure Go sqlite: 50 µs per INSERT
  • Overhead acceptable for most use cases

Advantages of pure Go:

  • Cross-platform compatibility
  • No CGO overhead
  • Easier deployment
  • Simpler security model

Alternative 2: WASM (WebAssembly)

For some use cases, compiling C code to WASM is viable:

# Compile C to WebAssembly with Emscripten (or clang's wasm32 target)
emcc mylib.c -o mylib.wasm

The resulting module can then be executed from Go with a WebAssembly runtime such as wazero, which is written in pure Go and needs no CGO.

This provides isolation and better error handling than direct CGO, at the cost of WASM's own execution overhead.

Alternative 3: Shared Memory and mmap

For large datasets, share memory via files instead of function calls:

// Instead of processing in C via CGO,
// write the data to a file and let a C program process it separately
data := []byte{...}
os.WriteFile("input.bin", data, 0644)

// Call the C program as an external process
cmd := exec.Command("./c_processor", "input.bin", "output.bin")
cmd.Run()

// Read the results
results, _ := os.ReadFile("output.bin")

Eliminates CGO overhead entirely, trades off latency.

Alternative 4: gRPC to Sidecar

For complex integrations, run C code in a separate process:

// Go service connecting to a C/C++ sidecar on port 5001
package main

import (
    "context"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func main() {
    conn, _ := grpc.Dial("localhost:5001",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    defer conn.Close()
    client := pb.NewProcessorClient(conn)  // pb: your generated protobuf package
    result, _ := client.Process(context.Background(), &pb.Request{Data: data})
    _ = result
}

// The C/C++ service on port 5001 handles the gRPC requests
// Benefits: isolation, process separation, language-agnostic

Tradeoffs:

  • Network latency (microseconds)
  • Process isolation and safety
  • Easy to replace or scale C component

Using purego for CGO-Free C Calls

The purego library allows calling C functions without CGO:

import "github.com/ebitengine/purego"

var someFunc func(a, b int) int

func init() {
    // Load the shared library and bind the symbol (Linux/macOS shown)
    lib, err := purego.Dlopen("libmylib.so", purego.RTLD_NOW|purego.RTLD_GLOBAL)
    if err != nil {
        panic(err)
    }
    purego.RegisterLibFunc(&someFunc, lib, "someFunc")
}

func main() {
    result := someFunc(5, 3)  // calls into the C library without CGO
    _ = result
}

Advantages:

  • No CGO build overhead or C toolchain requirement
  • Cross-compilation friendly

Disadvantages:

  • Manual function registration
  • Less type safety than CGO
  • Still pays a per-call overhead comparable to CGO's

Real-World Benchmark: sqlite3 via CGO vs Pure Go

package main

import (
    "database/sql"
    "testing"

    _ "github.com/mattn/go-sqlite3"  // CGO
    // _ "modernc.org/sqlite"  // Pure Go
)

func BenchmarkInsertCGO(b *testing.B) {
    db, _ := sql.Open("sqlite3", ":memory:")
    db.Exec("CREATE TABLE test (id INTEGER PRIMARY KEY, value TEXT)")

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        db.Exec("INSERT INTO test (value) VALUES (?)", "data")
    }
}

func BenchmarkInsertPureGo(b *testing.B) {
    db, _ := sql.Open("sqlite", "file::memory:?cache=shared")
    db.Exec("CREATE TABLE test (id INTEGER PRIMARY KEY, value TEXT)")

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        db.Exec("INSERT INTO test (value) VALUES (?)", "data")
    }
}

Results:

BenchmarkInsertCGO-8      150000    6500 ns/op    ← 1 µs CGO overhead
BenchmarkInsertPureGo-8    30000   35000 ns/op    ← Pure Go slower but acceptable

Guidelines for Using CGO

Tip: Profile your code with CGO disabled to understand the actual impact. Use CGO_ENABLED=0 to disable CGO entirely.

  1. Only use CGO for substantial computations - the C-side work should take well over the ~100-200 ns call overhead, ideally microseconds

  2. Batch operations - make one CGO call instead of many

  3. Avoid CGO in hot loops - measure impact with profiling

  4. Consider pure Go alternatives - they're often faster than expected

  5. Profile before optimizing - use go tool pprof to find actual bottlenecks

  6. Test cross-compilation needs - CGO breaks ease of deployment

  7. Document the tradeoff - explain why CGO was chosen over pure Go

Summary

CGO enables leveraging high-performance C libraries, but at a significant cost (100-200 ns per call). This overhead is worthwhile only when:

  • The C computation takes microseconds or more
  • Operations can be batched to amortize call overhead
  • No pure Go equivalent exists or is significantly slower
  • Deployment and cross-compilation complexity is acceptable

For most Go applications, pure Go implementations are preferable. Reserve CGO for genuinely compute-intensive operations where the C library provides substantial (10x+) speedup over pure Go alternatives.

The key principle: Make the CGO calls worth the overhead by doing significant work in C, not thin wrappers around simple operations.
