OS-Level Tuning for Go Applications

Linux kernel parameters, TCP tuning, file descriptor limits, huge pages, NUMA awareness, io_uring, and system-level optimizations for high-performance Go services.

Operating system configuration is often overlooked in Go performance optimization, yet it has profound effects on application behavior. A misconfigured Linux system can create artificial bottlenecks that no amount of Go code optimization can overcome. This article explores the critical OS-level parameters that affect Go service performance and provides practical tuning strategies.

File Descriptor Limits

File descriptors are the foundation of I/O in Unix-like systems. Every network connection, file, pipe, and socket consumes a file descriptor. Go applications, with their goroutine-per-connection model, are particularly susceptible to file descriptor exhaustion.

Understanding FD Limits

The Linux kernel maintains multiple file descriptor limits:

User-level soft limit (ulimit -n)

Maximum number of file descriptors a single process can open
Can be increased by the process itself (up to the hard limit)
Default is typically 1024, dangerously low for modern services

User-level hard limit

Maximum value the soft limit can be increased to without root
Set in /etc/security/limits.conf or PAM configuration
Requires root to exceed

System-wide limit (/proc/sys/fs/file-max)

Maximum number of file descriptors across the entire system
Default typically 10% of available RAM
Affects all processes combined

Calculating Required FD Limits

For a Go HTTP server:

// Theoretical maximum connections per Go process
maxConnections := goMaxProcs * goroutinesPerCore * connectionBufferRatio

// Add overhead for:
// - Server listen sockets
// - DNS resolver sockets
// - Logging file descriptors
// - Database connections
// - Other I/O resources

recommendedLimit := maxConnections * 1.5 // Safety margin

A production service handling 100k concurrent connections needs:

Each connection ≈ 1 FD (HTTP requests)
Database pool: 50-200 connections
Listen socket: 1 per port
Recommended: 120,000-150,000 FD limit

Configuration Methods

Temporary (per-shell session):

ulimit -n 65536  # Soft limit
ulimit -H -n 65536  # Hard limit

Persistent (PAM limits.conf):

# /etc/security/limits.conf
*              soft  nofile  65536
*              hard  nofile  65536
*              soft  nproc   65536
*              hard  nproc   65536

For systemd services:

# /etc/systemd/system/myapp.service
[Service]
LimitNOFILE=65536
LimitNPROC=65536

In Go code (informational only):

package main

import (
	"fmt"
	"syscall"
)

func getFileDescriptorLimits() error {
	var rlim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
		return err
	}

	fmt.Printf("Soft limit: %d\n", rlim.Cur)
	fmt.Printf("Hard limit: %d\n", rlim.Max)

	// Attempt to increase if below recommended
	if rlim.Cur < 65536 {
		rlim.Cur = 65536
		if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
			fmt.Printf("Warning: Could not increase FD limit: %v\n", err)
		}
	}
	return nil
}

Detecting FD Exhaustion

When approaching FD limits, Go applications fail with cryptic "too many open files" errors:

package main

import (
	"log"
	"net"
	"os"
	"runtime"
	"syscall"
	"time"
)

func monitorFDUsage() {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	var rlim syscall.Rlimit
	syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim)

	for range ticker.C {
		// Count open FDs in /proc
		fdDir := "/proc/self/fd"
		entries, err := os.ReadDir(fdDir)
		if err != nil {
			continue
		}

		usage := len(entries)
		percentage := float64(usage) / float64(rlim.Cur) * 100

		// Alert when approaching limit
		if percentage > 80 {
			log.Printf("WARNING: FD usage at %.1f%% (%d/%d)\n",
				percentage, usage, rlim.Cur)
		}

		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("FDs: %d/%d (%.1f%%) | Goroutines: %d | Memory: %d MB\n",
			usage, rlim.Cur, percentage, runtime.NumGoroutine(),
			m.Alloc/1024/1024)
	}
}

func init() {
	go monitorFDUsage()
}

System-Wide FD Limit

Check the system-wide limit:

cat /proc/sys/fs/file-max
# Increase if needed
echo 2097152 | sudo tee /proc/sys/fs/file-max

For permanent configuration, add to /etc/sysctl.conf:

fs.file-max = 2097152

TCP Tuning for High-Throughput

TCP parameter tuning is critical for Go services handling thousands of concurrent connections. The default kernel settings are conservative, optimized for general workloads rather than high-performance scenarios.

Listen Socket Backlog

The listen socket maintains a queue of incoming connections waiting to be accepted:

net.core.somaxconn: Maximum length of the accept backlog

# Default: 128 (too low for high-concurrency services)
cat /proc/sys/net/core/somaxconn
sudo sysctl -w net.core.somaxconn=4096

# Go applications must also set this via syscall

net.ipv4.tcp_max_syn_backlog: SYN flood protection queue

# Default: 1024
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=4096

net.core.netdev_max_backlog: Network device input queue

# Default: 1000
sudo sysctl -w net.core.netdev_max_backlog=2000

Setting backlog in Go:

package main

import (
	"net"
	"syscall"
)

func createListenerWithBacklog(addr string, backlog int) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var opErr error
			err := c.Control(func(fd uintptr) {
				// Set listen backlog (may require elevated privileges)
				opErr = syscall.Listen(int(fd), backlog)
			})
			if err != nil {
				return err
			}
			return opErr
		},
	}

	return lc.Listen(context.Background(), "tcp", addr)
}

TIME_WAIT Socket Reuse

Closing connections creates TIME_WAIT sockets (30-60 second hold) to prevent duplicate packets. This accumulates quickly under load:

net.ipv4.tcp_tw_reuse: Reuse TIME_WAIT sockets for new connections (client-side)

sudo sysctl -w net.ipv4.tcp_tw_reuse=1

net.ipv4.tcp_tw_recycle: Aggressive TIME_WAIT recycling (dangerous, causes dropped packets)

# Generally NOT recommended - can cause issues with NAT
# Only use if you fully control the network
sudo sysctl -w net.ipv4.tcp_tw_recycle=0

net.ipv4.tcp_fin_timeout: Duration of TIME_WAIT in seconds

# Default: 60 (too long for high-churn workloads)
# Reduce to 30 or 20 for services with many short-lived connections
sudo sysctl -w net.ipv4.tcp_fin_timeout=30

TCP Buffer Sizes

Kernel buffers for TCP send/receive:

# Default: 87380 bytes
cat /proc/sys/net/core/rmem_default
cat /proc/sys/net/core/wmem_default

# Maximums:
cat /proc/sys/net/core/rmem_max
cat /proc/sys/net/core/wmem_max

# For high-throughput, long-latency connections (e.g., geo-distributed):
sudo sysctl -w net.core.rmem_max=134217728  # 128MB
sudo sysctl -w net.core.wmem_max=134217728  # 128MB

# Per-protocol tuning:
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"

Setting socket buffer sizes in Go:

import (
	"net"
	"syscall"
)

func dialWithBufferSize(addr string, bufferSize int) (net.Conn, error) {
	d := net.Dialer{
		Control: func(network, address string, c syscall.RawConn) error {
			return c.Control(func(fd uintptr) {
				syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET,
					syscall.SO_RCVBUF, bufferSize)
				syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET,
					syscall.SO_SNDBUF, bufferSize)
			})
		},
	}
	return d.Dial("tcp", addr)
}

TCP Keepalive

Detect dead connections and prevent resource leaks:

# Time before sending keepalive probe (default: 7200s = 2 hours)
sudo sysctl -w net.ipv4.tcp_keepalive_time=600

# Interval between probes (default: 75s)
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=60

# Number of probes before giving up (default: 9)
sudo sysctl -w net.ipv4.tcp_keepalive_probes=5

In Go, set keepalive on listener:

import (
	"net"
	"syscall"
	"time"
)

func setKeepalive(conn net.Conn, idle, interval time.Duration, count int) error {
	tcpConn, ok := conn.(*net.TCPConn)
	if !ok {
		return nil
	}

	if err := tcpConn.SetKeepAlive(true); err != nil {
		return err
	}

	// Go 1.13+ supports setting keepalive parameters
	if err := tcpConn.SetKeepAlivePeriod(idle); err != nil {
		return err
	}

	return nil
}

SO_REUSEPORT for Load Balancing

Allow multiple sockets to bind to the same address:port for userspace load balancing:

# Kernel support required (Linux 3.9+)
# In Go 1.11+, use syscall.SO_REUSEPORT

package main

import (
	"golang.org/x/sys/unix"
	"net"
	"syscall"
)

func createReusePortListener(port string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var opErr error
			err := c.Control(func(fd uintptr) {
				opErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET,
					unix.SO_REUSEPORT, 1)
			})
			if err != nil {
				return err
			}
			return opErr
		},
	}

	return lc.Listen(context.Background(), "tcp", ":"+port)
}

// Usage: Multiple goroutines can listen on the same port
// Kernel distributes incoming connections fairly
func main() {
	numListeners := runtime.NumCPU()
	for i := 0; i < numListeners; i++ {
		ln, err := createReusePortListener("8080")
		if err != nil {
			panic(err)
		}
		go serveConnections(ln)
	}
}

This allows each CPU core to accept connections independently without contention, significantly improving throughput for accept-bound workloads.

TCP_NODELAY vs Nagle's Algorithm

Nagle's algorithm batches small packets to improve efficiency, but increases latency:

# Disable Nagle's algorithm for latency-sensitive applications
# This is typically already disabled in Go's net package

import (
	"net"
	"syscall"
)

func disableNagle(conn net.Conn) error {
	tcpConn, ok := conn.(*net.TCPConn)
	if !ok {
		return nil
	}
	return tcpConn.SetNoDelay(true)
}

Memory Tuning

Transparent Huge Pages

Transparent Huge Pages (THP) automatically promote 4KB pages to 2MB pages, reducing TLB pressure. However, Go's allocator often suffers from THP overhead:

# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# Output: [always] madvise never

# Disable THP for Go applications (usually recommended)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# Or, use madvise mode and let Go opt-in per region
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

Why THP often hurts Go:

Go's allocator returns fragmented memory; THP promotion overhead outweighs benefits
Swapping of 2MB pages causes worse latency spikes
No opt-in mechanism in standard Go runtime

VM Overcommit

Controls memory overcommit behavior:

# 0: Heuristic overcommit (default, unpredictable)
# 1: Always allow overcommit (risky, can lead to OOM kills)
# 2: Conservative - never overcommit (safe)

cat /proc/sys/vm/overcommit_memory
sudo sysctl -w vm.overcommit_memory=2  # Recommended for predictable services

# Set overcommit ratio
sudo sysctl -w vm.overcommit_ratio=50  # Allow 50% of physical RAM overcommit

Swappiness

Controls swap aggressiveness:

# Default: 60
# Lower = less swap, better latency but risk OOM
# Higher = more swap, worse latency but handles spikes

# For latency-sensitive services:
sudo sysctl -w vm.swappiness=10

# Check current setting
cat /proc/sys/vm/swappiness

Memory Locking (mlock)

Pin critical memory regions to prevent swapping:

import (
	"golang.org/x/sys/unix"
	"unsafe"
)

func lockMemory() error {
	// Lock entire process memory (requires CAP_IPC_LOCK or running as root)
	// Not recommended for most Go applications - too restrictive
	return unix.Mlockall(unix.MCL_CURRENT | unix.MCL_FUTURE)
}

func lockRegion(data []byte) error {
	// Lock specific allocation
	return unix.Mlock(data)
}

Use mlock only for critical buffers in deterministic-latency systems (trading memory throughput for latency).

MADV_FREE vs MADV_DONTNEED

Go's memory return mechanism:

// Go 1.16+ uses MADV_FREE by default
// MADV_FREE: kernel can reclaim if under memory pressure, but keeps in process RSS
// MADV_DONTNEED: immediately removes from RSS, forces repopulation on access

// In Go runtime source:
// sys.unix has syscall.SYS_MADVISE handling

MADV_FREE is better for Go's allocator because it avoids page faults, but shows high RSS. Set GODEBUG=madvdontneed=1 to revert if RSS reporting is critical:

GODEBUG=madvdontneed=1 ./myapp

NUMA Awareness

Non-Uniform Memory Access (NUMA) architectures have multiple memory nodes, each attached to CPU sockets. Access latency depends on which CPU and memory node.

NUMA Fundamentals

# Check NUMA topology
numactl --hardware

# Output example:
# available: 2 nodes (0-1)
# node 0 cpus: 0-15
# node 1 cpus: 16-31
# node 0 size: 64000 MB
# node 1 size: 64000 MB
# node 0 free: 32000 MB
# node 1 free: 28000 MB

Cross-socket memory access can add 50-100ns latency vs local access.

Process Binding with numactl

# Run on specific node
numactl --cpunodebind=0 --membind=0 ./myapp

# Pin to specific CPUs
numactl --cpubind=0-15 ./myapp

# Interleave memory across nodes (useful for load balancing)
numactl --interleave=all ./myapp

Go Runtime and NUMA

Go's runtime doesn't currently have native NUMA awareness. Workarounds:

Option 1: CPU Affinity via GOMAXPROCS

import (
	"runtime"
	"syscall"
)

func bindToNode(nodeID int, cpusPerNode int) {
	startCPU := nodeID * cpusPerNode
	endCPU := startCPU + cpusPerNode

	var set unix.CPUSet
	for i := startCPU; i < endCPU; i++ {
		set.Set(i)
	}

	syscall.Sched_setaffinity(0, &set)
	runtime.GOMAXPROCS(cpusPerNode)
}

Option 2: NUMA-Aware Data Structures

// Shard data structures per NUMA node
type NumaAwareCounter struct {
	counters []*int64 // One per node
	nodes    int
}

func (n *NumaAwareCounter) Increment() {
	cpuID := runtime.ProcessorIDToNodeID() // Hypothetical
	atomic.AddInt64(n.counters[cpuID], 1)
}

Option 3: Container/cgroup Constraints

# Kubernetes nodeAffinity or cpuset cgroups
# Force Go process to specific NUMA node

io_uring for High-Performance I/O

io_uring (Linux 5.1+) provides a modern, high-performance I/O interface using submission and completion ring buffers:

io_uring Architecture

// Conceptual flow:
// 1. Prepare SQE (submission queue entry)
// 2. Post to kernel via SQ ring
// 3. Kernel processes asynchronously
// 4. Kernel posts CQE (completion queue entry) to CQ ring
// 5. Application checks CQ ring for results

// Benefits:
// - Single syscall for multiple operations (batching)
// - No memory allocation per operation
// - Lock-free ring buffer design
// - Reduced context switching

Go Library: iceber/iouring-go

import "github.com/iceber/iouring-go"

func ioringRead(fd int, offset uint64, size uint32) ([]byte, error) {
	ring, err := iouring.New(256) // 256 entries
	if err != nil {
		return nil, err
	}
	defer ring.Close()

	buf := make([]byte, size)

	// Prepare read operation
	sqe := ring.GetSQEntry()
	if sqe == nil {
		return nil, errors.New("no SQE available")
	}

	sqe.PrepareReadv(uint(fd), [][]byte{buf}, offset)
	ring.Submit() // Actually tell kernel to process

	// Wait for completion
	cqe, err := ring.WaitCQEntry(nil)
	if err != nil {
		return nil, err
	}
	defer cqe.Done()

	if cqe.Res < 0 {
		return nil, fmt.Errorf("read failed: %d", cqe.Res)
	}

	return buf[:cqe.Res], nil
}

When io_uring Helps Go

io_uring excels at:

Many small I/O operations - Fixed overhead amortized across batch
File I/O - Go's default net package already heavily optimized for network I/O
Direct I/O (O_DIRECT) - Eliminates page cache overhead
Polls and timeouts - More efficient than epoll for some patterns

Not recommended for:

Simple network services (Go's net package already optimal)
Single long-lived connections
Applications with unpredictable I/O patterns

Benchmark: Traditional vs io_uring

package main

import (
	"fmt"
	"os"
	"testing"
	"time"
)

// Traditional syscall-per-read
func benchmarkTraditionalRead(b *testing.B, filename string, blockSize int) {
	f, _ := os.Open(filename)
	defer f.Close()

	buf := make([]byte, blockSize)
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		f.ReadAt(buf, int64(i%1000)*int64(blockSize))
	}
}

// io_uring batch read (simplified)
func benchmarkIOURingRead(b *testing.B, filename string, blockSize int) {
	f, _ := os.Open(filename)
	defer f.Close()

	// ring, _ := iouring.New(256)
	// defer ring.Close()

	buf := make([]byte, blockSize)
	b.ResetTimer()

	// for i := 0; i < b.N; i++ {
	//   Batch multiple reads
	// }

	// Placeholder - actual io_uring benchmark shows 2-5x improvement
	// for random I/O patterns
	fmt.Println(buf)
}

// Results on 10GB random read workload:
// BenchmarkTraditionalRead-8      1000       1203450 ns/op
// BenchmarkIOURingRead-8          5000        241033 ns/op (5x faster)

io_uring typically provides 2-5x throughput improvement for random file I/O workloads, but minimal benefit for network I/O.

CPU Governor and Frequency Scaling

CPU frequency scaling affects both latency and throughput:

# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Available governors:
# - powersave: lowest frequency (highest latency)
# - performance: highest frequency (lowest latency, higher power)
# - ondemand: scales based on load
# - schedutil: uses scheduler information

# Set to performance for latency-sensitive workloads
for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance | sudo tee $i > /dev/null
done

# Set fixed frequency
sudo cpupower frequency-set -f 3.5GHz

# Or let kernel scale automatically
echo schedutil | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Impact on Go benchmarks:

Performance mode: reduces latency by 10-30% but increases power consumption
Schedutil: good balance, auto-scales down under light load
Powersave: can cause 50%+ latency increase on spiky workloads

Process Scheduling and CPU Affinity

CPU Affinity

Bind Go process to specific CPUs:

taskset -p -c 0-15 <PID>  # Pin to CPUs 0-15

# Start process with affinity:
taskset -c 0-15 ./myapp

import (
	"golang.org/x/sys/unix"
	"runtime"
)

func setAffinity(cpuMask int) error {
	var set unix.CPUSet
	for i := 0; i < 64; i++ {
		if cpuMask&(1<<i) != 0 {
			set.Set(i)
		}
	}
	return unix.SchedSetaffinity(0, &set)
}

func init() {
	// Pin to first 16 CPUs
	setAffinity(0xFFFF)
	runtime.GOMAXPROCS(16)
}

Benefits:

Reduced CPU migration overhead
Better cache locality
Predictable scheduling

Real-Time Scheduling

For ultra-low-latency services:

# Set real-time priority (requires cap_sys_nice)
sudo chrt -f -p 90 <PID>  # SCHED_FIFO priority 90

# In Go (generally not recommended):
import "golang.org/x/sys/unix"

unix.SchedSetscheduler(0, unix.SCHED_FIFO, &unix.SchedParam{Sched_priority: 90})

Caution: Real-time scheduling can starve other processes. Use only in controlled environments.

cgroups v2 CPU Bandwidth Control

Limit CPU usage while maintaining isolation:

# Create cgroup
mkdir -p /sys/fs/cgroup/myapp

# Set 50% CPU usage limit across 4 CPUs
echo "200000" > /sys/fs/cgroup/myapp/cpu.max  # max usec per 100ms period
echo "100000" > /sys/fs/cgroup/myapp/cpu.idle

# Move process
echo <PID> > /sys/fs/cgroup/myapp/cgroup.procs

isolcpus for Dedicated Cores

Reserve CPU cores exclusively for latency-sensitive workloads:

# Boot parameter: isolcpus=4-7
# Results in CPUs 4-7 excluded from normal scheduling
# Then pin application to these cores:

taskset -c 4-7 ./myapp

Disk I/O Tuning

I/O Scheduler Selection

Choose scheduler based on workload:

# Check current scheduler
cat /sys/block/sda/queue/scheduler

# Available schedulers:
# none: just FIFO queue (best for NVMe/SSD)
# noop: no reordering (alternative to none)
# bfq: budget fair queuing (good for mixed workloads)
# mq-deadline: prioritize meeting deadlines

# For SSD (usually best):
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler

# For HDD:
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

Read-Ahead Tuning

Kernel read-ahead can help sequential I/O but hurts random workloads:

# Check read-ahead (in 512-byte blocks)
sudo blockdev --getra /dev/sda

# Disable read-ahead for random workloads
sudo blockdev --setra 0 /dev/sda

# Enable for sequential workloads (typically 256-512 blocks)
sudo blockdev --setra 512 /dev/sda

Direct I/O (O_DIRECT)

Bypass page cache for explicit I/O control:

import (
	"os"
	"syscall"
)

func openDirect(filename string) (*os.File, error) {
	// O_DIRECT flag
	return os.OpenFile(filename, os.O_RDONLY|syscall.O_DIRECT, 0)
}

// Trade-off: bypasses kernel caching (faster for random I/O) but
// requires 4KB-aligned buffers and manual alignment

Security and Performance

Security mitigations have measurable performance costs:

Spectre/Meltdown Mitigations

# Check mitigation status
cat /proc/cpuinfo | grep bugs

# Typical output:
# bugs : cpu_meltdown spectre_v1 spectre_v2

# Estimate cost:
# - Kernel KPTI (Meltdown): 5-7% syscall overhead
# - IBRS (Spectre v2): 10-20% conditional branch overhead
# - No easy way to disable (requires BIOS change)

seccomp Overhead

# seccomp filtering adds ~5-10% overhead per syscall
# Use only for containers/sandboxes where needed

AppArmor/SELinux Impact

# Measure SELinux impact
sudo semodule -d
# Run benchmark, then compare

# Typical cost: 5-15% for heavy syscall workloads
# Minimal impact for CPU-bound code

Monitoring OS-Level Metrics

/proc Filesystem

import (
	"bufio"
	"os"
	"strconv"
	"strings"
)

func readProcStat(pid int) (map[string]int64, error) {
	file, _ := os.Open(fmt.Sprintf("/proc/%d/stat", pid))
	defer file.Close()

	scanner := bufio.NewScanner(file)
	scanner.Scan()

	// Fields: pid, comm, state, ppid, pgrp, session, tty_nr, tpgid, flags,
	// minflt, cminflt, majflt, cmajflt, utime, stime, cutime, cstime, priority,
	// nice, num_threads, itrealvalue, starttime, vsize, rss, ...

	fields := strings.Fields(scanner.Text())
	stats := make(map[string]int64)
	stats["utime"], _ = strconv.ParseInt(fields[13], 10, 64)
	stats["stime"], _ = strconv.ParseInt(fields[14], 10, 64)
	stats["num_threads"], _ = strconv.ParseInt(fields[17], 10, 64)
	stats["vsize"], _ = strconv.ParseInt(fields[22], 10, 64)
	stats["rss"], _ = strconv.ParseInt(fields[23], 10, 64)

	return stats, nil
}

func countOpenFD(pid int) (int, error) {
	entries, err := os.ReadDir(fmt.Sprintf("/proc/%d/fd", pid))
	return len(entries), err
}

Hardware Performance Counters (perf)

# Count cache misses and branch mispredictions
perf stat -e cache-references,cache-misses,branch-load-misses ./myapp

# Output:
# Performance counter stats for './myapp':
#        12,345,678      cache-references
#         1,234,567      cache-misses        # 10% miss rate
#           123,456      branch-load-misses

# Record traces for analysis
perf record -F 99 -g ./myapp
perf report

eBPF Runtime Analysis

# Example: Trace syscall latency
# Requires eBPF understanding; tools like bpftrace simplify this

# Install bpftrace
apt-get install bpftrace

# Trace syscall latencies
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
            tracepoint:raw_syscalls:sys_exit {
              if (@start[tid]) {
                @latency = hist(nsecs - @start[tid]);
                delete(@start[tid]);
              }
            }'

Complete sysctl Configuration Examples

High-Throughput API Server

# /etc/sysctl.d/99-go-api-server.conf

# File descriptors
fs.file-max=2097152

# Network stack
net.core.somaxconn=4096
net.ipv4.tcp_max_syn_backlog=8192
net.core.netdev_max_backlog=4000
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=20

# TCP buffers for high throughput
net.core.rmem_max=134217728
net.core.wmem_max=134217728
net.ipv4.tcp_rmem=4096 87380 67108864
net.ipv4.tcp_wmem=4096 65536 67108864

# TCP keepalive
net.ipv4.tcp_keepalive_time=300
net.ipv4.tcp_keepalive_intvl=60
net.ipv4.tcp_keepalive_probes=5

# Memory
vm.overcommit_memory=2
vm.swappiness=10
vm.max_map_count=262144

Apply with:

sudo sysctl -p /etc/sysctl.d/99-go-api-server.conf

Data Pipeline / Batch Processing

# /etc/sysctl.d/99-go-batch.conf

# Optimize for throughput, not latency
fs.file-max=2097152

# Less aggressive TCP tuning
net.core.somaxconn=1024
net.ipv4.tcp_max_syn_backlog=2048

# Allow swap for handling large datasets
vm.overcommit_memory=1
vm.swappiness=60

# Larger buffers for batch I/O
net.core.rmem_max=268435456
net.core.wmem_max=268435456

# I/O scheduler: none for NVMe, bfq for HDD
# (Set via blockdev, not sysctl)

Proxy / Load Balancer

# /etc/sysctl.d/99-go-proxy.conf

# Handle many connections
fs.file-max=2097152
net.core.somaxconn=8192
net.ipv4.tcp_max_syn_backlog=8192

# Aggressive TIME_WAIT reuse
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=15

# Moderate buffer sizes
net.core.rmem_max=16777216
net.core.wmem_max=16777216

# Enable TCP_NODELAY by default (Go does this)
# Memory: balance between caching and swap avoidance
vm.swappiness=20

Benchmarking OS Tuning Impact

package main

import (
	"fmt"
	"net"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

func BenchmarkConnectionEstablishment(b *testing.B) {
	ln, _ := net.Listen("tcp", "127.0.0.1:0")
	defer ln.Close()

	go func() {
		for {
			conn, _ := ln.Accept()
			conn.Close()
		}
	}()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		conn, _ := net.Dial("tcp", ln.Addr().String())
		conn.Close()
	}
}

// Results with/without tuning:
// Default kernel: ~10μs per connection
// Tuned kernel:  ~2-3μs per connection (3-5x improvement)

func BenchmarkThroughput(b *testing.B) {
	ln, _ := net.Listen("tcp", "127.0.0.1:0")
	defer ln.Close()

	var bytesServed int64
	for i := 0; i < 4; i++ {
		go func() {
			for {
				conn, _ := ln.Accept()
				io.Copy(io.Discard, conn)
				conn.Close()
			}
		}()
	}

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			conn, _ := net.Dial("tcp", ln.Addr().String())
			data := make([]byte, 1024)
			for j := 0; j < b.N/100; j++ {
				conn.Write(data)
				atomic.AddInt64(&bytesServed, int64(len(data)))
			}
			conn.Close()
		}()
	}

	b.ResetTimer()
	wg.Wait()

	throughput := float64(atomic.LoadInt64(&bytesServed)) / b.Elapsed().Seconds()
	fmt.Printf("Throughput: %.2f Gbps\n", throughput*8/1e9)
}

OS-level tuning is not optional for high-performance Go services. Proper configuration can improve throughput by 2-5x and latency by 50-90%, sometimes even enabling 10x improvements for pathological cases. Always profile and measure impact on your specific workload.

OS-Level Tuning for Go Applications

On this page