Observability Overhead and Optimization
Managing the performance cost of observability — OpenTelemetry, Prometheus metrics, distributed tracing, and strategies for minimizing monitoring overhead in Go services.
Observability is essential for production systems, but it has real costs. Every metric recorded, every span created, every log written consumes CPU cycles, memory, and network bandwidth. At scale, naive observability implementations can reduce application throughput by 10-30%. This article explores the performance implications of different observability approaches and strategies to minimize overhead while maintaining insight.
The Observability Tax
Observability costs accumulate in several dimensions:
CPU Overhead
Metric computation and exporting consume significant CPU:
// Naive metric recording per request
func handleRequest(w http.ResponseWriter, r *http.Request) {
start := time.Now()
defer func() {
// This runs for EVERY request
duration := time.Since(start).Seconds()
histogram.Observe(duration) // CPU cost: ~5-20μs per operation
counter.Inc() // CPU cost: ~1-5μs per operation
}()
// ... handle request
}
// At 100k RPS:
// - Histogram observation: 100k * 15μs = 1.5s CPU per second
// - Likely 5-10% of total application CPU
Typical CPU costs:
- Counter increment: 1-5 microseconds
- Histogram observation: 5-20 microseconds
- Span creation: 20-100 microseconds
- Context propagation: 5-15 microseconds
- Sampled trace export: 50-500 microseconds (if sampled)
Memory Pressure
Observability systems allocate memory for buffers, trace contexts, and metrics:
// Per-request allocations
type TraceContext struct {
TraceID [16]byte
SpanID [8]byte
ParentID [8]byte
Flags uint8
Attributes map[string]interface{} // High allocation pressure!
}
// For 100k RPS with tracing, this adds:
// - Context allocation: 100k * 200 bytes = 20MB/s allocation rate
// - GC pauses increase proportionally
Batch span processor buffers:
processor := sdktrace.NewBatchSpanProcessor(exporter,
sdktrace.WithMaxQueueSize(2048), // Memory per queue
sdktrace.WithMaxExportBatchSize(512),
)
// Memory cost: 2048 spans * 1KB per span ≈ 2MB baseline
Histogram buckets:
// Example: Latency histogram with excessive buckets
histogram := prometheus.NewHistogram(prometheus.HistogramOpts{
Buckets: []float64{.001, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 10, 25, 50, 100},
// 16 buckets + 1 (+Inf) + sum + count = 19 series slots per histogram
})
// Memory cost scales with buckets * label cardinality
// With 10 label combinations: 19 slots * 10 * 8 bytes ≈ 1.5KB for this one metric
Network Bandwidth
Exporting observability data consumes network capacity:
Prometheus scrape (per metric):
- Counter: ~50 bytes
- Histogram (16 buckets): ~800 bytes
- Gauge: ~50 bytes
100k metrics per service * 50 bytes = 5MB per scrape
Every 15 seconds = ~2.7 Mbps baseline
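These are rough estimates; the real payload depends on metric names, labels, and help text. A quick way to check your actual scrape size is to exercise the /metrics handler in a test. A minimal sketch using promhttp and httptest against the default registry (the function name is illustrative):
import (
    "fmt"
    "net/http/httptest"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// reportScrapeSize serves one scrape against the default registry and
// reports the uncompressed payload size.
func reportScrapeSize() {
    req := httptest.NewRequest("GET", "/metrics", nil)
    rec := httptest.NewRecorder()
    promhttp.Handler().ServeHTTP(rec, req)
    fmt.Printf("scrape payload: %d bytes\n", rec.Body.Len())
}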
OTLP trace export (per span):
- Minimal span: ~200 bytes
- Rich span with attributes: ~2KB
- 10k spans/sec * 1KB = ~80 Mbps (before compression)
I/O for Logging
Synchronous logging writes block the request path:
// Blocking log write
func (h *Handler) logRequest(r *http.Request) {
h.logger.Info("request",
zap.String("method", r.Method),
zap.String("path", r.URL.Path),
zap.String("client", r.RemoteAddr),
// ...
)
// Cost: 100-1000μs per call (disk I/O)
}
Disk write latencies:
- SSD: 50-500 microseconds
- HDD: 1-10 milliseconds
- Network filesystem: 10-100+ milliseconds
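One common mitigation is to keep the write off the request path by buffering log output and flushing in the background. A minimal sketch using zap's BufferedWriteSyncer; the buffer size, flush interval, and function name are illustrative:
import (
    "os"
    "time"

    "go.uber.org/zap"
    "go.uber.org/zap/zapcore"
)

func newBufferedLogger() *zap.Logger {
    ws := &zapcore.BufferedWriteSyncer{
        WS:            zapcore.AddSync(os.Stdout), // or a log file
        Size:          256 * 1024,                 // buffer up to 256KB before writing
        FlushInterval: time.Second,                // flush at least once per second
    }
    core := zapcore.NewCore(
        zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig()),
        ws,
        zapcore.InfoLevel,
    )
    return zap.New(core)
    // Log calls now cost an in-memory copy; disk I/O happens on flush.
    // Trade-off: up to FlushInterval worth of logs can be lost on a crash.
}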
Prometheus Client Performance
The Prometheus Go client is optimized but still has measurable costs.
Architecture Overview
// prometheus/client_golang structure
type Registry struct {
mtx sync.RWMutex
collectorsByID map[uint64]Collector
collectorsByAux map[string][]Collector
// ...
}
type Collector interface {
Describe(chan<- *Desc)
Collect(chan<- Metric)
}
// Examples:
// - Counter: Atomic increment
// - Gauge: atomic bit-cast float64 value
// - Histogram: per-bucket atomic counters
// - Summary: streaming quantiles under a mutex (most expensive)
Counter vs Gauge vs Histogram vs Summary
Performance characteristics:
import (
"testing"
"github.com/prometheus/client_golang/prometheus"
)
var (
counter = prometheus.NewCounter(prometheus.CounterOpts{
Name: "requests_total",
})
gauge = prometheus.NewGauge(prometheus.GaugeOpts{
Name: "temperature_celsius",
})
histogram = prometheus.NewHistogram(prometheus.HistogramOpts{
Name: "request_duration_seconds",
Buckets: prometheus.DefBuckets, // 11 buckets
})
summary = prometheus.NewSummary(prometheus.SummaryOpts{
Name: "request_duration_seconds",
Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
})
)
// Benchmark results (ns/operation):
// BenchmarkCounterInc 10 // atomic add
// BenchmarkGaugeSet 100 // atomic float64 store
// BenchmarkHistogramObserve 500 // bucket search + atomic updates
// BenchmarkSummaryObserve 1000 // streaming quantile computation
func BenchmarkMetricsRecording(b *testing.B) {
tests := []struct {
name string
fn func()
}{
{"Counter", func() { counter.Inc() }},
{"Gauge", func() { gauge.Set(42.0) }},
{"Histogram", func() { histogram.Observe(0.123) }},
{"Summary", func() { summary.Observe(0.123) }},
}
for _, test := range tests {
b.Run(test.name, func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
test.fn()
}
})
}
}
// Results on modern hardware:
// Counter: 10-15 ns/op, 0 allocs
// Gauge: 80-120 ns/op, 0 allocs
// Histogram: 400-600 ns/op, 0 allocs
// Summary: 800-1200 ns/op, 0 allocs
Key insight: Histogram is 40-60x slower than counter but still acceptable. Summary should be avoided in hot paths.
Label Cardinality: The Performance Killer
High-cardinality labels (many unique values) cause exponential memory and CPU growth:
// DANGER: User ID as label (100k+ unique values)
requestLatency := prometheus.NewHistogramVec(
prometheus.HistogramOpts{Name: "request_latency_seconds"},
[]string{"method", "path", "user_id"}, // user_id is high-cardinality!
)
// Memory cost per unique combination:
// 10 methods * 100 paths * 100,000 users = 100 million metric series
// Each histogram with 11 buckets: 100M * 11 * 8 bytes ≈ 8.8GB memory!
// SAFE: Keep the user ID out of labels; attach it as an exemplar instead
requestLatency := prometheus.NewHistogram(
prometheus.HistogramOpts{Name: "request_latency_seconds"},
)
// Later, when recording, attach the user ID as an exemplar (not a label):
requestLatency.(prometheus.ExemplarObserver).ObserveWithExemplar(
duration, prometheus.Labels{"user_id": userID},
)
Cardinality explosion warning signs:
- Scrape duration increasing (>10 seconds)
- Prometheus memory usage growing (>2GB)
- High CPU utilization during scrape
Detection in Go:
func monitorMetricCardinality(g prometheus.Gatherer) {
// Count the currently exposed series by gathering the registry
families, err := g.Gather()
if err != nil {
return
}
totalSeries := 0
for _, family := range families {
totalSeries += len(family.Metric)
}
log.Printf("active series: %d", totalSeries)
}
// Better: watch prometheus_tsdb_head_series on the Prometheus server itself
// to track series growth
Rules for label cardinality:
- Method: ~10-20 values (safe)
- Path: ~50-200 values (usually safe)
- Status code: 3-5 values (safe)
- Host/instance: ~10-100 values (safe)
- User ID: 100k+ values (NEVER as label)
- Customer/tenant ID: only safe while the tenant count stays small (roughly < 100)
- Request ID: unlimited values (NEVER as label)
Histogram Bucket Configuration
The number of buckets dramatically affects memory and scrape time:
// Default buckets (11 + Inf)
prometheus.DefBuckets
// [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, +Inf]
// Custom buckets for tight latency control
tightLatencyBuckets := prometheus.LinearBuckets(0.001, 0.001, 100) // 0.001s to 0.1s
// Memory impact:
// - 11 buckets: 11 * 8 bytes = 88 bytes per time series
// - 100 buckets: 100 * 8 bytes = 800 bytes per time series
// - 1000 buckets: 1000 * 8 bytes = 8KB per time series
// With 1000 metric series:
// - 11 buckets: 88KB total
// - 100 buckets: 800KB total
// - 1000 buckets: 8MB total
Best practices:
- Use default buckets for most workloads
- Custom buckets only when you need precision in specific ranges
- Limit custom buckets to 20-50 for latency histograms (see the sketch after this list)
- Avoid histogram for unbounded dimensions (IDs, names)
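For example, exponentially spaced buckets cover a wide latency range with far fewer slots than linear buckets. A small sketch; the range and metric name are illustrative:
// 15 buckets doubling from 1ms up to ~16s: 0.001, 0.002, 0.004, ..., 16.384
latencyBuckets := prometheus.ExponentialBuckets(0.001, 2, 15)

requestDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Buckets: latencyBuckets,
})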
Custom Collectors vs Auto-Registered Metrics
Custom collectors defer computation until scrape time:
// Lazy collection (recommended for expensive metrics)
type CustomCollector struct{}
func (c *CustomCollector) Describe(ch chan<- *prometheus.Desc) {
ch <- prometheus.NewDesc(
"expensive_metric_total",
"Only computed during scrape",
nil, nil,
)
}
func (c *CustomCollector) Collect(ch chan<- prometheus.Metric) {
// This function runs once every 15 seconds (at scrape time)
// Not on every request
result := expensiveComputation()
ch <- prometheus.MustNewConstMetric(
prometheus.NewDesc(...),
prometheus.GaugeValue,
float64(result),
)
}
// Register once
prometheus.MustRegister(&CustomCollector{})
// vs.
// Eager recording (expensive in hot path)
expensiveMetric := prometheus.NewGauge(prometheus.GaugeOpts{
Name: "expensive_metric",
})
func handleRequest() {
// This runs per-request!
expensiveMetric.Set(float64(expensiveComputation()))
}
When to use custom collectors:
- Metrics based on system calls (/proc, cgroups)
- Database queries (see the sketch after these lists)
- Complex aggregations
- Metrics computed from other metrics
Avoid custom collectors for:
- Per-request metrics (latency, counts)
- High-frequency updates
- Metrics that need per-instance precision
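As an example of the "database queries" case, a collector can expose connection-pool statistics that are read only at scrape time rather than per request. A sketch; the type and metric names are illustrative, and recent client_golang versions also ship a ready-made DB stats collector in the collectors subpackage:
import (
    "database/sql"

    "github.com/prometheus/client_golang/prometheus"
)

type dbStatsCollector struct {
    db       *sql.DB
    openDesc *prometheus.Desc
    idleDesc *prometheus.Desc
}

func newDBStatsCollector(db *sql.DB) *dbStatsCollector {
    return &dbStatsCollector{
        db:       db,
        openDesc: prometheus.NewDesc("db_connections_open", "Open connections", nil, nil),
        idleDesc: prometheus.NewDesc("db_connections_idle", "Idle connections", nil, nil),
    }
}

func (c *dbStatsCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- c.openDesc
    ch <- c.idleDesc
}

func (c *dbStatsCollector) Collect(ch chan<- prometheus.Metric) {
    stats := c.db.Stats() // one cheap call per scrape, not per request
    ch <- prometheus.MustNewConstMetric(c.openDesc, prometheus.GaugeValue, float64(stats.OpenConnections))
    ch <- prometheus.MustNewConstMetric(c.idleDesc, prometheus.GaugeValue, float64(stats.Idle))
}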
Scrape Duration Optimization
Prometheus scrape time impacts your application:
# Monitor scrape duration in Prometheus
# Query: max_over_time(scrape_duration_seconds[5m])
# Slow scrape causes:
# - Full GC before scrape (if metrics cause allocation)
# - Repeated metric gathering
# - Network serialization overhead
# Example scrape sizes:
# 10 metrics: <1KB
# 1k metrics: ~50KB
# 100k metrics: ~5MB (scrape takes >1s!)
// Measure metrics generation time
import (
"log"
"time"
"github.com/prometheus/client_golang/prometheus"
)
// prometheus.Gatherer exposes Gather() ([]*dto.MetricFamily, error)
func measureGatherTime(g prometheus.Gatherer) {
start := time.Now()
families, _ := g.Gather()
duration := time.Since(start)
totalSeries := 0
for _, family := range families {
totalSeries += len(family.Metric)
}
log.Printf("Scrape took %v for %d series\n", duration, totalSeries)
// Expect: <100ms for 10k series
// Warn: >1s for 100k series
}
OpenTelemetry Performance
OpenTelemetry provides standards-based observability but introduces overhead. The SDK architecture significantly impacts performance.
SDK Architecture
// Simplified OTel flow
// 1. Tracer creates spans
// 2. SpanProcessor processes/exports spans
// 3. Exporter sends to backend
type TracerProvider struct {
activeSpanProcessor []SpanProcessor
// ...
}
type SpanProcessor interface {
OnStart(ctx context.Context, span ReadWriteSpan)
OnEnd(span ReadOnlySpan)
Shutdown(ctx context.Context) error
}
type Exporter interface {
ExportSpans(ctx context.Context, spans []ReadOnlySpan) error
Shutdown(ctx context.Context) error
}
// Two primary SpanProcessors:
// 1. SimpleSpanProcessor: Synchronous, blocking
type SimpleSpanProcessor struct {
exporter SpanExporter
}
// OnEnd immediately calls exporter (blocks the span!)
// 2. BatchSpanProcessor: Asynchronous buffering
type BatchSpanProcessor struct {
queue chan *SpanSnapshot
batchSize int
batchTimeout time.Duration
exporter SpanExporter
}
// OnEnd queues span, background worker batches and exports
SimpleSpanProcessor vs BatchSpanProcessor
Performance comparison:
import (
"context"
"testing"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/sdk/trace"
)
func benchmarkSpanProcessor(b *testing.B, processor trace.SpanProcessor) {
tp := trace.NewTracerProvider(
trace.WithSpanProcessor(processor),
)
defer tp.Shutdown(context.Background())
tracer := tp.Tracer("bench")
b.ResetTimer()
b.ReportAllocs()
for i := 0; i < b.N; i++ {
_, span := tracer.Start(context.Background(), "operation")
// Simulate work
span.AddEvent("step1")
span.SetAttributes(attribute.String("step", "one"))
span.End()
}
}
// Results:
// SimpleSpanProcessor:
// BenchmarkSimple 50000 25000 ns/op (25μs per span)
// - Synchronous: blocks until export completes
// - Exporter latency directly impacts request latency
// - Network timeout = request timeout
// BatchSpanProcessor:
// BenchmarkBatch 1000000 1200 ns/op (1.2μs per span)
// - Asynchronous: returns immediately
// - Buffering adds memory
// - Export failures don't block requests
// Difference: 20x faster with batching
Recommendation: Always use BatchSpanProcessor in production.
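A typical production setup wires an OTLP exporter behind a BatchSpanProcessor with explicit queue and batch limits. A sketch; the endpoint, limits, and function name are placeholders to tune for your traffic:
import (
    "context"
    "time"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

func newTracerProvider(ctx context.Context) (*trace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("localhost:4317"), // local agent or collector
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
    bsp := trace.NewBatchSpanProcessor(exporter,
        trace.WithMaxQueueSize(2048),          // spans buffered before new ones are dropped
        trace.WithMaxExportBatchSize(512),     // spans per export call
        trace.WithBatchTimeout(5*time.Second), // flush at least this often
    )
    return trace.NewTracerProvider(trace.WithSpanProcessor(bsp)), nil
}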
Sampling Strategies
Sampling is critical for cost control. Without sampling, distributed tracing can generate terabytes of data daily.
Sampling approaches:
import (
"go.opentelemetry.io/otel/sdk/trace"
)
// 1. AlwaysSample: Every request traced
// CPU cost: 100% of tracing overhead
// Data cost: Proportional to traffic
sampler1 := trace.AlwaysSample()
// 2. NeverSample: Never trace (easy to leave enabled by mistake!)
sampler2 := trace.NeverSample()
// 3. TraceIDRatioBased: Sample by trace ID
// Cost: Proportional to sampling ratio
// The sampler compares the trace ID against a threshold derived from the ratio
sampler3 := trace.TraceIDRatioBased(0.1) // 10% sampling
// 4. ParentBased: Respect parent sampling decision
// Cost: Varies by upstream decision
sampler4 := trace.ParentBased(
trace.TraceIDRatioBased(0.1),
// If the parent is sampled, sample; if not, don't
// (the 10% ratio applies only to root spans)
)
Head vs Tail Sampling
Head sampling (at request start):
// Decide before processing the request (schematic: random() is a uniform draw in
// [0,1); a real sdktrace.Sampler returns a SamplingResult, not this SamplingDecision)
type HeadSampler struct {
ratio float64
}
func (s *HeadSampler) ShouldSample(parameters SamplingParameters) SamplingDecision {
return SamplingDecision{
Sample: random() < s.ratio,
// Can't use result of request (not available yet)
}
}
// Pro: Low overhead, samples evenly
// Con: Can't sample errors preferentially
Tail sampling (after request completes):
// Decide after knowing outcome
func (s *ServiceController) exportSpans(spans []ReadOnlySpan) {
for _, span := range spans {
if shouldExport(span) { // Check attributes, duration, errors
s.exporter.ExportSpans(context.Background(), []ReadOnlySpan{span})
}
}
}
func shouldExport(span ReadOnlySpan) bool {
// Sample errors with 100% probability
if span.Status().Code == codes.Error {
return true
}
// Sample slow requests with 10% probability
duration := span.EndTime().Sub(span.StartTime())
if duration > 1*time.Second && random() < 0.1 {
return true
}
// Sample 1% of normal requests
return random() < 0.01
}
// Pro: Smarter sampling, captures interesting cases
// Con: Still processes all spans in memory before decision
Adaptive Sampling for Cost Control
// Adjust sampling ratio based on current cost
type AdaptiveSampler struct {
currentRatio float64
targetQPS int64
actualQPS int64
mu sync.RWMutex
}
func (s *AdaptiveSampler) AdjustRatio() {
s.mu.Lock()
defer s.mu.Unlock()
if s.actualQPS > s.targetQPS {
// Reduce sampling ratio
s.currentRatio *= 0.95
} else if s.actualQPS < s.targetQPS/2 {
// Increase sampling ratio
s.currentRatio *= 1.05
}
// Clamp to [0, 1]
if s.currentRatio > 1.0 {
s.currentRatio = 1.0
}
}
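// (Assumed: AdjustRatio runs on a ticker, and actualQPS is updated elsewhere,
// e.g. from an atomic counter of sampled spans that is reset each second.)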
func (s *AdaptiveSampler) ShouldSample(parameters SamplingParameters) SamplingDecision {
s.mu.RLock()
ratio := s.currentRatio
s.mu.RUnlock()
return SamplingDecision{
Sample: random() < ratio,
}
}
// Automatically maintain fixed tracing volume regardless of traffic
Span Attribute Costs
Adding attributes allocates memory:
import "sync"
// Span attribute storage
type recordedAttribute struct {
key string
value interface{}
}
// Each attribute allocation:
// - String key: 16 bytes (pointer)
// - Interface value: 16 bytes (type + value)
// - Map entry: ~56 bytes overhead
// Total: ~100 bytes per attribute
span.SetAttributes(
attribute.String("user_id", userID), // +100 bytes
attribute.Int("items_count", count), // +100 bytes
attribute.String("request_path", path), // +100 bytes
)
// For 100k requests/sec with 10 attributes:
// 100k * 10 * 100 bytes = 100MB/sec allocation
// Significant GC pressure!
// Optimization: Use only essential attributes
span.SetAttributes(attribute.String("user_type", userType)) // Categorical: few unique values
// Skip: user_id, request_id, timestamps (queryable separately)
Context Propagation Overhead
Propagating trace context across calls:
// W3C Trace Context header example:
// traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
func (p *W3CTraceContextPropagator) Inject(ctx context.Context, carrier TextMapCarrier) {
// Allocate headers, format strings: ~1-5μs per call
}
func (p *W3CTraceContextPropagator) Extract(ctx context.Context, carrier TextMapCarrier) {
// Parse headers, validate: ~1-5μs per call
}
// For 100k requests/sec:
// 100k * 5μs = 0.5 seconds CPU per second = 5% overhead
// Optimization: Use fast binary context propagation if possible
// Jaeger binary propagation: ~1μs (vs 5μs text)
OTel Collector: Agent vs Gateway Mode
Agent mode (sidecar on each host):
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: localhost:4317 # Local, low latency
exporters:
jaeger:
endpoint: jaeger-gateway:14250 # Single remote endpoint
service:
pipelines:
traces:
receivers: [otlp]
exporters: [jaeger]
Cost: Low latency to local collector, but many collector instances
Gateway mode (centralized):
# Applications export directly to gateway
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317 # Accept from any app
exporters:
jaeger:
endpoint: jaeger-storage:14250
Cost: Higher latency from app to remote collector, but fewer instances
Benchmark:
Agent mode (local collector):
- Export latency: <1ms
- Network: 1k apps * 10MB/sec = 10GB/sec local traffic
Gateway mode (centralized):
- Export latency: 50-100ms
- Network: 10GB/sec across datacenter
Benchmark: Traced vs Untraced Latency
import (
"context"
"testing"
"go.opentelemetry.io/otel/sdk/trace"
)
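// doWork and noopExporter are assumed helpers for this benchmark: doWork
// simulates request handling, and noopExporter is a trace.SpanExporter whose
// ExportSpans discards everything.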
func BenchmarkWithoutTracing(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
ctx := context.Background()
doWork(ctx)
}
}
func BenchmarkWithSimpleTracing(b *testing.B) {
tp := trace.NewTracerProvider(
trace.WithSpanProcessor(
trace.NewSimpleSpanProcessor(&noopExporter{}),
),
)
defer tp.Shutdown(context.Background())
tracer := tp.Tracer("bench")
b.ReportAllocs()
for i := 0; i < b.N; i++ {
ctx, span := tracer.Start(context.Background(), "operation")
doWork(ctx)
span.End()
}
}
func BenchmarkWithBatchTracing(b *testing.B) {
tp := trace.NewTracerProvider(
trace.WithSpanProcessor(
trace.NewBatchSpanProcessor(&noopExporter{}),
),
)
defer tp.Shutdown(context.Background())
tracer := tp.Tracer("bench")
b.ReportAllocs()
for i := 0; i < b.N; i++ {
ctx, span := tracer.Start(context.Background(), "operation")
doWork(ctx)
span.End()
}
}
// Results:
// BenchmarkWithoutTracing 100000 9500 ns/op (baseline)
// BenchmarkWithSimpleTracing 20000 55000 ns/op (5.8x slower)
// BenchmarkWithBatchTracing 80000 11500 ns/op (1.2x slower)
// Key insight: Use BatchSpanProcessor, not SimpleSpanProcessor
Distributed Tracing Optimization
Span Creation Overhead and Pooling
Creating spans allocates memory:
// Span creation cost: ~100-200ns per span
// Allocation: ~500 bytes per span
type recordedSpan struct {
spanContext SpanContext
startTime time.Time
endTime time.Time
attributes []recordedAttribute
events []recordedEvent
links []link
status Status
childSpanCount int32
// ... more fields
}
// Optimization: Sync.Pool for span recycling
var spanPool = sync.Pool{
New: func() interface{} {
return &recordedSpan{}
},
}
// Only helps with custom implementations
// Standard OTel SDK doesn't expose pooling API
When to Create Spans
Creating a span for every function is excessive:
// EXCESSIVE: Span per function
func handleRequest(w http.ResponseWriter, r *http.Request) {
_, span := tracer.Start(r.Context(), "handleRequest")
defer span.End()
data := fetchData() // Span?
result := process(data) // Span?
formatted := format(result) // Span?
write(w, formatted) // Span?
}
// OPTIMAL: Span per meaningful operation
func handleRequest(w http.ResponseWriter, r *http.Request) {
ctx, span := tracer.Start(r.Context(), "handleRequest")
defer span.End()
ctx, dataSpan := tracer.Start(ctx, "fetch_data")
data := fetchData()
dataSpan.End()
result := process(data) // No span (internal logic)
ctx, formatSpan := tracer.Start(ctx, "format_response")
formatted := format(result)
formatSpan.End()
write(w, formatted)
}
// Rule: Create spans for external I/O and inter-service boundaries
// Skip: Internal function calls, CPU-bound logic
Span Events vs Child Spans
Events are lighter than child spans:
// Child span: Full overhead (~200ns, ~500 bytes memory)
_, span := tracer.Start(ctx, "processing_step")
span.SetAttributes(attribute.Int("count", 100))
span.End()
// Event: Lightweight (~20ns, ~100 bytes memory)
span.AddEvent("processing_step", trace.WithAttributes(
attribute.Int("count", 100),
))
// Use events for high-frequency milestones
// Use spans for distinct operations with context
// Example:
// ✓ Span for database query
// ✓ Event for validation step within request
// ✗ Span for each line of code
Link vs Parent-Child
Links are heavier than parent-child relationships:
// Parent-child: Lightweight (context propagation)
_, childSpan := tracer.Start(parentCtx, "child_operation")
// Link: Requires a separate span context reference (more overhead)
_, span := tracer.Start(context.Background(), "unrelated",
trace.WithLinks(trace.Link{
SpanContext: trace.SpanContextFromContext(parentCtx),
Attributes: []attribute.KeyValue{...},
}),
)
// Use links only when spans aren't directly hierarchical
// Example: Async response handling, batch processing
Metrics vs Tracing vs Logging: Cost Comparison
Cost analysis for different observability approaches:
Operation: Track every HTTP request duration
1. METRICS (Histogram)
- CPU per operation: 500ns
- Memory per metric: 88 bytes
- Bandwidth (scrape): 800 bytes per time series
Cost: Very low
2. STRUCTURED LOGS
- CPU per operation: 10-100μs (serialization)
- Memory per log: 200-500 bytes
- Bandwidth (export): ~500 bytes per log
- Disk I/O (if synchronous): 50-1000μs
Cost: Low-medium
3. SAMPLED TRACING (10% of requests)
- CPU per operation: 20-30μs (span creation)
- Memory per span: 500 bytes
- Bandwidth (export): 1KB per span
- Overhead: ~200MB/sec at 100k RPS * 10% (assuming ~20 spans per request)
Cost: Medium
4. FULL TRACING (100% of requests)
- CPU per operation: 20-30μs (span creation)
- Memory per span: 500 bytes
- Bandwidth (export): 1KB per span
- Overhead: ~2GB/sec at 100k RPS (assuming ~20 spans per request)
Cost: Very high
RECOMMENDATION:
- Use metrics (RED: Rate, Errors, Duration) as baseline
- Add sampled tracing (1-5%) for debugging
- Reserve full tracing for development/testing
RED Metrics as Low-Cost Alternative
RED methodology: Rate, Errors, Duration
var (
requestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
},
[]string{"method", "path", "status"},
)
requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
},
[]string{"method", "path"},
)
requestErrors = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_errors_total",
},
[]string{"method", "path", "error_type"},
)
)
// Per-request:
func handleRequest(w http.ResponseWriter, r *http.Request) {
start := time.Now()
defer func() {
// Cost: ~1μs total
duration := time.Since(start).Seconds()
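// Prefer a route template (e.g. "/users/{id}") over the raw r.URL.Path to keep
// the path label low-cardinality; statusCode, err, and errorType are assumed
// to have been captured by the handler above.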
requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
requestsTotal.WithLabelValues(r.Method, r.URL.Path, statusCode).Inc()
if err != nil {
requestErrors.WithLabelValues(r.Method, r.URL.Path, errorType).Inc()
}
}()
// ... handle request
}
// Can reconstruct most insights from RED metrics:
// - Latency distribution
// - Error rates
// - Throughput trends
// - Anomalies
// Cost: 100x lower than full tracing
Exemplars: Connecting Metrics to Traces
Exemplars link metrics to trace IDs, enabling selective drilling:
import (
"go.opentelemetry.io/otel/trace"
"github.com/prometheus/client_golang/prometheus"
)
func recordExemplar(span trace.Span, histogram prometheus.Histogram, value float64) {
// Get the trace ID from the span
traceID := span.SpanContext().TraceID().String()
// Attach it as an exemplar (not a label); Prometheus keeps one exemplar per bucket
if eo, ok := histogram.(prometheus.ExemplarObserver); ok {
eo.ObserveWithExemplar(value, prometheus.Labels{"trace_id": traceID})
return
}
histogram.Observe(value)
}
// Exemplars are only exposed when /metrics serves the OpenMetrics format
// In Grafana/Prometheus:
// When you see a histogram bucket with a high value,
// click its exemplar to jump directly to the corresponding trace
// Cost: Minimal (one exemplar per bucket per scrape)
// Benefit: Deterministic drill-down from metrics to traces
Go Runtime Metrics
runtime/metrics Package (Go 1.16+)
Zero-allocation metric reading:
import "runtime/metrics"
func readRuntimeMetrics() {
// Get all available metrics
descs := metrics.All()
// Read specific metrics
samples := make([]metrics.Sample, len(descs))
for i := range descs {
samples[i].Name = descs[i].Name
}
metrics.Read(samples)
for _, sample := range samples {
switch sample.Value.Kind() {
case metrics.KindUint64:
fmt.Printf("%s: %d\n", sample.Name, sample.Value.Uint64())
case metrics.KindFloat64:
fmt.Printf("%s: %f\n", sample.Name, sample.Value.Float64())
}
}
}
// CPU cost: ~100-500μs to read all metrics
// Memory cost: 0 allocations
// Best for periodic sampling (every 10-30 seconds)
// Available metrics (partial list):
// /gc/heap/allocs:bytes
// /gc/heap/frees:bytes
// /gc/heap/goal:bytes
// /memory/classes/heap/objects:bytes
// /memory/classes/heap/free:bytes
// /memory/classes/metadata/mspan/inuse:bytes
// /memory/classes/other:bytes
// /memory/classes/profiling/buckets:bytes
// /sync/mutex/wait/total:seconds
debug.ReadGCStats (Deprecated but Still Used)
Older alternative with higher cost:
import "runtime/debug"
func readOldGCStats() {
var stats debug.GCStats
debug.ReadGCStats(&stats)
// Returns:
stats.LastGC // Time of last GC
stats.NumGC // Number of GC runs
stats.PauseTotal // Total pause time
stats.Pause // Pause history, most recent first
stats.PauseEnd // Timestamps of pause ends, most recent first
stats.PauseQuantiles // Quantiles, filled only if the slice is pre-sized
}
// Overhead: Computes quantiles (~1000μs)
// Better to use the runtime/metrics package
runtime.ReadMemStats: The Hidden Stop-the-World
Calling ReadMemStats stops the world for the duration of the call:
import "runtime"
func badMetricsCollection() {
var m runtime.MemStats
runtime.ReadMemStats(&m) // STOP THE WORLD!
// Every goroutine is paused while the stats are collected
fmt.Printf("Alloc: %v\n", m.Alloc)
fmt.Printf("TotalAlloc: %v\n", m.TotalAlloc)
}
// Hidden cost: a global stop-the-world pause (it does not run a GC, but nothing
// else runs until the call returns)
// Impact: usually sub-millisecond, but it adds latency spikes if called
// per-request or on large, busy processes
// Safe alternative:
func goodMetricsCollection() {
// Use runtime/metrics
samples := []metrics.Sample{{Name: "/memory/classes/heap/objects:bytes"}} // ~MemStats.HeapAlloc
metrics.Read(samples)
// No stop-the-world pause; cost is a few microseconds
}
// Rule: Never call ReadMemStats in production hot paths
Sampling Runtime Metrics Safely
var (
lastMemStatsRead time.Time
lastMemStats runtime.MemStats
mu sync.RWMutex
)
func startMetricsSampler() {
ticker := time.NewTicker(30 * time.Second)
go func() {
for range ticker.C {
// Sample every 30 seconds (not per-request)
// This is acceptable: one brief stop-the-world pause every 30s
var m runtime.MemStats
runtime.ReadMemStats(&m)
mu.Lock()
lastMemStats = m
lastMemStatsRead = time.Now()
mu.Unlock()
// Record to metrics
memAllocBytes.Set(float64(m.Alloc))
// the most recent GC pause lives at (NumGC+255)%256 in the circular buffer
gcPauses.Observe(float64(m.PauseNs[(m.NumGC+255)%256]) / 1e9)
}
}()
}
func getMemStats() runtime.MemStats {
mu.RLock()
defer mu.RUnlock()
return lastMemStats
}
// Cost: one brief stop-the-world pause every 30 seconds (not per-request)
// Provides up-to-30s stale data (acceptable for dashboards)
Custom Metric Patterns
Hot-Path Friendly Counters
Atomic counters are lock-free:
import "sync/atomic"
// Atomic counter: no lock contention
var atomicCounter int64
func incrementAtomic() {
atomic.AddInt64(&atomicCounter, 1)
// Cost: ~1-2 nanoseconds
}
// Mutex counter: lock contention under high concurrency
var mu sync.Mutex
var mutexCounter int64
func incrementMutex() {
mu.Lock()
mutexCounter++
mu.Unlock()
// Cost: ~100-1000 nanoseconds (depends on contention)
}
// Sharded counter: balance between accuracy and performance
type ShardedCounter struct {
shards []*int64
}
func (c *ShardedCounter) Increment(shardHint int) {
// Go does not expose the current P/CPU, so callers pass a cheap shard hint
// (e.g. a worker index or a hash) to spread contention across shards
shard := shardHint % len(c.shards)
atomic.AddInt64(c.shards[shard], 1)
// Cost: ~2-5 nanoseconds, minimal contention
}
// At 100k RPS:
// Atomic: 100k * 1ns = 0.1ms CPU
// Mutex: 100k * 500ns = 50ms CPU (up to 500x the atomic cost!)
// Sharded: 100k * 3ns = 0.3ms CPU
// Recommendation:
// Use atomic.AddInt64 for single counter
// Use sharded counter for high-frequency, multi-core increments
Approximate Counting: HyperLogLog
For cardinality estimation without storing all values:
import "github.com/axiomhq/hyperloglog"
func countUniqueUsers(userIDs []string) (uint64, error) {
hll := hyperloglog.New()
for _, userID := range userIDs {
hll.Insert([]byte(userID))
// Cost per insert: ~10-50ns
// Memory: 12KB fixed size
}
// Accuracy: ±2% for 1M+ users
cardinality := hll.Estimate()
return cardinality, nil
}
// vs.
// Exact counting (map-based)
uniqueUsers := make(map[string]bool)
for _, userID := range userIDs {
uniqueUsers[userID] = true
// Cost per insert: ~100-500ns
// Memory: 1M users * 100 bytes = 100MB
}
cardinality := len(uniqueUsers)
// Trade-off: HyperLogLog is 10x faster with fixed memory
// Cost: 2% error tolerance
Histogram Alternatives: HDR Histogram
More efficient latency recording:
import "github.com/HdrHistogram/hdrhistogram-go"
// Standard Prometheus histogram
func recordWithPrometheus(latency float64) {
histogram.Observe(latency) // ~500ns per call
}
// HDR Histogram (e.g. hdrHistogram := hdrhistogram.New(1, 60_000_000_000, 3)
// covers 1ns-60s at 3 significant figures)
func recordWithHDR(latencyNanos int64) {
hdrHistogram.RecordValue(latencyNanos) // ~20-50ns per call
}
// Per-million requests:
// Prometheus: 1M * 500ns = 500ms CPU
// HDR: 1M * 30ns = 30ms CPU (16x faster)
// HDR limitations:
// - Only integers (nanoseconds)
// - Must reset periodically (or coordinate snapshots)
// - Less integration with Prometheus ecosystem
Ring Buffer for Recent Events
Capture recent events without unbounded allocation:
type EventRingBuffer struct {
buffer []*Event
writeIdx int
mu sync.RWMutex
}
func NewEventRingBuffer(size int) *EventRingBuffer {
return &EventRingBuffer{buffer: make([]*Event, size)}
}
func (rb *EventRingBuffer) Record(event *Event) {
rb.mu.Lock()
rb.buffer[rb.writeIdx] = event
rb.writeIdx = (rb.writeIdx + 1) % len(rb.buffer)
rb.mu.Unlock()
// Cost: ~100-200ns per event
}
func (rb *EventRingBuffer) GetRecent() []*Event {
rb.mu.RLock()
defer rb.mu.RUnlock()
return append([]*Event{}, rb.buffer...)
}
// Memory cost: Fixed (e.g., 1000 events * 200 bytes = 200KB)
// Use case: Keep last 1000 errors for debugging
Production Patterns
Graceful Degradation Under Load
Disable observability when system is under stress:
import "runtime"
type LoadAwareObservability struct {
tracingEnabled bool
loggingLevel int
mu sync.RWMutex
}
func (lao *LoadAwareObservability) updateLoadStatus() {
ticker := time.NewTicker(1 * time.Second)
go func() {
for range ticker.C {
var m runtime.MemStats
runtime.ReadMemStats(&m)
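// calculateGCRate and readProcCPU are assumed helpers (e.g. GC rate from NumGC
// deltas in m, CPU from /proc/stat); sampling once per second keeps the
// ReadMemStats pause off the request path.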
gcRate := calculateGCRate()
cpuUsage := readProcCPU()
shouldDisableTracing := gcRate > 100 || cpuUsage > 90
lao.mu.Lock()
lao.tracingEnabled = !shouldDisableTracing
lao.mu.Unlock()
}
}()
}
func (lao *LoadAwareObservability) recordSpan(ctx context.Context, fn func(context.Context)) {
lao.mu.RLock()
enabled := lao.tracingEnabled
lao.mu.RUnlock()
if !enabled {
fn(ctx)
return
}
ctx, span := tracer.Start(ctx, "operation")
defer span.End()
fn(ctx)
}
// Benefit: Prevent observability from causing cascading failures
// Cost: Loss of visibility during incidents (trade-off)
Metric Aggregation on Client Side
Pre-aggregate before export:
// Before: Export individual requests
// 100k RPS * 1KB per request = 100MB/sec
// After: Aggregate and export summaries
type MetricSummary struct {
Count int64
Sum float64
Min float64
Max float64
P50, P99 float64
Errors int64
}
func aggregateMetrics(interval time.Duration) {
ticker := time.NewTicker(interval)
go func() {
for range ticker.C {
// Collect metrics from current window
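// requestCount, errorCount, and totalDuration are assumed package-level atomics
// updated per request; atomicGetAndReset and the quantile-capable histogram are
// placeholders (Prometheus client histograms expose no Quantile(), so an
// HDR-style histogram would be used here).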
summary := MetricSummary{
Count: atomic.SwapInt64(&requestCount, 0),
Sum: atomicGetAndReset(&totalDuration),
P50: histogram.Quantile(0.50),
P99: histogram.Quantile(0.99),
Errors: atomic.SwapInt64(&errorCount, 0),
}
// Export summary (~500 bytes)
exportMetrics(summary)
}
}()
}
// Cost reduction: 100MB/sec → 500KB/sec (200x)
Export Optimization: Delta vs Cumulative
Different temporality has different export costs:
CUMULATIVE (default):
- Export full value every scrape
- Prometheus server handles delta calculation
- Producer stays stateless; a dropped scrape loses no information
DELTA:
- Export only change since last export
- Lower transmission overhead
- Server must maintain state for aggregation
Example:
Counter total: 1,000,000
CUMULATIVE export: {name: counter_total, value: 1000000}
DELTA export: {name: counter_total, value: +1000} (if 1000 increments since last export)
DELTA is 10x smaller if increments are much smaller than total
Sidecar vs In-Process Collectors
Trade-offs:
IN-PROCESS COLLECTOR:
- App exposes /metrics (or pushes) directly from the process
- Cost: Memory in app process; export and scrape work share the app's CPU
- Benefit: No separate infrastructure
- Example: prometheus/client_golang serving /metrics, or pushing to a Pushgateway
SIDECAR COLLECTOR:
- App exports to local sidecar
- Sidecar handles aggregation and export
- Cost: Network latency to sidecar
- Benefit: App doesn't manage collector
Complete Observability Optimization Example
package main
import (
"context"
"runtime"
"sync/atomic"
"time"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/sdk/trace"
"github.com/prometheus/client_golang/prometheus"
)
type OptimizedObservability struct {
// Metrics: Low cost
requestTotal prometheus.Counter
requestErrors prometheus.Counter
requestLatency prometheus.Histogram
// Tracing: With adaptive sampling
tracerProvider *trace.TracerProvider
sampler trace.Sampler
// Load awareness
cpuUsagePercent int32
gcRateHz int32
}
func newOptimizedObservability() *OptimizedObservability {
// Keep histogram buckets modest (the defaults are usually enough)
buckets := prometheus.DefBuckets
return &OptimizedObservability{
requestTotal: prometheus.NewCounter(prometheus.CounterOpts{
Name: "http_requests_total",
}),
requestErrors: prometheus.NewCounter(prometheus.CounterOpts{
Name: "http_requests_errors_total",
}),
requestLatency: prometheus.NewHistogram(prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Buckets: buckets,
}),
tracerProvider: trace.NewTracerProvider(
// the sampler must be passed to the provider to take effect
trace.WithSampler(trace.ParentBased(trace.TraceIDRatioBased(0.01))),
trace.WithSpanProcessor(
trace.NewBatchSpanProcessor(&noopExporter{}),
),
),
sampler: trace.ParentBased(trace.TraceIDRatioBased(0.01)),
}
}
func (o *OptimizedObservability) recordRequest(ctx context.Context, duration time.Duration, err error) {
// Metrics: Always recorded (cost: ~1μs)
o.requestLatency.Observe(duration.Seconds())
o.requestTotal.Inc()
if err != nil {
o.requestErrors.Inc()
}
// Tracing: Only if sampling enabled and load is low
if atomic.LoadInt32(&o.cpuUsagePercent) < 80 {
_, span := o.tracerProvider.Tracer("").Start(ctx, "request")
span.SetAttributes(attribute.Int64("duration_ms", duration.Milliseconds()))
if err != nil {
span.SetAttributes(attribute.String("error", err.Error()))
}
span.End()
}
}
func (o *OptimizedObservability) monitorSystemLoad() {
ticker := time.NewTicker(5 * time.Second)
go func() {
for range ticker.C {
var m runtime.MemStats
runtime.ReadMemStats(&m)
// CPU usage estimation
cpuUsage := estimateCPUUsage()
atomic.StoreInt32(&o.cpuUsagePercent, int32(cpuUsage))
// GC rate
gcRate := calculateGCRate()
atomic.StoreInt32(&o.gcRateHz, int32(gcRate))
}
}()
}
// Cost breakdown at 100k RPS:
// Metrics (always): 100k * 1μs = 100ms CPU/sec = 1%
// Tracing (sampled 1%): 1k * 25μs = 25ms CPU/sec = 0.25%
// Total observability overhead: ~1.3% (acceptable)
Monitoring Observability Overhead
func benchmarkObservabilityOverhead() {
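// benchmarkRequest and observability are assumed helpers: run a fixed request
// workload with instrumentation disabled vs. enabled and return mean latency.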
// Baseline: application without observability
baselineLatency := benchmarkRequest(nil)
// With observability enabled
withObservability := benchmarkRequest(observability)
overhead := (withObservability - baselineLatency) / baselineLatency
fmt.Printf("Observability overhead: %.1f%%\n", overhead*100)
// Target: <5% overhead
// Acceptable: 1-3% overhead
// Warning: >10% overhead indicates over-instrumentation
}
// Typical results:
// - Basic metrics: 1-2% overhead
// - Metrics + 1% tracing: 2-3% overhead
// - Metrics + 10% tracing: 5-8% overhead
// - Full tracing: 20-50% overhead (avoid in production)
Observability overhead is real and measurable. Use metrics as your primary observability signal, add strategic sampling-based distributed tracing for debugging, and carefully manage the cardinality of dimensions. With proper configuration, observability overhead can be reduced to 1-3%, providing invaluable insights with minimal performance impact.