Resilient Connection Handling
Master resilience patterns in Go to prevent cascading failures and maximize throughput under adverse conditions
Resilient connection handling is the cornerstone of high-performance distributed systems. When a downstream service degrades, a naive client that continues hammering it with requests destroys both systems. This guide covers proven patterns to protect your Go services: circuit breakers, backpressure, load shedding, bulkheads, and intelligent retries.
Why Resilience is a Performance Concern
Performance optimization often focuses on throughput and latency. But in distributed systems, resilience directly impacts performance. Here's why:
Cascading Failures
When a downstream service becomes unhealthy, clients that keep sending requests waste resources:
- Threads/goroutines pile up waiting for timeouts, consuming memory
- Connection pools exhaust as connections hang, starving other requests
- Upstream services degrade because their clients become slow or unresponsive
- Latency spikes propagate backward through the call chain
A single unhealthy service can collapse an entire cluster.
The Cost of Naive Retry
Retrying without circuit breaking creates a retry storm:
Client 1: GET /api → 500ms timeout → retry → timeout → retry...
Client 2: GET /api → 500ms timeout → retry → timeout → retry...
Client 3: ...

In seconds, a degraded service receives 10x normal traffic, accelerating its death.
Performance Under Failure
Without resilience patterns:
- Throughput under failure: 50 req/sec (most blocked)
- Latency under failure: 5-30 seconds (timeout waiting)
- Recovery time: 5+ minutes (clients have no fast way to detect the service is healthy again)
With proper resilience:
- Throughput under failure: 1000 req/sec (shedding load gracefully)
- Latency under failure: under 100ms (fast-fail)
- Recovery time: under 10 seconds (auto-detect health)
Circuit Breaker Pattern
The circuit breaker prevents cascading failures by stopping request flow when a service is unhealthy.
States and Transitions
A circuit breaker has three states:
- Closed: Normal operation, requests pass through
- Open: Service is failing, requests are rejected immediately
- Half-Open: Testing if service recovered, allowing trial requests
Closed --[failure threshold exceeded]--> Open
   ^                                       |
   | [trial request succeeds]              | [timeout elapsed]
   |                                       v
   +---------------------------------- Half-Open
                                 (allows trial requests)

Using sony/gobreaker
The github.com/sony/gobreaker library provides a robust circuit breaker implementation.
package main
import (
"fmt"
"io"
"net/http"
"time"
"github.com/sony/gobreaker"
)
func main() {
// Create a circuit breaker with custom settings
cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
Name: "API",
MaxRequests: 3, // Allow 3 requests in half-open state
Interval: time.Second * 10, // Clear failure counts every 10 seconds while closed
Timeout: time.Second * 30, // Stay open for 30 seconds before half-open
ReadyToTrip: readyToTrip, // Custom trip logic
OnStateChange: onStateChange, // Observe state changes
})
// Use the circuit breaker
result, err := cb.Execute(func() (interface{}, error) {
return callDownstreamAPI()
})
if err != nil {
handleError(err) // Could be gobreaker.ErrOpenState
}
fmt.Println(result)
}
// Custom logic: trip after 50% failures in last 100 requests
func readyToTrip(counts gobreaker.Counts) bool {
failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
return counts.Requests >= 100 && failureRatio >= 0.5
}
// Monitor state changes for observability
func onStateChange(name string, from gobreaker.State, to gobreaker.State) {
fmt.Printf("Circuit breaker %s: %s -> %s\n", name, from, to)
}
func callDownstreamAPI() (interface{}, error) {
resp, err := http.Get("http://downstream/api")
if err != nil {
return nil, err
}
defer resp.Body.Close()
if resp.StatusCode >= 500 {
return nil, fmt.Errorf("server error: %d", resp.StatusCode)
}
// Read before the deferred Close runs; returning resp.Body directly
// would hand the caller an already-closed reader
body, err := io.ReadAll(resp.Body)
if err != nil {
return nil, err
}
return string(body), nil
}
func handleError(err error) {
if err == gobreaker.ErrOpenState {
fmt.Println("Circuit is open, rejecting request")
} else {
fmt.Println("Request failed:", err)
}
}

Circuit Breaker Configuration
Key settings for your use case:
- MaxRequests: How many requests to allow in half-open state (3-10 typical)
- Interval: Reset counter this often (10-30 seconds)
- Timeout: How long to stay open before trying again (10-60 seconds)
- ReadyToTrip: Customize failure detection (error rate, error count, latency)
Tip: Set Timeout much longer than Interval. If a service is failing, give it time to recover before testing.
Custom Circuit Breaker Implementation
For educational purposes or special requirements, implement a circuit breaker from scratch:
package resilience
import (
"fmt"
"sync"
"time"
)
type State int
const (
StateClosed State = iota
StateOpen
StateHalfOpen
)
type CircuitBreaker struct {
mu sync.RWMutex
state State
lastFailTime time.Time
failureCount int64
successCount int64
maxFailures int64
resetTimeout time.Duration
lastStateChange time.Time
}
func NewCircuitBreaker(maxFailures int64, timeout time.Duration) *CircuitBreaker {
return &CircuitBreaker{
state: StateClosed,
maxFailures: maxFailures,
resetTimeout: timeout,
}
}
func (cb *CircuitBreaker) Call(fn func() error) error {
cb.mu.Lock()
defer cb.mu.Unlock() // Note: holding the lock while fn runs serializes all calls; acceptable for a teaching example
// Check if we should transition to half-open
if cb.state == StateOpen {
if time.Since(cb.lastStateChange) > cb.resetTimeout {
cb.setState(StateHalfOpen)
} else {
return fmt.Errorf("circuit breaker open")
}
}
// Execute the function
err := fn()
if err != nil {
cb.failureCount++
cb.lastFailTime = time.Now()
// Transition to open if thresholds exceeded
if cb.state == StateClosed && cb.failureCount >= cb.maxFailures {
cb.setState(StateOpen)
} else if cb.state == StateHalfOpen {
// One failure while half-open, back to open
cb.setState(StateOpen)
}
return err
}
// Success
cb.successCount++
if cb.state == StateHalfOpen {
// Healthy, close the circuit
cb.setState(StateClosed)
cb.failureCount = 0
cb.successCount = 0
}
return nil
}
func (cb *CircuitBreaker) setState(newState State) {
if cb.state != newState {
fmt.Printf("CB transition: %v -> %v\n", cb.state, newState)
cb.state = newState
cb.lastStateChange = time.Now()
}
}
func (cb *CircuitBreaker) State() State {
cb.mu.RLock()
defer cb.mu.RUnlock()
return cb.state
}

Usage:
cb := NewCircuitBreaker(5, 30*time.Second)
err := cb.Call(func() error {
return callAPI()
})

Backpressure and Rate Limiting
Backpressure protects downstream services by limiting how much work we send them.
golang.org/x/time/rate Limiter
The standard Go rate limiter uses the token bucket algorithm:
package main
import (
"context"
"fmt"
"time"
"golang.org/x/time/rate"
)
func main() {
// Create a limiter: 100 requests per second, burst of 10
limiter := rate.NewLimiter(100, 10)
// Option 1: Blocking (wait until we can take a token)
for i := 0; i < 50; i++ {
if err := limiter.Wait(context.Background()); err != nil {
fmt.Println("Context cancelled:", err)
break
}
go handleRequest(i)
}
// Option 2: Non-blocking (check if token available)
if limiter.Allow() {
fmt.Println("Request allowed")
} else {
fmt.Println("Rate limit exceeded, reject")
}
// Option 3: Reserve tokens for batch operations
reservation := limiter.ReserveN(time.Now(), 5)
if !reservation.OK() {
fmt.Println("Cannot reserve 5 tokens")
return
}
time.Sleep(reservation.Delay()) // Wait if needed
fmt.Println("Processing batch of 5 requests")
}
func handleRequest(id int) {
fmt.Printf("Request %d processed\n", id)
}

Custom Backpressure with Semaphore
Control concurrent work with a weighted semaphore:
package main
import (
"context"
"fmt"
"golang.org/x/sync/semaphore"
"sync"
)
func main() {
// Semaphore limits concurrent requests to downstream service
sem := semaphore.NewWeighted(int64(100))
var wg sync.WaitGroup
for i := 0; i < 1000; i++ {
wg.Add(1)
go func(id int) {
defer wg.Done()
// Acquire a permit
if err := sem.Acquire(context.Background(), 1); err != nil {
fmt.Println("Failed to acquire permit")
return
}
defer sem.Release(1)
// Process the request
fmt.Printf("Processing request %d\n", id)
}(i)
}
wg.Wait()
}

Tip: Semaphores are ideal for connection pool management. Set the weight equal to your connection pool size.
Load Shedding
Under extreme load, shed requests gracefully instead of queuing them indefinitely.
package main
import (
"fmt"
"net/http"
"sync/atomic"
"time"
)
type LoadShedder struct {
maxConcurrent int64
current int64
rejectedCount int64
}
func NewLoadShedder(maxConcurrent int64) *LoadShedder {
return &LoadShedder{
maxConcurrent: maxConcurrent,
}
}
// Middleware that sheds load
func (ls *LoadShedder) Middleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
current := atomic.AddInt64(&ls.current, 1)
defer atomic.AddInt64(&ls.current, -1)
if current > ls.maxConcurrent {
atomic.AddInt64(&ls.rejectedCount, 1)
w.Header().Set("Retry-After", "5")
http.Error(w, "Service overloaded", http.StatusServiceUnavailable)
return
}
next.ServeHTTP(w, r)
})
}
func (ls *LoadShedder) Stats() (int64, int64) {
return atomic.LoadInt64(&ls.current), atomic.LoadInt64(&ls.rejectedCount)
}
func main() {
shedder := NewLoadShedder(100)
mux := http.NewServeMux()
mux.HandleFunc("/api", func(w http.ResponseWriter, r *http.Request) {
time.Sleep(100 * time.Millisecond)
w.WriteHeader(http.StatusOK)
})
http.ListenAndServe(":8080", shedder.Middleware(mux))
}

Key aspects:
- Track concurrent requests atomically
- Return 503 Service Unavailable when limit exceeded
- Include Retry-After header to tell clients when to retry
- Fast rejection wastes no resources, unlike making callers wait out a slow timeout
Retry with Exponential Backoff and Jitter
Simple retries amplify failure. Exponential backoff with jitter prevents retry storms.
Implementation from Scratch
package main
import (
"fmt"
"math"
"math/rand"
"time"
)
type RetryConfig struct {
MaxAttempts int
InitialWait time.Duration
MaxWait time.Duration
Multiplier float64
}
func DefaultRetryConfig() RetryConfig {
return RetryConfig{
MaxAttempts: 5,
InitialWait: 100 * time.Millisecond,
MaxWait: 10 * time.Second,
Multiplier: 2.0,
}
}
func RetryWithBackoff(cfg RetryConfig, fn func() error) error {
var lastErr error
for attempt := 0; attempt < cfg.MaxAttempts; attempt++ {
err := fn()
if err == nil {
return nil // Success
}
lastErr = err
if attempt == cfg.MaxAttempts-1 {
break // Last attempt
}
// Calculate backoff with jitter
wait := time.Duration(math.Min(
float64(cfg.MaxWait),
float64(cfg.InitialWait)*math.Pow(cfg.Multiplier, float64(attempt)),
))
// Add random jitter: up to 50% of the base wait
jitter := time.Duration(rand.Int63n(int64(wait / 2)))
totalWait := wait + jitter
fmt.Printf("Attempt %d failed, retrying after %v\n", attempt+1, totalWait)
time.Sleep(totalWait)
}
return fmt.Errorf("all %d attempts failed: %w", cfg.MaxAttempts, lastErr)
}
func main() {
cfg := DefaultRetryConfig()
cfg.MaxAttempts = 4
err := RetryWithBackoff(cfg, func() error {
// Simulate API call
return fmt.Errorf("temporary failure")
})
fmt.Println("Final result:", err)
}

Using cenkalti/backoff
For production, use a well-tested library:
package main
import (
"fmt"
"time"
"github.com/cenkalti/backoff/v4"
)
func main() {
exp := backoff.NewExponentialBackOff()
exp.InitialInterval = 100 * time.Millisecond
exp.MaxInterval = 10 * time.Second
exp.MaxElapsedTime = 1 * time.Minute
// Retry with timeout
err := backoff.RetryNotify(
func() error {
return callAPI()
},
exp,
func(err error, d time.Duration) {
fmt.Printf("Retry after %v due to: %v\n", d, err)
},
)
if err != nil {
fmt.Println("Failed after retries:", err)
}
}
func callAPI() error {
// Your API call
return nil
}

Critical: Always include jitter. Without it, synchronized retries from multiple clients create thundering herds that re-crash the service.
Bulkhead Pattern
Isolate resources per-service to prevent one failing service from starving others.
package main
import (
"fmt"
"golang.org/x/sync/semaphore"
"context"
"sync"
"time"
)
type BulkheadPool struct {
services map[string]*semaphore.Weighted
mu sync.RWMutex
}
func NewBulkheadPool() *BulkheadPool {
return &BulkheadPool{
services: make(map[string]*semaphore.Weighted),
}
}
// Register a service with its resource limit
func (bp *BulkheadPool) Register(serviceName string, maxConcurrent int64) {
bp.mu.Lock()
defer bp.mu.Unlock()
bp.services[serviceName] = semaphore.NewWeighted(maxConcurrent)
}
// Execute a call with bulkhead isolation
func (bp *BulkheadPool) Execute(ctx context.Context, serviceName string, fn func() error) error {
bp.mu.RLock()
sem, ok := bp.services[serviceName]
bp.mu.RUnlock()
if !ok {
return fmt.Errorf("unknown service: %s", serviceName)
}
// Acquire a slot
if err := sem.Acquire(ctx, 1); err != nil {
return fmt.Errorf("bulkhead rejected request: %w", err)
}
defer sem.Release(1)
return fn()
}
func main() {
pool := NewBulkheadPool()
pool.Register("user-service", 50)
pool.Register("order-service", 100)
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
// These services won't starve each other
err := pool.Execute(ctx, "user-service", func() error {
fmt.Println("Calling user-service")
return nil
})
fmt.Println("Result:", err)
}

Advantages:
- Service A failing doesn't consume all resources
- Service B continues with its allocated pool
- Fair resource distribution across multiple services
Timeout Cascades
Context propagation prevents timeout leaks through your call chain.
Correct Timeout Propagation
package main
import (
"context"
"fmt"
"net/http"
"time"
)
// Handler receives a 5-second timeout
func handleRequest(ctx context.Context, r *http.Request) error {
// Create derived context with shorter timeout for each downstream call
// Reserve time for work at this level
deadline, ok := ctx.Deadline()
if !ok {
return fmt.Errorf("no deadline")
}
remaining := time.Until(deadline)
fmt.Printf("Remaining time: %v\n", remaining)
// Allocate time budget: 1s for this level, rest for downstream
thisLevelBudget := 1 * time.Second
downstreamBudget := remaining - thisLevelBudget
if downstreamBudget <= 0 {
return fmt.Errorf("insufficient time remaining")
}
// Create context for downstream call with proper deadline
downstreamCtx, cancel := context.WithTimeout(ctx, downstreamBudget)
defer cancel()
return callDownstream(downstreamCtx)
}
func callDownstream(ctx context.Context) error {
// This context has proper timeout inherited
select {
case <-time.After(100 * time.Millisecond):
fmt.Println("Downstream completed")
return nil
case <-ctx.Done():
return ctx.Err()
}
}
func main() {
// Entry point: 5-second timeout
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
err := handleRequest(ctx, nil)
fmt.Println("Error:", err)
}

Common Mistakes
// WRONG: Each timeout starts fresh (total could be 5 + 5 + 5 = 15s)
ctx1, _ := context.WithTimeout(context.Background(), 5*time.Second)
callA(ctx1)
ctx2, _ := context.WithTimeout(context.Background(), 5*time.Second)
callB(ctx2)
// CORRECT: Derived contexts share the parent deadline
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
ctxA, cancelA := context.WithTimeout(ctx, 2*time.Second) // Can't exceed 5s
defer cancelA()
callA(ctxA)
ctxB, cancelB := context.WithTimeout(ctx, 2*time.Second) // Still bounded by 5s
defer cancelB()
callB(ctxB)

Health Checks and Readiness Probes
Graceful degradation requires knowing when to shed load.
package main
import (
"context"
"fmt"
"net/http"
"sync"
"sync/atomic"
"time"
)
type HealthChecker struct {
mu sync.RWMutex
healthy bool
lastCheckTime time.Time
failureCount int64
consecutiveFail int64
checkInterval time.Duration
}
func NewHealthChecker(interval time.Duration) *HealthChecker {
hc := &HealthChecker{
healthy: true,
checkInterval: interval,
}
go hc.periodicCheck()
return hc
}
func (hc *HealthChecker) periodicCheck() {
ticker := time.NewTicker(hc.checkInterval)
defer ticker.Stop()
for range ticker.C {
if hc.checkDownstream() {
atomic.StoreInt64(&hc.consecutiveFail, 0)
hc.setHealthy(true)
} else {
count := atomic.AddInt64(&hc.consecutiveFail, 1)
atomic.AddInt64(&hc.failureCount, 1)
// Mark unhealthy after 3 consecutive failures
if count >= 3 {
hc.setHealthy(false)
}
}
}
}
func (hc *HealthChecker) checkDownstream() bool {
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()
req, _ := http.NewRequestWithContext(ctx, "GET", "http://downstream/health", nil)
resp, err := http.DefaultClient.Do(req)
if err != nil {
return false
}
defer resp.Body.Close()
return resp.StatusCode == http.StatusOK
}
func (hc *HealthChecker) setHealthy(healthy bool) {
hc.mu.Lock()
defer hc.mu.Unlock()
hc.healthy = healthy
}
func (hc *HealthChecker) IsHealthy() bool {
hc.mu.RLock()
defer hc.mu.RUnlock()
return hc.healthy
}
// Middleware to use health status
func (hc *HealthChecker) ReadinessMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if !hc.IsHealthy() {
http.Error(w, "Service unhealthy", http.StatusServiceUnavailable)
return
}
next.ServeHTTP(w, r)
})
}

Combining Patterns: The Resilience Stack
Use multiple patterns together for maximum robustness:
package main
import (
"context"
"fmt"
"github.com/sony/gobreaker"
"golang.org/x/time/rate"
"net/http"
"time"
)
type ResilientClient struct {
breaker *gobreaker.CircuitBreaker
limiter *rate.Limiter
timeout time.Duration
retries int
}
func NewResilientClient() *ResilientClient {
cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
Name: "HTTP",
MaxRequests: 3,
Interval: 10 * time.Second,
Timeout: 30 * time.Second,
ReadyToTrip: func(counts gobreaker.Counts) bool {
ratio := float64(counts.TotalFailures) / float64(counts.Requests)
return counts.Requests >= 10 && ratio >= 0.5
},
})
limiter := rate.NewLimiter(100, 10) // 100 req/s, burst 10
return &ResilientClient{
breaker: cb,
limiter: limiter,
timeout: 5 * time.Second,
retries: 3,
}
}
func (rc *ResilientClient) Do(ctx context.Context, req *http.Request) (*http.Response, error) {
// 1. Rate limiting (backpressure)
if err := rc.limiter.Wait(ctx); err != nil {
return nil, fmt.Errorf("rate limited: %w", err)
}
// 2. Circuit breaker + retry
var lastErr error
for attempt := 0; attempt < rc.retries; attempt++ {
result, err := rc.breaker.Execute(func() (interface{}, error) {
// 3. Timeout (cascading via context)
callCtx, cancel := context.WithTimeout(ctx, rc.timeout)
defer cancel()
return http.DefaultClient.Do(req.WithContext(callCtx))
})
if err == nil {
return result.(*http.Response), nil
}
lastErr = err
// Don't retry if circuit is open
if err == gobreaker.ErrOpenState {
break
}
// Exponential backoff
backoff := time.Duration(1<<uint(attempt)) * 100 * time.Millisecond
select {
case <-time.After(backoff):
case <-ctx.Done():
return nil, ctx.Err()
}
}
return nil, fmt.Errorf("all retries exhausted: %w", lastErr)
}
func main() {
client := NewResilientClient()
req, _ := http.NewRequest("GET", "http://api.example.com/data", nil)
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
resp, err := client.Do(ctx, req)
if err != nil {
fmt.Println("Request failed:", err)
} else {
fmt.Println("Success:", resp.StatusCode)
resp.Body.Close()
}
}

This stack:
- Rate limiter protects downstream (backpressure)
- Circuit breaker fails fast when service is down
- Retries with backoff handle transient failures
- Timeout prevents hanging connections
- Context propagation ensures timeouts cascade properly
Benchmarks: Circuit Breaker Impact
Here's a realistic benchmark showing throughput under failure:
package main
import (
"fmt"
"github.com/sony/gobreaker"
"sync"
"sync/atomic"
"testing"
"time"
)
// Simulated failure: service returns error for 10 seconds
func failingService() error {
return fmt.Errorf("service error")
}
// With circuit breaker
func benchmarkWithCircuitBreaker(b *testing.B) {
cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
Name: "test",
MaxRequests: 1,
Interval: time.Second,
Timeout: 5 * time.Second,
ReadyToTrip: func(counts gobreaker.Counts) bool {
return counts.TotalFailures >= 3
},
})
successCount := int64(0)
failureCount := int64(0)
b.ReportAllocs()
b.ResetTimer()
var wg sync.WaitGroup
for i := 0; i < 100; i++ {
wg.Add(1)
go func() {
defer wg.Done()
for j := 0; j < b.N/100; j++ {
_, err := cb.Execute(func() (interface{}, error) {
return nil, failingService()
})
if err == gobreaker.ErrOpenState {
atomic.AddInt64(&failureCount, 1)
} else if err == nil {
atomic.AddInt64(&successCount, 1)
} else {
atomic.AddInt64(&failureCount, 1)
}
}
}()
}
wg.Wait()
b.StopTimer()
fmt.Printf("CB: Success=%d, FastFail=%d\n",
atomic.LoadInt64(&successCount),
atomic.LoadInt64(&failureCount))
}
// Without circuit breaker (naive retry)
func benchmarkWithoutCircuitBreaker(b *testing.B) {
successCount := int64(0)
failureCount := int64(0)
b.ReportAllocs()
b.ResetTimer()
var wg sync.WaitGroup
for i := 0; i < 100; i++ {
wg.Add(1)
go func() {
defer wg.Done()
for j := 0; j < b.N/100; j++ {
if err := failingService(); err != nil {
atomic.AddInt64(&failureCount, 1)
} else {
atomic.AddInt64(&successCount, 1)
}
}
}()
}
wg.Wait()
b.StopTimer()
fmt.Printf("No CB: Success=%d, Failures=%d\n",
atomic.LoadInt64(&successCount),
atomic.LoadInt64(&failureCount))
}
func main() {
// Simulate multiple concurrent clients
fmt.Println("=== With Circuit Breaker ===")
fmt.Println("All requests fail fast after threshold (rejected by CB)")
fmt.Println("\n=== Without Circuit Breaker ===")
fmt.Println("All requests attempt the call, wasting resources")
}

Expected Results
With Circuit Breaker (under 5-second service outage):
- Requests during closed: ~1000 attempts (fail fast)
- Requests during open: ~900 rejected (0ms latency)
- Total throughput: ~1900 requests/sec
Without Circuit Breaker (naive approach):
- All ~1900 requests attempt call
- Each waits for the full timeout (~100ms in this scenario)
- Effective throughput: ~10 requests/sec
The net result: roughly a 19x throughput improvement with the circuit breaker.
Best Practices Summary
- Always use circuit breakers for downstream service calls
- Combine with retries: CB prevents cascading, retries handle transient failures
- Add rate limiting as backpressure on yourself
- Shed load (return 503) under extreme load, don't queue indefinitely
- Use bulkheads for isolation when calling multiple services
- Propagate contexts with proper timeout budgets
- Monitor health probes and gracefully degrade when dependencies fail
- Add jitter to all retry intervals
- Set reasonable timeouts (~5-30 seconds) per service SLA
- Test failure scenarios with chaos engineering
The combination of these patterns transforms your Go services from fragile to resilient, maintaining performance even when dependencies fail.