Real-World Performance Case Studies
Practical performance optimization walkthroughs — from profiling to production, covering API gateways, data pipelines, CLI tools, and high-throughput services.
The Optimization Workflow
Performance optimization follows a scientific method: measure, hypothesize, experiment, verify. Skipping measurement is the most common mistake, leading to wasted effort on irrelevant optimizations.
The Process
1. Measure First
   - Establish baseline metrics: throughput, latency, memory usage, CPU
   - Use benchmarks, production profiling, and load tests
   - Identify where time/resources are actually spent
2. Identify the Bottleneck
   - CPU-bound: inlining failures, excessive allocations, hot loops
   - Memory-bound: GC pressure, large allocations, cache misses
   - I/O-bound: system calls, network roundtrips, disk seeks
   - Contention: lock contention, channel blocking, goroutine starvation
3. Form a Hypothesis
   - "JSON serialization takes 40% of request time"
   - "Memory allocations trigger GC every 5ms"
   - "Database roundtrips are the bottleneck"
4. Apply Targeted Optimization
   - Change only one thing at a time
   - Keep the hypothesis focused (easy to debug if wrong)
5. Measure Again
   - Verify the hypothesis was correct
   - Quantify the improvement
   - Check for side effects (e.g., latency increase with higher throughput)
6. Repeat
   - Focus on the next-biggest bottleneck
   - Diminishing returns: expect 5-10% improvement per iteration
Amdahl's Law
If a bottleneck consumes 40% of execution time, eliminating it entirely yields at best a 1.67x speedup (100% / (60% + 40%/∞)). Optimize the biggest bottleneck first.
If you make operation A 10x faster:
- A takes 40% of time: overall speedup = ~1.56x
- A takes 10% of time: overall speedup = ~1.1x
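These numbers fall straight out of Amdahl's formula. A tiny calculator to check them (the `speedup` helper is illustrative, not from the original text):

```go
// Amdahl's law: overall speedup when a fraction p of runtime
// is made s times faster.
package main

import "fmt"

// speedup returns 1 / ((1 - p) + p/s), where p is the fraction of total
// time spent in the optimized part and s is its local speedup.
func speedup(p, s float64) float64 {
	return 1 / ((1 - p) + p/s)
}

func main() {
	fmt.Printf("40%% of time, 10x faster: %.2fx overall\n", speedup(0.40, 10)) // ~1.56x
	fmt.Printf("10%% of time, 10x faster: %.2fx overall\n", speedup(0.10, 10)) // ~1.10x
}
```

Pushing s to infinity gives the ceiling: with p = 0.4 the best possible overall speedup is 1/(1-0.4) ≈ 1.67x, no matter how fast the optimized part gets.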
Lesson: Find the biggest bottleneck. Small optimizations on cold paths waste effort.
Case Study 1: REST API Gateway (50k req/s, p99 Latency Spikes)
Scenario
A microservices API gateway handling 50k requests per second shows p99 latency spikes from 8ms to 45ms. The service uses standard Go HTTP with encoding/json for request/response serialization.
Diagnosis: Profiling
# Capture CPU profile
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof
go tool pprof cpu.prof
# Interactive analysis
(pprof) top
Showing nodes accounting for 4200ms, 87.5% of 4800ms total
flat flat% sum% cum cum%
1200ms 25.0% 25.0% 2100ms 43.8% encoding/json.(*encodeState).string
850ms 17.7% 42.7% 2400ms 50.0% encoding/json.(*encodeState).value
620ms 12.9% 55.6% 620ms 12.9% runtime.mallocgc
...
Profile reveals:
- 43.8% CPU in JSON serialization
- 12.9% in allocations (GC pressure)
- 15% in HTTP handler chain
Optimizations Applied
1. Switch to sonic (Fast JSON Serializer)
sonic uses JIT compilation and SIMD for 3-5x JSON throughput:
// Before: encoding/json
import "encoding/json"
type Response struct {
ID int `json:"id"`
Name string `json:"name"`
Score float64 `json:"score"`
}
func handleRequest(w http.ResponseWriter, r *http.Request) {
resp := Response{ID: 1, Name: "Alice", Score: 98.5}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(resp) // ~2.5µs per response
}
// After: sonic (or easyjson)
import "github.com/bytedance/sonic"
func handleRequest(w http.ResponseWriter, r *http.Request) {
resp := Response{ID: 1, Name: "Alice", Score: 98.5}
w.Header().Set("Content-Type", "application/json")
data, _ := sonic.Marshal(resp) // ~0.6µs per response
w.Write(data)
}
Alternatively, use easyjson code generation:
go install github.com/mailru/easyjson/cmd/easyjson@latest
easyjson -all response.go # Generates Response.MarshalJSON
2. Request/Response Object Pooling
Reduce allocations with sync.Pool:
var requestBufferPool = sync.Pool{
New: func() interface{} {
return &RequestBuffer{
buf: make([]byte, 0, 8192),
headers: make(map[string]string),
}
},
}
type RequestBuffer struct {
buf []byte
headers map[string]string
}
func handleRequest(w http.ResponseWriter, r *http.Request) {
rb := requestBufferPool.Get().(*RequestBuffer)
defer requestBufferPool.Put(rb)
// Reset for reuse
rb.buf = rb.buf[:0]
for k := range rb.headers {
delete(rb.headers, k)
}
// Process request using pooled buffer
data, _ := sonic.Marshal(response)
rb.buf = append(rb.buf, data...)
w.Write(rb.buf)
}
Impact: 60% fewer allocations per request.
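The reset step is what makes the pool pay off: slicing to `[:0]` keeps the backing array, and clearing the map keeps its buckets. A minimal runnable sketch of that pattern (names mirror the snippet above, but this is an illustration, not the gateway's actual code):

```go
// Reset-and-reuse with sync.Pool: memory is retained across uses.
package main

import (
	"fmt"
	"sync"
)

type RequestBuffer struct {
	buf     []byte
	headers map[string]string
}

var pool = sync.Pool{
	New: func() any {
		return &RequestBuffer{
			buf:     make([]byte, 0, 8192),
			headers: make(map[string]string),
		}
	},
}

// reset empties the buffer for reuse without freeing its memory.
func (rb *RequestBuffer) reset() {
	rb.buf = rb.buf[:0] // length 0, capacity preserved
	for k := range rb.headers {
		delete(rb.headers, k)
	}
}

func main() {
	rb := pool.Get().(*RequestBuffer)
	rb.buf = append(rb.buf, "hello"...)
	rb.headers["X-Trace"] = "abc"

	rb.reset()
	fmt.Println(len(rb.buf), cap(rb.buf) >= 8192, len(rb.headers)) // 0 true 0
	pool.Put(rb)
}
```

Note that sync.Pool gives no guarantee an object survives a GC cycle; it reduces allocation pressure, it doesn't eliminate it.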
3. Connection Pooling for Downstream Services
// Before: New connection per request
func callDownstream(ctx context.Context) (*Service, error) {
client := &http.Client{Timeout: 2 * time.Second}
resp, err := client.Get("http://internal-service:8080/data")
// Creates new connection, DNS lookup, TLS handshake
...
}
// After: Reused connection pool
var downstreamClient = &http.Client{
Timeout: 2 * time.Second,
Transport: &http.Transport{
MaxIdleConns: 100,
MaxIdleConnsPerHost: 10,
IdleConnTimeout: 90 * time.Second,
DisableKeepAlives: false,
},
}
func callDownstream(ctx context.Context) (*Service, error) {
resp, err := downstreamClient.Get("http://internal-service:8080/data")
// Reuses TCP connection, no handshakes
...
}
4. HTTP Handler Middleware Optimization
// Before: Middleware chain with allocations
func chainMiddleware(h http.Handler, mw ...func(http.Handler) http.Handler) http.Handler {
for i := len(mw) - 1; i >= 0; i-- {
h = mw[i](h) // Wraps handler, adds allocation
}
return h
}
// After: Minimal allocation handler chain
type Handler struct {
fn func(http.ResponseWriter, *http.Request)
next *Handler
}
func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
h.fn(w, r)
if h.next != nil {
h.next.ServeHTTP(w, r)
}
}
5. Pre-Computed Responses for Common Queries
// Cache frequently requested data
var (
staticResponseCache = sync.Map{}
cacheTTL = 1 * time.Second
cacheUpdateTicker = time.NewTicker(cacheTTL)
)
func handleUsersList(w http.ResponseWriter, r *http.Request) {
if cached, ok := staticResponseCache.Load("users_list"); ok {
w.Header().Set("Content-Type", "application/json")
w.Write(cached.([]byte))
return
}
// Compute only on cache miss
users := getUsers()
data, _ := sonic.Marshal(users)
staticResponseCache.Store("users_list", data)
w.Write(data)
}
Results
Metric Before After Improvement
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Throughput 50k req/s 150k req/s 3x
P50 Latency 2ms 1.5ms 25% better
P99 Latency 45ms 8ms 82% better (MAJOR)
Allocations/req 12 3 75% fewer
GC Pause 2.1ms 0.4ms 81% reduction
Memory Usage 850MB 420MB 50% reduction
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The p99 latency spike was caused by GC pauses triggered by excessive allocations. Fixing serialization and pooling eliminated the spike entirely.
Case Study 2: Data Pipeline / ETL (10M Records CSV → Postgres)
Scenario
A data pipeline processes 10M CSV records (2GB file), transforms each row, and loads into PostgreSQL. Initial implementation takes 45 minutes.
Diagnosis
Profile with pprof and observe:
Time breakdown (45 minutes total):
- Reading CSV: 15 minutes (33%)
- Per-row allocation: 12 minutes (27%)
- INSERT statements: 18 minutes (40%)
Bottleneck: database roundtrips. Each INSERT statement requires:
- Query parsing
- Plan preparation
- Execution
- Network roundtrip
Optimizations Applied
1. Buffered Reader with Larger Buffer
// Before: Default buffered reader
import "encoding/csv"
file, _ := os.Open("data.csv")
reader := csv.NewReader(file) // Default 4KB buffer
for {
record, _ := reader.Read()
processRow(record)
}
// After: wrap the file in a 64KB bufio.Reader before the CSV parser
// (csv.Reader has no exported buffer field; size the underlying reader instead)
reader := csv.NewReader(bufio.NewReaderSize(file, 64*1024)) // 64KB vs the 4KB default
for {
	record, err := reader.Read()
	if err != nil {
		break // io.EOF ends the file
	}
	processRow(record)
}
Impact: ~10% faster file reading (fewer syscalls).
2. Pre-Allocated Slices for Batch Processing
// Before: Append to slice per row
var rows []Row
for {
record, _ := reader.Read()
rows = append(rows, parseRow(record)) // Reallocation every ~64 rows
}
// After: Pre-allocate with capacity
rows := make([]Row, 0, 100000) // Batch of 100k
for i := 0; i < 10000000; i++ {
record, _ := reader.Read()
rows = append(rows, parseRow(record))
if len(rows) == cap(rows) {
loadBatch(rows)
rows = rows[:0] // Reuse slice
}
}
3. COPY Protocol Instead of INSERT
The biggest optimization: use PostgreSQL's COPY protocol instead of individual INSERTs.
// Before: Individual INSERT statements
import "database/sql"
db, _ := sql.Open("postgres", connStr)
for _, row := range rows {
db.Exec(
"INSERT INTO users (id, name, email) VALUES ($1, $2, $3)",
row.ID, row.Name, row.Email,
) // Each: parse + plan + execute + network roundtrip = ~10ms
}
// After: COPY bulk insert
import "github.com/jackc/pgx/v5"
conn, _ := pgx.Connect(context.Background(), connStr)
rows := make([][]interface{}, 0, 100000)
for record := range csvRecords {
rows = append(rows, []interface{}{record.ID, record.Name, record.Email})
if len(rows) == 100000 {
// Bulk insert via COPY (single roundtrip for 100k rows)
conn.CopyFrom(
context.Background(),
[]string{"id", "name", "email"},
pgx.CopyFromRows(rows),
)
rows = rows[:0]
}
}
Performance comparison:
- INSERT: 1 statement = 1 roundtrip ≈ 0.1ms on a local network. For 10M rows ≈ 1,000 seconds (the observed ~18 minutes)
- COPY: 100,000 rows = 1 roundtrip ≈ 50ms. For 10M rows = 100 roundtrips ≈ 5 seconds
4. Worker Pool for Parallel Chunk Processing
const numWorkers = 4
type ChunkJob struct {
records [][]string
}
jobChan := make(chan ChunkJob, numWorkers)
// Start worker pool
for i := 0; i < numWorkers; i++ {
go func() {
conn, _ := pgx.Connect(context.Background(), connStr)
defer conn.Close(context.Background())
for job := range jobChan {
rows := make([][]interface{}, len(job.records))
for i, record := range job.records {
rows[i] = []interface{}{record[0], record[1], record[2]}
}
conn.CopyFrom(context.Background(), []string{"id", "name", "email"}, pgx.CopyFromRows(rows))
}
}()
}
// Producer: read CSV, chunk, send to workers
var chunk [][]string
for {
	record, err := reader.Read()
	if err != nil {
		break // io.EOF ends the file
	}
	chunk = append(chunk, record)
	if len(chunk) == 100000 {
		jobChan <- ChunkJob{records: chunk}
		chunk = make([][]string, 0, 100000)
	}
}
if len(chunk) > 0 {
	jobChan <- ChunkJob{records: chunk} // flush the final partial chunk
}
close(jobChan)
Impact: 4 parallel database connections = 4x faster loading.
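The worker-pool shape above, stripped of the database calls so it runs anywhere: workers drain a channel of chunks, a WaitGroup handles shutdown, and an atomic counter stands in for CopyFrom (processChunks is an illustrative harness, not the pipeline's actual API):

```go
// Worker pool over a buffered job channel, with clean shutdown.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type ChunkJob struct{ records [][]string }

// processChunks fans chunks out to numWorkers goroutines and returns
// the total number of rows "loaded".
func processChunks(chunks []ChunkJob, numWorkers int) int64 {
	jobChan := make(chan ChunkJob, numWorkers)
	var loaded int64
	var wg sync.WaitGroup

	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobChan {
				// Stand-in for conn.CopyFrom: tally the rows.
				atomic.AddInt64(&loaded, int64(len(job.records)))
			}
		}()
	}

	for _, c := range chunks {
		jobChan <- c
	}
	close(jobChan) // workers exit once the channel drains
	wg.Wait()
	return loaded
}

func main() {
	chunks := []ChunkJob{
		{records: make([][]string, 100)},
		{records: make([][]string, 250)},
		{records: make([][]string, 50)},
	}
	fmt.Println(processChunks(chunks, 4)) // 400
}
```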
5. Pipeline Pattern: Read → Transform → Load Stages
// Stage 1: CSV Reader
type CSVRecord []string
csvChan := make(chan CSVRecord, 10000)
go func() {
	reader := csv.NewReader(file)
	for {
		record, err := reader.Read()
		if err != nil {
			break // io.EOF ends the stream
		}
		csvChan <- record
	}
	close(csvChan)
}()
// Stage 2: Transform
type UserRow struct {
ID int
Name string
Email string
}
transformChan := make(chan UserRow, 10000)
go func() {
for record := range csvChan {
transformChan <- UserRow{
ID: parseID(record[0]),
Name: record[1],
Email: record[2],
}
}
close(transformChan)
}()
// Stage 3: Batch & Load
go func() {
	batch := make([]UserRow, 0, 100000)
	for row := range transformChan {
		batch = append(batch, row)
		if len(batch) == 100000 {
			loadBatchViaCopy(batch)
			batch = batch[:0]
		}
	}
	if len(batch) > 0 {
		loadBatchViaCopy(batch) // flush the final partial batch
	}
}()
Results
Metric Before After Improvement
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total Time 45 minutes 3 minutes 15x faster
CSV Read Time 15 min 13 min 13% (buffering helped)
Database Load Time 18 min 2.5 min 7x (COPY + parallelism)
Per-Record Alloc 0.6µs 0.04µs 15x fewer
Memory Peak 2.5GB 420MB 84% reduction
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The critical insight: don't optimize I/O in isolation. The real bottleneck was architectural (INSERT vs the COPY protocol).
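The staged read → transform → load design from step 5 can be exercised end-to-end in miniature. Here the "CSV" is an in-memory slice and the load stage just tallies rows, so the staging and batching logic run as-is (runPipeline and its flush behavior are illustrative, not the pipeline's actual code):

```go
// A three-stage channel pipeline: reader → transformer → batcher.
package main

import (
	"fmt"
	"strconv"
)

type UserRow struct {
	ID   int
	Name string
}

// runPipeline returns how many batches were flushed and how many rows
// passed through, including the final partial batch.
func runPipeline(lines [][]string, batchSize int) (batches, rows int) {
	csvChan := make(chan []string, 16)
	transformChan := make(chan UserRow, 16)
	done := make(chan struct{})

	// Stage 1: reader
	go func() {
		for _, l := range lines {
			csvChan <- l
		}
		close(csvChan)
	}()

	// Stage 2: transform
	go func() {
		for rec := range csvChan {
			id, _ := strconv.Atoi(rec[0])
			transformChan <- UserRow{ID: id, Name: rec[1]}
		}
		close(transformChan)
	}()

	// Stage 3: batch & "load"
	go func() {
		batch := make([]UserRow, 0, batchSize)
		flush := func() {
			if len(batch) > 0 {
				batches++
				rows += len(batch)
				batch = batch[:0]
			}
		}
		for row := range transformChan {
			batch = append(batch, row)
			if len(batch) == batchSize {
				flush()
			}
		}
		flush() // final partial batch
		close(done)
	}()

	<-done
	return batches, rows
}

func main() {
	var lines [][]string
	for i := 0; i < 25; i++ {
		lines = append(lines, []string{strconv.Itoa(i), "user"})
	}
	b, r := runPipeline(lines, 10)
	fmt.Println(b, r) // 3 25
}
```

Each stage runs concurrently, so the reader can be pulling the next chunk while the loader is still writing the previous one.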
Case Study 3: CLI File Search Tool (100k Files, Slow Startup)
Scenario
A "grep-like" file search tool scans 100k files looking for pattern matches. Startup takes 200ms, actual search takes 150ms. Users expect under 100ms for snappy CLI feel.
Diagnosis
$ time ./search "pattern" /large/directory
# Breakdown with detailed timing:
real 0m0.200s (startup)
user 0m0.048s (search)
sys 0m0.032s (file I/O)
Problems identified:
- Startup overhead: regexp compilation repeated instead of done once (~30ms)
- File traversal: Using filepath.Walk calls os.Stat per file (50ms lost)
- Single-threaded: No parallelism (file traversal is I/O-bound)
Optimizations Applied
1. Compile Regexp Once (Global Variable)
// Before: compile inside the per-file search path
func searchFile(path string) {
	re, _ := regexp.Compile(pattern) // recompiled repeatedly: ~10-30ms total per run
	scan(path, re)
}
// After: compile once in main, store in a global, reuse everywhere
var compiledRegex *regexp.Regexp

func main() {
	pattern := flag.String("pattern", "", "search pattern")
	flag.Parse()
	var err error
	compiledRegex, err = regexp.Compile(*pattern) // compile ONCE
	if err != nil {
		log.Fatal(err)
	}
	search(compiledRegex)
}
Impact: -30ms startup.
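A runnable sketch of the compile-once pattern: the regex lives at package level and is shared by every match call (errPattern and grepLines are illustrative names, not from the tool's code):

```go
// Compile-once regexp: one regexp.MustCompile at package init,
// reused across all inputs.
package main

import (
	"fmt"
	"regexp"
)

// Compiled exactly once, at package init.
var errPattern = regexp.MustCompile(`(?i)\b(error|fatal)\b`)

// grepLines returns only the lines matching the precompiled pattern.
func grepLines(lines []string) []string {
	var hits []string
	for _, l := range lines {
		if errPattern.MatchString(l) {
			hits = append(hits, l)
		}
	}
	return hits
}

func main() {
	logs := []string{
		"INFO: started",
		"ERROR: disk full",
		"DEBUG: tick",
		"fatal: cannot open repo",
	}
	fmt.Println(len(grepLines(logs))) // 2
}
```

MustCompile panics on a bad pattern, which is appropriate for patterns fixed at build time; user-supplied patterns should go through regexp.Compile and an error check, as in the snippet above.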
2. WalkDir Instead of Walk (No Stat Per File)
// Before: filepath.Walk calls os.Stat per entry
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
if !info.IsDir() {
searchFile(path)
}
return nil
})
// Each entry: an extra lstat syscall ≈ 0.5µs × 100k files ≈ 50ms wasted
// After: filepath.WalkDir includes FileInfo without extra syscalls
filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
if !d.IsDir() {
searchFile(path)
}
return nil
})
// No extra lstat: the entry type comes back with the directory read itself
Impact: -50ms (huge for file traversal).
3. Parallel Directory Traversal
// Before: Sequential scan
func searchDir(path string) {
filepath.WalkDir(path, func(path string, d fs.DirEntry, err error) error {
if !d.IsDir() {
searchFile(path)
}
return nil
})
}
// After: bounded goroutine fan-out for directory recursion
const maxConcurrency = 8

var sem = make(chan struct{}, maxConcurrency)

func searchDirParallel(path string, wg *sync.WaitGroup) {
	defer wg.Done()
	entries, _ := os.ReadDir(path)
	for _, e := range entries {
		full := filepath.Join(path, e.Name())
		if e.IsDir() {
			wg.Add(1)
			select {
			case sem <- struct{}{}: // slot free: recurse in a new goroutine
				go func(subdir string) {
					defer func() { <-sem }()
					searchDirParallel(subdir, wg)
				}(full)
			default: // all slots busy: recurse inline to avoid deadlock
				searchDirParallel(full, wg)
			}
		} else {
			searchFile(full)
		}
	}
}
Impact: 4x faster traversal (parallel directory reads keep the disk queue full).
4. Binary Size Optimization
# Default binary: 15MB
go build -o search
# Production binary: 4.2MB
CGO_ENABLED=0 go build -ldflags="-s -w" -trimpath -o search
# With UPX: 1.8MB (startup cost: 30ms decompression)
upx --best search
For CLI tools, 30ms UPX decompression is often acceptable if saving bandwidth matters.
5. Pre-Compiled Search Patterns
If patterns are known ahead of time:
# Generate code for patterns at build time
go generate ./...
// pattern_codegen.go
//go:generate patterns_generator
// Generated file: patterns_generated.go
var patterns = []*regexp.Regexp{
regexp.MustCompile("error"),
regexp.MustCompile("warning"),
regexp.MustCompile("fatal"),
}
// Compiled at init(), no runtime compilation
Results
Metric Before After Improvement
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Startup Time 200ms 30ms 6.6x
Traversal Time 150ms 40ms 3.7x
Total (Startup+Search) 200ms 70ms 2.8x
Binary Size 15MB 4.2MB 72% smaller
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Startup is now snappy (under 100ms). The remaining 40ms is core file traversal (hard to optimize further without an architecture change like indexing).
Case Study 4: High-Throughput Kafka Message Processor (500k msg/s, Falling Behind)
Scenario
A service consuming from Kafka at 500k messages/second, performing light transformation, and writing to ClickHouse. Consumer is falling behind (lag increasing 10k messages/minute), consuming 80% CPU.
Diagnosis
GC profile shows 50% CPU in garbage collection. CPU profile reveals:
CPU Profile:
- JSON unmarshaling: 25% CPU
- Per-message allocations: 20% CPU
- GC: 50% CPU (mark/sweep)
- ClickHouse INSERTS: 5% CPU
Memory Profile:
- 500k msgs/sec × 2KB per message = 1GB/sec allocation rate
- GC target: 2GB heap
- GC triggered every 2 seconds (pause: 200ms each)
Problem: Excessive allocations trigger frequent GC pauses, causing lag.
Optimizations Applied
1. Object Pooling for Message Structs
// Message structure
type KafkaMessage struct {
ID string `json:"id"`
Timestamp int64 `json:"timestamp"`
Data map[string]interface{} `json:"data"`
Metadata map[string]string `json:"metadata"`
}
// Before: Allocate new message per record
func processMessage(data []byte) {
var msg KafkaMessage
json.Unmarshal(data, &msg)
handleMessage(&msg)
// msg goes out of scope → GC candidate
}
// After: Pool reusable message structs
var msgPool = sync.Pool{
New: func() interface{} {
return &KafkaMessage{
Data: make(map[string]interface{}),
Metadata: make(map[string]string),
}
},
}
func processMessage(data []byte) {
msg := msgPool.Get().(*KafkaMessage)
defer msgPool.Put(msg)
// Reset maps
for k := range msg.Data {
delete(msg.Data, k)
}
for k := range msg.Metadata {
delete(msg.Metadata, k)
}
json.Unmarshal(data, msg)
handleMessage(msg)
// msg returned to pool, reused next iteration
}
Impact: ~70% fewer allocations.
2. Batch Processing
// Before: process each message immediately
for msg := range kafkaConsumer {
	transformed := processMessage(msg)
	writeToClickHouse(transformed) // 500k single-row inserts/sec
}
// After: batch 1000 messages, write together
const batchSize = 1000

batch := make([]*TransformedMessage, 0, batchSize)
for msg := range kafkaConsumer {
	transformed := processMessage(msg)
	batch = append(batch, transformed)
	if len(batch) == batchSize {
		writeClickHouseBatch(batch)
		batch = batch[:0]
	}
}
if len(batch) > 0 {
	writeClickHouseBatch(batch) // flush the final partial batch
}
Impact: 1000x fewer database roundtrips (500k single-row inserts/sec → 500 batch writes/sec).
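The batching logic in isolation, runnable anywhere: a hypothetical flush callback records batch sizes instead of writing to ClickHouse, and the final partial batch is flushed rather than dropped (batchProcess is an illustrative helper):

```go
// Fixed-size batching with a tail flush.
package main

import "fmt"

// batchProcess groups items into batches of batchSize and calls flush on
// each, including the final partial batch.
func batchProcess(items []int, batchSize int, flush func([]int)) {
	batch := make([]int, 0, batchSize)
	for _, it := range items {
		batch = append(batch, it)
		if len(batch) == batchSize {
			flush(batch)
			batch = batch[:0]
		}
	}
	if len(batch) > 0 {
		flush(batch) // don't drop the tail
	}
}

func main() {
	var sizes []int
	items := make([]int, 2500)
	batchProcess(items, 1000, func(b []int) { sizes = append(sizes, len(b)) })
	fmt.Println(sizes) // [1000 1000 500]
}
```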
3. Switch to Protobuf with vtprotobuf
// Before: JSON unmarshaling (slower, larger messages)
import "encoding/json"
type Message struct {
ID string `json:"id"`
Type int `json:"type"`
Data string `json:"data"`
}
// After: Protobuf (compact, fast unmarshaling)
import "google.golang.org/protobuf/proto"
// message.proto
syntax = "proto3";
message Message {
string id = 1;
int32 type = 2;
string data = 3;
}
// With vtprotobuf:
// go install github.com/planetscale/vtprotobuf/cmd/protoc-gen-go-vtproto@latest
// Generate with: protoc --go_out=. --go-vtproto_out=. message.protoProtobuf benefits:
- ~40% smaller message size
- ~3x faster unmarshaling
- ~5x fewer allocations during unmarshaling
4. GOMEMLIMIT and GOGC Tuning
// Set memory limit for GC
import "runtime/debug"
func init() {
// Go 1.19+: Hard limit on heap
debug.SetMemoryLimit(5 * 1024 * 1024 * 1024) // 5GB max
// GOGC: Control GC frequency (default 100 = GC when heap doubles)
// Lower GOGC = more frequent GC, less pausing
// Higher GOGC = less frequent GC, longer pauses
}
// In deployment:
// GOGC=50 (GC when heap grows 50%) - more frequent but shorter pauses
// GOMEMLIMIT=5GiB go run main.go
5. Partitioned Channels to Reduce Contention
// Before: Single channel bottleneck
msgChan := make(chan *Message, 1000)
for i := 0; i < 4; i++ {
go func() {
for msg := range msgChan {
process(msg)
}
}()
}
// After: Partition by message ID hash
const numPartitions = 16
channels := make([]chan *Message, numPartitions)
for i := 0; i < numPartitions; i++ {
channels[i] = make(chan *Message, 1000)
}
// Send messages to partitions
go func() {
for msg := range kafkaConsumer {
partition := hashID(msg.ID) % numPartitions
channels[partition] <- msg
}
}()
// Process each partition independently
for i := 0; i < numPartitions; i++ {
for j := 0; j < 4; j++ {
go func(ch chan *Message) {
for msg := range ch {
process(msg)
}
}(channels[i])
}
}
Impact: Reduces channel lock contention from a single serialization point.
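The hashID helper isn't shown above; a minimal version using FNV-1a from the standard library works, since it maps the same ID to the same partition every time (required to preserve per-key ordering). This is one possible implementation, not the service's actual one:

```go
// Stable partition hashing with hash/fnv.
package main

import (
	"fmt"
	"hash/fnv"
)

const numPartitions = 16

// hashID returns a stable non-negative hash for a message ID.
func hashID(id string) int {
	h := fnv.New32a()
	h.Write([]byte(id)) // never returns an error
	return int(h.Sum32())
}

func main() {
	p := hashID("order-42") % numPartitions
	fmt.Println(p == hashID("order-42")%numPartitions) // true: same key, same partition
	fmt.Println(p >= 0 && p < numPartitions)           // true: in range
}
```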
Results
Metric Before After Improvement
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Throughput 500k msg/s 1.2M msg/s 2.4x
GC Pause Time 200ms 8ms 96% reduction
GC CPU Time 50% 5% 90% reduction
Allocations per Message 8 1 87.5% fewer
Kafka Consumer Lag +10k/min -50k/min Catching up
P99 Latency 150ms 12ms 92% reduction
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The service now sustains 1.2M msg/s with minimal GC impact. Lag decreases instead of increasing.
Case Study 5: Memory-Constrained Service (256MB Limit, Frequent OOMs)
Scenario
A microservice in Kubernetes with 256MB memory limit crashes with OOMKilled events 3-5 times per day. The service runs fine for 6-12 hours, then crashes suddenly.
Diagnosis
Memory profile shows:
Alloc = 240MB Sys = 380MB NumGC = 12345
HeapAlloc = 240MB HeapSys = 380MB HeapIdle = 60MB
HeapInuse = 320MB HeapReleased = 0MB HeapObjects = 1200000
Problem: the heap never shrinks; memory stays fragmented and is never returned to the OS.
Root causes:
- Unbounded cache: LRU cache grows to 200MB, never evicts
- Large temporary allocations: Request processing allocates 5MB buffers
- Memory fragmentation: Go can't return freed memory to OS
Optimizations Applied
1. LRU Cache with Size Limit
// Before: Unbounded cache
type UnboundedCache struct {
data map[string]interface{}
mu sync.RWMutex
}
func (c *UnboundedCache) Get(key string) interface{} {
c.mu.RLock()
defer c.mu.RUnlock()
return c.data[key]
}
func (c *UnboundedCache) Set(key string, value interface{}) {
c.mu.Lock()
defer c.mu.Unlock()
c.data[key] = value // Unbounded growth
}
// After: bounded LRU (github.com/hashicorp/golang-lru)
// The library's Cache is already size-capped and thread-safe, so no extra
// mutex or manual eviction is needed.
type BoundedLRUCache struct {
	lru *lru.Cache
}

func (c *BoundedLRUCache) Get(key string) (interface{}, bool) {
	return c.lru.Get(key)
}

func (c *BoundedLRUCache) Set(key string, value interface{}) {
	c.lru.Add(key, value) // evicts the oldest entry when at capacity
}

// Initialize with a ~10MB budget
cache, _ := lru.New(10000) // 10,000 items × ~1KB avg ≈ 10MB
2. Streaming Processing Instead of Loading Full Dataset
// Before: Load entire file into memory
func processFile(filename string) error {
data, _ := ioutil.ReadFile(filename) // 50MB file loaded at once
var records []Record
json.Unmarshal(data, &records)
for _, record := range records {
process(record)
}
return nil
}
// After: stream records one at a time
func processFile(filename string) error {
	file, err := os.Open(filename)
	if err != nil {
		return err
	}
	defer file.Close()
	decoder := json.NewDecoder(file)
	decoder.UseNumber() // keep numbers as json.Number, avoiding float64 conversions
	for {
		var record Record
		if err := decoder.Decode(&record); err == io.EOF {
			break
		} else if err != nil {
			return err
		}
		process(record)
		// Memory: constant, a few KB per record, not 50MB for the whole file
	}
	return nil
}
3. GOMEMLIMIT and GOGC Tuning
// Force aggressive GC and memory limits
import "runtime/debug"
func init() {
// Hard memory limit: 200MB (leave 56MB for OS/buffers)
debug.SetMemoryLimit(200 * 1024 * 1024)
// GOGC=30: Trigger GC frequently (less memory held, more CPU)
// Tradeoff: 2% CPU increase, zero OOMs
}
Kubernetes deployment:
apiVersion: v1
kind: Pod
metadata:
name: memory-constrained-service
spec:
containers:
- name: app
image: myapp:latest
env:
- name: GOMEMLIMIT
value: "200MiB"
- name: GOGC
value: "30"
resources:
limits:
memory: "256Mi"
cpu: "500m"
4. Escape Analysis Fixes: Return Values to Stack
// Before: Allocates on heap
func newConfig() *Config {
return &Config{
Name: "default",
Timeout: 30 * time.Second,
}
}
// After: Return by value (stack allocated if inlined)
func newConfig() Config {
return Config{
Name: "default",
Timeout: 30 * time.Second,
}
}
// Caller
cfg := newConfig() // Stack-allocated if it doesn't escape; zero heap pressure
5. Arena Allocation for Request-Scoped Data
// Go 1.20+ experimental arenas (requires building with GOEXPERIMENT=arenas)
import "arena"
func handleRequest(w http.ResponseWriter, r *http.Request) {
// Allocate all request data in one arena
a := arena.NewArena()
defer a.Free()
// Allocations using arena.New() within this scope
requestData := arena.New[RequestData](a)
requestData.Parse(r.Body)
response := processRequest(requestData, a)
w.Write(response)
// All allocations freed when arena is freed
}
Impact: Request-scoped allocations freed together, preventing fragmentation.
Results
Metric Before After Improvement
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Heap Usage (Idle) 240MB 120MB 50% reduction
Peak Heap Usage 250MB 180MB 28% reduction
OOM Events per Week 21-35 0 100% elimination
Cache Size Limit Unbounded 10MB Capped
GC Pause Time 50ms 15ms 70% reduction
CPU Usage (from GC) 12% 8% 33% reduction
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The service now runs indefinitely within the 256MB limit. No more OOMKilled events.
Optimization Checklist
Low-Hanging Fruit (Always Check First)
These optimizations have high ROI and minimal complexity:
- Preallocation: make([]T, 0, expectedSize) instead of growing slices
- sync.Pool: Reuse frequently allocated objects
- Connection pooling: Reuse HTTP/database connections
- Buffered I/O: bufio.Reader with an appropriate buffer size
- Batch operations: Process N items together, not individually
- Caching: Cache frequent queries/computations
Typical impact: 10-30% performance improvement, minutes to implement.
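A quick runnable illustration of the preallocation item: with capacity reserved up front, append never reallocates the backing array (the fill helper is illustrative):

```go
// Preallocation vs. organic slice growth.
package main

import "fmt"

// fill appends n items and reports how many times append had to grow
// the backing array.
func fill(n int, prealloc bool) int {
	var s []int
	if prealloc {
		s = make([]int, 0, n)
	}
	grows := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, i)
		if cap(s) != before {
			grows++
		}
	}
	return grows
}

func main() {
	fmt.Println(fill(100000, false) > 0) // true: many reallocations (and copies)
	fmt.Println(fill(100000, true))      // 0: capacity reserved up front
}
```

Each growth reallocates and copies the whole slice, so the savings compound for large or hot-path slices.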
Medium Effort
These require more investigation but pay off on high-traffic services:
- Serialization format change: JSON → Protobuf, MessagePack
- Streaming instead of loading: Process large files incrementally
- Parallelism tuning: GOMAXPROCS, goroutine pool sizing
- Lock-free structures: Atomic operations, channel optimization
- Memory limits: GOMEMLIMIT, GOGC tuning
Typical impact: 50-200% improvement, hours to implement and test.
High Effort
Architecture changes or custom implementations:
- Custom data structures: Hash table, B-tree optimized for use case
- Algorithmic improvements: O(n) → O(log n), reduce redundant work
- System redesign: Move from request-response to streaming/batch
- SIMD/assembly: Hand-optimized hot loops (rare in Go)
Typical impact: 2-10x improvement, days to weeks of development.
Anti-Patterns to Avoid
Premature Optimization
- Optimize without profiling: wasted effort on wrong paths
- Optimize cold paths: 1% of execution time, zero user impact
- Readability vs speed: Choose readability unless profile proves otherwise
Micro-Benchmarking Without Profiling
- Isolated benchmark: 10x faster
- Real workload: 1.1x faster (other bottlenecks still exist)
- Profile first, benchmark to verify improvements
Optimizing Without Measuring
- Applied sync.Pool: "feels faster"
- Really: 0.1% improvement, added complexity
- Always measure before and after
Focusing on Binary Size Without Load Testing
- "Smaller binary = faster startup"
- Reality: Startup already 50ms, total request 2s; irrelevant
- Measure where time is actually spent
Tools Reference
| Tool | Use Case | When NOT to Use |
|---|---|---|
| go test -bench | Micro-benchmarks, tight loops | Whole-system performance |
| pprof -http | Interactive CPU/memory profiling | Latency spikes (use tracing) |
| go tool trace | Goroutine scheduling, GC events | Steady-state CPU profiling |
| benchstat | Compare before/after benchmarks | Single measurements |
| go tool pprof -base | Diff two profiles over time | Real-time monitoring |
| runtime/metrics | GC stats, allocations | Binary size analysis |
| syscall tracing | System call overhead | Code-level CPU time |
| go-torch (legacy; pprof -http now includes flame graphs) | Flamegraph visualization | Small, isolated benchmarks |
Summary
Real-world optimization follows a pattern:
- Measure first: Use pprof, benchmarks, load tests
- Identify: Find the single biggest bottleneck
- Hypothesize: Form specific theory about root cause
- Optimize: Apply targeted fix (usually one of the case studies above)
- Verify: Measure improvement, check for side effects
- Repeat: Move to next bottleneck
The case studies show that solutions vary by domain:
- API gateways: Serialization + pooling + connection reuse
- Data pipelines: Batching + protocol choice (COPY vs INSERT)
- CLI tools: Startup overhead + parallelism + file traversal
- Message processing: Allocations + GC tuning + batch processing
- Memory-constrained: Caching limits + streaming + GOGC tuning
Most services benefit from applying the low-hanging fruit (preallocation, pooling, buffering) first, then measuring to identify the next bottleneck. Amdahl's law and the 90/10 rule remind us: a small fraction of your code accounts for most of the runtime. Find it. Optimize it. Done.