Real-World Performance Case Studies
Practical performance optimization walkthroughs — from profiling to production, covering API gateways, data pipelines, CLI tools, and high-throughput services.
The Optimization Workflow
Performance optimization follows a scientific method: measure, hypothesize, experiment, verify. Skipping measurement is the most common mistake, leading to wasted effort on irrelevant optimizations.
The Process
1. Measure First
   - Establish baseline metrics: throughput, latency, memory usage, CPU
   - Use benchmarks, production profiling, and load tests
   - Identify where time/resources are actually spent
2. Identify the Bottleneck
   - CPU-bound: inlining failures, excessive allocations, hot loops
   - Memory-bound: GC pressure, large allocations, cache misses
   - I/O-bound: system calls, network roundtrips, disk seeks
   - Contention: lock contention, channel blocking, goroutine starvation
3. Form a Hypothesis
   - "JSON serialization takes 40% of request time"
   - "Memory allocations trigger GC every 5ms"
   - "Database roundtrips are the bottleneck"
4. Apply Targeted Optimization
   - Change only one thing at a time
   - Keep the hypothesis focused (easy to debug if wrong)
5. Measure Again
   - Verify the hypothesis was correct
   - Quantify the improvement
   - Check for side effects (e.g., latency increase with higher throughput)
6. Repeat
   - Focus on the next-biggest bottleneck
   - Diminishing returns: expect 5-10% improvement per iteration
Amdahl's Law
If a bottleneck consumes 40% of execution time, eliminating it entirely yields at best a 1.67x speedup (100% / (60% + 40%/∞)). Optimize the biggest bottleneck first.
If you make operation A 10x faster:
- A takes 40% of time: overall speedup = ~1.56x
- A takes 10% of time: overall speedup = ~1.1x
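These numbers fall straight out of Amdahl's formula. A tiny calculator to check them (the `speedup` helper is illustrative, not from the original text):

```go
// Amdahl's law: overall speedup when a fraction p of runtime
// is made s times faster.
package main

import "fmt"

// speedup returns 1 / ((1 - p) + p/s), where p is the fraction of total
// time spent in the optimized part and s is its local speedup.
func speedup(p, s float64) float64 {
	return 1 / ((1 - p) + p/s)
}

func main() {
	fmt.Printf("40%% of time, 10x faster: %.2fx overall\n", speedup(0.40, 10)) // ~1.56x
	fmt.Printf("10%% of time, 10x faster: %.2fx overall\n", speedup(0.10, 10)) // ~1.10x
}
```

Pushing s to infinity gives the ceiling: with p = 0.4 the best possible overall speedup is 1/(1-0.4) ≈ 1.67x, no matter how fast the optimized part gets.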
Lesson: Find the biggest bottleneck. Small optimizations on cold paths waste effort.
Case Study 1: REST API Gateway (50k req/s, p99 Latency Spikes)
Scenario
A microservices API gateway handling 50k requests per second shows p99 latency spikes from 8ms to 45ms. The service uses standard Go HTTP with encoding/json for request/response serialization.
Diagnosis: Profiling
# Capture CPU profile
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof
go tool pprof cpu.prof
# Interactive analysis
(pprof) top
Showing nodes accounting for 4200ms, 87.5% of 4800ms total
flat flat% sum% cum cum%
1200ms 25.0% 25.0% 2100ms 43.8% encoding/json.(*encodeState).string
850ms 17.7% 42.7% 2400ms 50.0% encoding/json.(*encodeState).value
620ms 12.9% 55.6% 620ms 12.9% runtime.mallocgc
...
Profile reveals:
- 43.8% CPU in JSON serialization
- 12.9% in allocations (GC pressure)
- 15% in HTTP handler chain
Optimizations Applied
1. Switch to sonic (Fast JSON Serializer)
sonic uses JIT compilation and SIMD for 3-5x JSON throughput:
// Before: encoding/json
import "encoding/json"
type Response struct {
ID int `json:"id"`
Name string `json:"name"`
Score float64 `json:"score"`
}
func handleRequest(w http.ResponseWriter, r *http.Request) {
resp := Response{ID: 1, Name: "Alice", Score: 98.5}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(resp) // ~2.5µs per response
}
// After: sonic (or easyjson)
import "github.com/bytedance/sonic"
func handleRequest(w http.ResponseWriter, r *http.Request) {
resp := Response{ID: 1, Name: "Alice", Score: 98.5}
w.Header().Set("Content-Type", "application/json")
data, _ := sonic.Marshal(resp) // ~0.6µs per response
w.Write(data)
}
Alternatively, use easyjson code generation:
go install github.com/mailru/easyjson/cmd/easyjson@latest
easyjson -all response.go # Generates Response.MarshalJSON
2. Request/Response Object Pooling
Reduce allocations with sync.Pool:
var requestBufferPool = sync.Pool{
New: func() interface{} {
return &RequestBuffer{
buf: make([]byte, 0, 8192),
headers: make(map[string]string),
}
},
}
type RequestBuffer struct {
buf []byte
headers map[string]string
}
func handleRequest(w http.ResponseWriter, r *http.Request) {
rb := requestBufferPool.Get().(*RequestBuffer)
defer requestBufferPool.Put(rb)
// Reset for reuse
rb.buf = rb.buf[:0]
for k := range rb.headers {
delete(rb.headers, k)
}
// Process request using pooled buffer
data, _ := sonic.Marshal(response)
rb.buf = append(rb.buf, data...)
w.Write(rb.buf)
}
Impact: 60% fewer allocations per request.
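The reset step is what makes the pool pay off: slicing to `[:0]` keeps the backing array, and clearing the map keeps its buckets. A minimal runnable sketch of that pattern (names mirror the snippet above, but this is an illustration, not the gateway's actual code):

```go
// Reset-and-reuse with sync.Pool: memory is retained across uses.
package main

import (
	"fmt"
	"sync"
)

type RequestBuffer struct {
	buf     []byte
	headers map[string]string
}

var pool = sync.Pool{
	New: func() any {
		return &RequestBuffer{
			buf:     make([]byte, 0, 8192),
			headers: make(map[string]string),
		}
	},
}

// reset empties the buffer for reuse without freeing its memory.
func (rb *RequestBuffer) reset() {
	rb.buf = rb.buf[:0] // length 0, capacity preserved
	for k := range rb.headers {
		delete(rb.headers, k)
	}
}

func main() {
	rb := pool.Get().(*RequestBuffer)
	rb.buf = append(rb.buf, "hello"...)
	rb.headers["X-Trace"] = "abc"

	rb.reset()
	fmt.Println(len(rb.buf), cap(rb.buf) >= 8192, len(rb.headers)) // 0 true 0
	pool.Put(rb)
}
```

Note that sync.Pool gives no guarantee an object survives a GC cycle; it reduces allocation pressure, it doesn't eliminate it.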
3. Connection Pooling for Downstream Services
// Before: New connection per request
func callDownstream(ctx context.Context) (*Service, error) {
client := &http.Client{Timeout: 2 * time.Second}
resp, err := client.Get("http://internal-service:8080/data")
// Creates new connection, DNS lookup, TLS handshake
...
}
// After: Reused connection pool
var downstreamClient = &http.Client{
Timeout: 2 * time.Second,
Transport: &http.Transport{
MaxIdleConns: 100,
MaxIdleConnsPerHost: 10,
IdleConnTimeout: 90 * time.Second,
DisableKeepAlives: false,
},
}
func callDownstream(ctx context.Context) (*Service, error) {
resp, err := downstreamClient.Get("http://internal-service:8080/data")
// Reuses TCP connection, no handshakes
...
}
4. HTTP Handler Middleware Optimization
// Before: Middleware chain with allocations
func chainMiddleware(h http.Handler, mw ...func(http.Handler) http.Handler) http.Handler {
for i := len(mw) - 1; i >= 0; i-- {
h = mw[i](h) // Wraps handler, adds allocation
}
return h
}
// After: Minimal allocation handler chain
type Handler struct {
fn func(http.ResponseWriter, *http.Request)
next *Handler
}
func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
h.fn(w, r)
if h.next != nil {
h.next.ServeHTTP(w, r)
}
}
5. Pre-Computed Responses for Common Queries
// Cache frequently requested data
var (
staticResponseCache = sync.Map{}
cacheTTL = 1 * time.Second
cacheUpdateTicker = time.NewTicker(cacheTTL)
)
func handleUsersList(w http.ResponseWriter, r *http.Request) {
if cached, ok := staticResponseCache.Load("users_list"); ok {
w.Header().Set("Content-Type", "application/json")
w.Write(cached.([]byte))
return
}
// Compute only on cache miss
users := getUsers()
data, _ := sonic.Marshal(users)
staticResponseCache.Store("users_list", data)
w.Write(data)
}
Results
Metric Before After Improvement
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Throughput 50k req/s 150k req/s 3x
P50 Latency 2ms 1.5ms 25% better
P99 Latency 45ms 8ms 82% better (MAJOR)
Allocations/req 12 3 75% fewer
GC Pause 2.1ms 0.4ms 81% reduction
Memory Usage 850MB 420MB 50% reduction
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The p99 latency spike was caused by GC pauses triggered by excessive allocations. Fixing serialization and pooling eliminated the spike entirely.
Case Study 2: Data Pipeline / ETL (10M Records CSV → Postgres)
Scenario
A data pipeline processes 10M CSV records (2GB file), transforms each row, and loads into PostgreSQL. Initial implementation takes 45 minutes.
Diagnosis
Profile with pprof and observe:
Time breakdown (45 minutes total):
- Reading CSV: 15 minutes (33%)
- Per-row allocation: 12 minutes (27%)
- INSERT statements: 18 minutes (40%)
Bottleneck: database roundtrips. Each INSERT statement requires:
- Query parsing
- Plan preparation
- Execution
- Network roundtrip
Optimizations Applied
1. Buffered Reader with Larger Buffer
// Before: Default buffered reader
import "encoding/csv"
file, _ := os.Open("data.csv")
reader := csv.NewReader(file) // Default 4KB buffer
for {
record, _ := reader.Read()
processRow(record)
}
// After: wrap the file in a 64KB bufio.Reader before the CSV parser
// (csv.Reader has no exported buffer field; size the underlying reader instead)
reader := csv.NewReader(bufio.NewReaderSize(file, 64*1024)) // 64KB vs the 4KB default
for {
	record, err := reader.Read()
	if err != nil {
		break // io.EOF ends the file
	}
	processRow(record)
}
Impact: ~10% faster file reading (fewer syscalls).
2. Pre-Allocated Slices for Batch Processing
// Before: Append to slice per row
var rows []Row
for {
record, _ := reader.Read()
rows = append(rows, parseRow(record)) // Reallocation every ~64 rows
}
// After: Pre-allocate with capacity
rows := make([]Row, 0, 100000) // Batch of 100k
for i := 0; i < 10000000; i++ {
record, _ := reader.Read()
rows = append(rows, parseRow(record))
if len(rows) == cap(rows) {
loadBatch(rows)
rows = rows[:0] // Reuse slice
}
}
3. COPY Protocol Instead of INSERT
The biggest optimization: use PostgreSQL's COPY protocol instead of individual INSERTs.
// Before: Individual INSERT statements
import "database/sql"
db, _ := sql.Open("postgres", connStr)
for _, row := range rows {
db.Exec(
"INSERT INTO users (id, name, email) VALUES ($1, $2, $3)",
row.ID, row.Name, row.Email,
) // Each: parse + plan + execute + network roundtrip = ~10ms
}
// After: COPY bulk insert
import "github.com/jackc/pgx/v5"
conn, _ := pgx.Connect(context.Background(), connStr)
rows := make([][]interface{}, 0, 100000)
for record := range csvRecords {
rows = append(rows, []interface{}{record.ID, record.Name, record.Email})
if len(rows) == 100000 {
// Bulk insert via COPY (single roundtrip for 100k rows)
conn.CopyFrom(
context.Background(),
[]string{"id", "name", "email"},
pgx.CopyFromRows(rows),
)
rows = rows[:0]
}
}
Performance comparison:
- INSERT: 1 statement = 1 roundtrip ≈ 0.1ms on a local network. For 10M rows ≈ 1,000 seconds (the observed ~18 minutes)
- COPY: 100,000 rows = 1 roundtrip ≈ 50ms. For 10M rows = 100 roundtrips ≈ 5 seconds
4. Worker Pool for Parallel Chunk Processing
const numWorkers = 4
type ChunkJob struct {
records [][]string
}
jobChan := make(chan ChunkJob, numWorkers)
// Start worker pool
for i := 0; i < numWorkers; i++ {
go func() {
conn, _ := pgx.Connect(context.Background(), connStr)
defer conn.Close(context.Background())
for job := range jobChan {
rows := make([][]interface{}, len(job.records))
for i, record := range job.records {
rows[i] = []interface{}{record[0], record[1], record[2]}
}
conn.CopyFrom(context.Background(), []string{"id", "name", "email"}, pgx.CopyFromRows(rows))
}
}()
}
// Producer: read CSV, chunk, send to workers
var chunk [][]string
for {
	record, err := reader.Read()
	if err != nil {
		break // io.EOF ends the file
	}
	chunk = append(chunk, record)
	if len(chunk) == 100000 {
		jobChan <- ChunkJob{records: chunk}
		chunk = make([][]string, 0, 100000)
	}
}
if len(chunk) > 0 {
	jobChan <- ChunkJob{records: chunk} // flush the final partial chunk
}
close(jobChan)
Impact: 4 parallel database connections = 4x faster loading.
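The worker-pool shape above, stripped of the database calls so it runs anywhere: workers drain a channel of chunks, a WaitGroup handles shutdown, and an atomic counter stands in for CopyFrom (processChunks is an illustrative harness, not the pipeline's actual API):

```go
// Worker pool over a buffered job channel, with clean shutdown.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type ChunkJob struct{ records [][]string }

// processChunks fans chunks out to numWorkers goroutines and returns
// the total number of rows "loaded".
func processChunks(chunks []ChunkJob, numWorkers int) int64 {
	jobChan := make(chan ChunkJob, numWorkers)
	var loaded int64
	var wg sync.WaitGroup

	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobChan {
				// Stand-in for conn.CopyFrom: tally the rows.
				atomic.AddInt64(&loaded, int64(len(job.records)))
			}
		}()
	}

	for _, c := range chunks {
		jobChan <- c
	}
	close(jobChan) // workers exit once the channel drains
	wg.Wait()
	return loaded
}

func main() {
	chunks := []ChunkJob{
		{records: make([][]string, 100)},
		{records: make([][]string, 250)},
		{records: make([][]string, 50)},
	}
	fmt.Println(processChunks(chunks, 4)) // 400
}
```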
5. Pipeline Pattern: Read → Transform → Load Stages
// Stage 1: CSV Reader
type CSVRecord []string
csvChan := make(chan CSVRecord, 10000)
go func() {
	reader := csv.NewReader(file)
	for {
		record, err := reader.Read()
		if err != nil {
			break // io.EOF ends the stream
		}
		csvChan <- record
	}
	close(csvChan)
}()
// Stage 2: Transform
type UserRow struct {
ID int
Name string
Email string
}
transformChan := make(chan UserRow, 10000)
go func() {
for record := range csvChan {
transformChan <- UserRow{
ID: parseID(record[0]),
Name: record[1],
Email: record[2],
}
}
close(transformChan)
}()
// Stage 3: Batch & Load
go func() {
	batch := make([]UserRow, 0, 100000)
	for row := range transformChan {
		batch = append(batch, row)
		if len(batch) == 100000 {
			loadBatchViaCopy(batch)
			batch = batch[:0]
		}
	}
	if len(batch) > 0 {
		loadBatchViaCopy(batch) // flush the final partial batch
	}
}()
Results
Metric Before After Improvement
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total Time 45 minutes 3 minutes 15x faster
CSV Read Time 15 min 13 min 13% (buffering helped)
Database Load Time 18 min 2.5 min 7x (COPY + parallelism)
Per-Record Alloc 0.6µs 0.04µs 15x fewer
Memory Peak 2.5GB 420MB 84% reduction
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The critical insight: don't optimize I/O in isolation. The real bottleneck was architectural (INSERT vs the COPY protocol).
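The staged read → transform → load design from step 5 can be exercised end-to-end in miniature. Here the "CSV" is an in-memory slice and the load stage just tallies rows, so the staging and batching logic run as-is (runPipeline and its flush behavior are illustrative, not the pipeline's actual code):

```go
// A three-stage channel pipeline: reader → transformer → batcher.
package main

import (
	"fmt"
	"strconv"
)

type UserRow struct {
	ID   int
	Name string
}

// runPipeline returns how many batches were flushed and how many rows
// passed through, including the final partial batch.
func runPipeline(lines [][]string, batchSize int) (batches, rows int) {
	csvChan := make(chan []string, 16)
	transformChan := make(chan UserRow, 16)
	done := make(chan struct{})

	// Stage 1: reader
	go func() {
		for _, l := range lines {
			csvChan <- l
		}
		close(csvChan)
	}()

	// Stage 2: transform
	go func() {
		for rec := range csvChan {
			id, _ := strconv.Atoi(rec[0])
			transformChan <- UserRow{ID: id, Name: rec[1]}
		}
		close(transformChan)
	}()

	// Stage 3: batch & "load"
	go func() {
		batch := make([]UserRow, 0, batchSize)
		flush := func() {
			if len(batch) > 0 {
				batches++
				rows += len(batch)
				batch = batch[:0]
			}
		}
		for row := range transformChan {
			batch = append(batch, row)
			if len(batch) == batchSize {
				flush()
			}
		}
		flush() // final partial batch
		close(done)
	}()

	<-done
	return batches, rows
}

func main() {
	var lines [][]string
	for i := 0; i < 25; i++ {
		lines = append(lines, []string{strconv.Itoa(i), "user"})
	}
	b, r := runPipeline(lines, 10)
	fmt.Println(b, r) // 3 25
}
```

Each stage runs concurrently, so the reader can be pulling the next chunk while the loader is still writing the previous one.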
Case Study 3: CLI File Search Tool (100k Files, Slow Startup)
Scenario
A "grep-like" file search tool scans 100k files looking for pattern matches. Startup takes 200ms, actual search takes 150ms. Users expect under 100ms for snappy CLI feel.
Diagnosis
$ time ./search "pattern" /large/directory
# Breakdown with detailed timing:
real 0m0.200s (startup)
user 0m0.048s (search)
sys 0m0.032s (file I/O)
Problems identified:
- Startup overhead: regexp compilation repeated instead of done once (~30ms)
- File traversal: Using filepath.Walk calls os.Stat per file (50ms lost)
- Single-threaded: No parallelism (file traversal is I/O-bound)
Optimizations Applied
1. Compile Regexp Once (Global Variable)
// Before: compile inside the per-file search path
func searchFile(path string) {
	re, _ := regexp.Compile(pattern) // recompiled repeatedly: ~10-30ms total per run
	scan(path, re)
}
// After: compile once in main, store in a global, reuse everywhere
var compiledRegex *regexp.Regexp

func main() {
	pattern := flag.String("pattern", "", "search pattern")
	flag.Parse()
	var err error
	compiledRegex, err = regexp.Compile(*pattern) // compile ONCE
	if err != nil {
		log.Fatal(err)
	}
	search(compiledRegex)
}
Impact: -30ms startup.
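A runnable sketch of the compile-once pattern: the regex lives at package level and is shared by every match call (errPattern and grepLines are illustrative names, not from the tool's code):

```go
// Compile-once regexp: one regexp.MustCompile at package init,
// reused across all inputs.
package main

import (
	"fmt"
	"regexp"
)

// Compiled exactly once, at package init.
var errPattern = regexp.MustCompile(`(?i)\b(error|fatal)\b`)

// grepLines returns only the lines matching the precompiled pattern.
func grepLines(lines []string) []string {
	var hits []string
	for _, l := range lines {
		if errPattern.MatchString(l) {
			hits = append(hits, l)
		}
	}
	return hits
}

func main() {
	logs := []string{
		"INFO: started",
		"ERROR: disk full",
		"DEBUG: tick",
		"fatal: cannot open repo",
	}
	fmt.Println(len(grepLines(logs))) // 2
}
```

MustCompile panics on a bad pattern, which is appropriate for patterns fixed at build time; user-supplied patterns should go through regexp.Compile and an error check, as in the snippet above.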
2. WalkDir Instead of Walk (No Stat Per File)
// Before: filepath.Walk calls os.Stat per entry
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
if !info.IsDir() {
searchFile(path)
}
return nil
})
// Each entry: an extra lstat syscall ≈ 0.5µs × 100k files ≈ 50ms wasted
// After: filepath.WalkDir includes FileInfo without extra syscalls
filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
if !d.IsDir() {
searchFile(path)
}
return nil
})
// No extra lstat: the entry type comes back with the directory read itself
Impact: -50ms (huge for file traversal).
3. Parallel Directory Traversal
// Before: Sequential scan
func searchDir(path string) {
filepath.WalkDir(path, func(path string, d fs.DirEntry, err error) error {
if !d.IsDir() {
searchFile(path)
}
return nil
})
}
// After: bounded goroutine fan-out for directory recursion
const maxConcurrency = 8

var sem = make(chan struct{}, maxConcurrency)

func searchDirParallel(path string, wg *sync.WaitGroup) {
	defer wg.Done()
	entries, _ := os.ReadDir(path)
	for _, e := range entries {
		full := filepath.Join(path, e.Name())
		if e.IsDir() {
			wg.Add(1)
			select {
			case sem <- struct{}{}: // slot free: recurse in a new goroutine
				go func(subdir string) {
					defer func() { <-sem }()
					searchDirParallel(subdir, wg)
				}(full)
			default: // all slots busy: recurse inline to avoid deadlock
				searchDirParallel(full, wg)
			}
		} else {
			searchFile(full)
		}
	}
}
Impact: 4x faster traversal (parallel directory reads keep the disk queue full).
4. Binary Size Optimization
# Default binary: 15MB
go build -o search
# Production binary: 4.2MB
CGO_ENABLED=0 go build -ldflags="-s -w" -trimpath -o search
# With UPX: 1.8MB (startup cost: 30ms decompression)
upx --best search
For CLI tools, 30ms UPX decompression is often acceptable if saving bandwidth matters.
5. Pre-Compiled Search Patterns
If patterns are known ahead of time:
# Generate code for patterns at build time
go generate ./...
// pattern_codegen.go
//go:generate patterns_generator
// Generated file: patterns_generated.go
var patterns = []*regexp.Regexp{
regexp.MustCompile("error"),
regexp.MustCompile("warning"),
regexp.MustCompile("fatal"),
}
// Compiled at init(), no runtime compilation
Results
Metric Before After Improvement
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Startup Time 200ms 30ms 6.6x
Traversal Time 150ms 40ms 3.7x
Total (Startup+Search) 200ms 70ms 2.8x
Binary Size 15MB 4.2MB 72% smaller
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Startup is now snappy (under 100ms). The remaining 40ms is core file traversal (hard to optimize further without an architecture change like indexing).
Case Study 4: High-Throughput Kafka Message Processor (500k msg/s, Falling Behind)
Scenario
A service consuming from Kafka at 500k messages/second, performing light transformation, and writing to ClickHouse. Consumer is falling behind (lag increasing 10k messages/minute), consuming 80% CPU.
Diagnosis
GC profile shows 50% CPU in garbage collection. CPU profile reveals:
CPU Profile:
- JSON unmarshaling: 25% CPU
- Per-message allocations: 20% CPU
- GC: 50% CPU (mark/sweep)
- ClickHouse INSERTS: 5% CPU
Memory Profile:
- 500k msgs/sec × 2KB per message = 1GB/sec allocation rate
- GC target: 2GB heap
- GC triggered every 2 seconds (pause: 200ms each)
Problem: Excessive allocations trigger frequent GC pauses, causing lag.
Optimizations Applied
1. Object Pooling for Message Structs
// Message structure
type KafkaMessage struct {
ID string `json:"id"`
Timestamp int64 `json:"timestamp"`
Data map[string]interface{} `json:"data"`
Metadata map[string]string `json:"metadata"`
}
// Before: Allocate new message per record
func processMessage(data []byte) {
var msg KafkaMessage
json.Unmarshal(data, &msg)
handleMessage(&msg)
// msg goes out of scope → GC candidate
}
// After: Pool reusable message structs
var msgPool = sync.Pool{
New: func() interface{} {
return &KafkaMessage{
Data: make(map[string]interface{}),
Metadata: make(map[string]string),
}
},
}
func processMessage(data []byte) {
msg := msgPool.Get().(*KafkaMessage)
defer msgPool.Put(msg)
// Reset maps
for k := range msg.Data {
delete(msg.Data, k)
}
for k := range msg.Metadata {
delete(msg.Metadata, k)
}
json.Unmarshal(data, msg)
handleMessage(msg)
// msg returned to pool, reused next iteration
}
Impact: ~70% fewer allocations.
2. Batch Processing
// Before: process each message immediately
for msg := range kafkaConsumer {
	transformed := processMessage(msg)
	writeToClickHouse(transformed) // 500k single-row inserts/sec
}
// After: batch 1000 messages, write together
const batchSize = 1000

batch := make([]*TransformedMessage, 0, batchSize)
for msg := range kafkaConsumer {
	transformed := processMessage(msg)
	batch = append(batch, transformed)
	if len(batch) == batchSize {
		writeClickHouseBatch(batch)
		batch = batch[:0]
	}
}
if len(batch) > 0 {
	writeClickHouseBatch(batch) // flush the final partial batch
}
Impact: 1000x fewer database roundtrips (500k single-row inserts/sec → 500 batch writes/sec).
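The batching logic in isolation, runnable anywhere: a hypothetical flush callback records batch sizes instead of writing to ClickHouse, and the final partial batch is flushed rather than dropped (batchProcess is an illustrative helper):

```go
// Fixed-size batching with a tail flush.
package main

import "fmt"

// batchProcess groups items into batches of batchSize and calls flush on
// each, including the final partial batch.
func batchProcess(items []int, batchSize int, flush func([]int)) {
	batch := make([]int, 0, batchSize)
	for _, it := range items {
		batch = append(batch, it)
		if len(batch) == batchSize {
			flush(batch)
			batch = batch[:0]
		}
	}
	if len(batch) > 0 {
		flush(batch) // don't drop the tail
	}
}

func main() {
	var sizes []int
	items := make([]int, 2500)
	batchProcess(items, 1000, func(b []int) { sizes = append(sizes, len(b)) })
	fmt.Println(sizes) // [1000 1000 500]
}
```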
3. Switch to Protobuf with vtprotobuf
// Before: JSON unmarshaling (slower, larger messages)
import "encoding/json"
type Message struct {
ID string `json:"id"`
Type int `json:"type"`
Data string `json:"data"`
}
// After: Protobuf (compact, fast unmarshaling)
import "google.golang.org/protobuf/proto"
// message.proto
syntax = "proto3";
message Message {
string id = 1;
int32 type = 2;
string data = 3;
}
// With vtprotobuf:
// go install github.com/planetscale/vtprotobuf/cmd/protoc-gen-go-vtproto@latest
// Generate with: protoc --go_out=. --go-vtproto_out=. message.protoProtobuf benefits:
- ~40% smaller message size
- ~3x faster unmarshaling
- ~5x fewer allocations during unmarshaling
4. GOMEMLIMIT and GOGC Tuning
// Set memory limit for GC
import "runtime/debug"
func init() {
// Go 1.19+: Hard limit on heap
debug.SetMemoryLimit(5 * 1024 * 1024 * 1024) // 5GB max
// GOGC: Control GC frequency (default 100 = GC when heap doubles)
// Lower GOGC = more frequent GC, less pausing
// Higher GOGC = less frequent GC, longer pauses
}
// In deployment:
// GOGC=50 (GC when heap grows 50%) - more frequent but shorter pauses
// GOMEMLIMIT=5GiB go run main.go
5. Partitioned Channels to Reduce Contention
// Before: Single channel bottleneck
msgChan := make(chan *Message, 1000)
for i := 0; i < 4; i++ {
go func() {
for msg := range msgChan {
process(msg)
}
}()
}
// After: Partition by message ID hash
const numPartitions = 16
channels := make([]chan *Message, numPartitions)
for i := 0; i < numPartitions; i++ {
channels[i] = make(chan *Message, 1000)
}
// Send messages to partitions
go func() {
for msg := range kafkaConsumer {
partition := hashID(msg.ID) % numPartitions
channels[partition] <- msg
}
}()
// Process each partition independently
for i := 0; i < numPartitions; i++ {
for j := 0; j < 4; j++ {
go func(ch chan *Message) {
for msg := range ch {
process(msg)
}
}(channels[i])
}
}
Impact: Reduces channel lock contention from a single serialization point.
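The hashID helper isn't shown above; a minimal version using FNV-1a from the standard library works, since it maps the same ID to the same partition every time (required to preserve per-key ordering). This is one possible implementation, not the service's actual one:

```go
// Stable partition hashing with hash/fnv.
package main

import (
	"fmt"
	"hash/fnv"
)

const numPartitions = 16

// hashID returns a stable non-negative hash for a message ID.
func hashID(id string) int {
	h := fnv.New32a()
	h.Write([]byte(id)) // never returns an error
	return int(h.Sum32())
}

func main() {
	p := hashID("order-42") % numPartitions
	fmt.Println(p == hashID("order-42")%numPartitions) // true: same key, same partition
	fmt.Println(p >= 0 && p < numPartitions)           // true: in range
}
```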
Results
Metric Before After Improvement
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Throughput 500k msg/s 1.2M msg/s 2.4x
GC Pause Time 200ms 8ms 96% reduction
GC CPU Time 50% 5% 90% reduction
Allocations per Message 8 1 87.5% fewer
Kafka Consumer Lag +10k/min -50k/min Catching up
P99 Latency 150ms 12ms 92% reduction
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The service now sustains 1.2M msg/s with minimal GC impact. Lag decreases instead of increasing.
Case Study 5: Memory-Constrained Service (256MB Limit, Frequent OOMs)
Scenario
A microservice in Kubernetes with 256MB memory limit crashes with OOMKilled events 3-5 times per day. The service runs fine for 6-12 hours, then crashes suddenly.
Diagnosis
Memory profile shows:
Alloc = 240MB Sys = 380MB NumGC = 12345
HeapAlloc = 240MB HeapSys = 380MB HeapIdle = 60MB
HeapInuse = 320MB HeapReleased = 0MB HeapObjects = 1200000
Problem: the heap never shrinks; memory stays fragmented and is never returned to the OS.
Root causes:
- Unbounded cache: LRU cache grows to 200MB, never evicts
- Large temporary allocations: Request processing allocates 5MB buffers
- Memory fragmentation: Go can't return freed memory to OS
Optimizations Applied
1. LRU Cache with Size Limit
// Before: Unbounded cache
type UnboundedCache struct {
data map[string]interface{}
mu sync.RWMutex
}
func (c *UnboundedCache) Get(key string) interface{} {
c.mu.RLock()
defer c.mu.RUnlock()
return c.data[key]
}
func (c *UnboundedCache) Set(key string, value interface{}) {
c.mu.Lock()
defer c.mu.Unlock()
c.data[key] = value // Unbounded growth
}
// After: bounded LRU (github.com/hashicorp/golang-lru)
// The library's Cache is already size-capped and thread-safe, so no extra
// mutex or manual eviction is needed.
type BoundedLRUCache struct {
	lru *lru.Cache
}

func (c *BoundedLRUCache) Get(key string) (interface{}, bool) {
	return c.lru.Get(key)
}

func (c *BoundedLRUCache) Set(key string, value interface{}) {
	c.lru.Add(key, value) // evicts the oldest entry when at capacity
}

// Initialize with a ~10MB budget
cache, _ := lru.New(10000) // 10,000 items × ~1KB avg ≈ 10MB
2. Streaming Processing Instead of Loading Full Dataset
// Before: Load entire file into memory
func processFile(filename string) error {
data, _ := ioutil.ReadFile(filename) // 50MB file loaded at once
var records []Record
json.Unmarshal(data, &records)
for _, record := range records {
process(record)
}
return nil
}
// After: stream records one at a time
func processFile(filename string) error {
	file, err := os.Open(filename)
	if err != nil {
		return err
	}
	defer file.Close()
	decoder := json.NewDecoder(file)
	decoder.UseNumber() // keep numbers as json.Number, avoiding float64 conversions
	for {
		var record Record
		if err := decoder.Decode(&record); err == io.EOF {
			break
		} else if err != nil {
			return err
		}
		process(record)
		// Memory: constant, a few KB per record, not 50MB for the whole file
	}
	return nil
}
3. GOMEMLIMIT and GOGC Tuning
// Force aggressive GC and memory limits
import "runtime/debug"
func init() {
// Hard memory limit: 200MB (leave 56MB for OS/buffers)
debug.SetMemoryLimit(200 * 1024 * 1024)
// GOGC=30: Trigger GC frequently (less memory held, more CPU)
// Tradeoff: 2% CPU increase, zero OOMs
}
Kubernetes deployment:
apiVersion: v1
kind: Pod
metadata:
name: memory-constrained-service
spec:
containers:
- name: app
image: myapp:latest
env:
- name: GOMEMLIMIT
value: "200MiB"
- name: GOGC
value: "30"
resources:
limits:
memory: "256Mi"
cpu: "500m"
4. Escape Analysis Fixes: Return Values to Stack
// Before: Allocates on heap
func newConfig() *Config {
return &Config{
Name: "default",
Timeout: 30 * time.Second,
}
}
// After: Return by value (stack allocated if inlined)
func newConfig() Config {
return Config{
Name: "default",
Timeout: 30 * time.Second,
}
}
// Caller
cfg := newConfig() // Stack-allocated if it doesn't escape; zero heap pressure
5. Arena Allocation for Request-Scoped Data
// Go 1.20+ experimental arenas (requires building with GOEXPERIMENT=arenas)
import "arena"
func handleRequest(w http.ResponseWriter, r *http.Request) {
// Allocate all request data in one arena
a := arena.NewArena()
defer a.Free()
// Allocations using arena.New() within this scope
requestData := arena.New[RequestData](a)
requestData.Parse(r.Body)
response := processRequest(requestData, a)
w.Write(response)
// All allocations freed when arena is freed
}
Impact: Request-scoped allocations freed together, preventing fragmentation.
Results
Metric Before After Improvement
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Heap Usage (Idle) 240MB 120MB 50% reduction
Peak Heap Usage 250MB 180MB 28% reduction
OOM Events per Week 21-35 0 100% elimination
Cache Size Limit Unbounded 10MB Capped
GC Pause Time 50ms 15ms 70% reduction
CPU Usage (from GC) 12% 8% 33% reduction
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The service now runs indefinitely within the 256MB limit. No more OOMKilled events.
Optimization Checklist
Low-Hanging Fruit (Always Check First)
These optimizations have high ROI and minimal complexity:
- Preallocation: make([]T, 0, expectedSize) instead of growing slices
- sync.Pool: Reuse frequently allocated objects
- Connection pooling: Reuse HTTP/database connections
- Buffered I/O: bufio.Reader with an appropriate buffer size
- Batch operations: Process N items together, not individually
- Caching: Cache frequent queries/computations
Typical impact: 10-30% performance improvement, minutes to implement.
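A quick runnable illustration of the preallocation item: with capacity reserved up front, append never reallocates the backing array (the fill helper is illustrative):

```go
// Preallocation vs. organic slice growth.
package main

import "fmt"

// fill appends n items and reports how many times append had to grow
// the backing array.
func fill(n int, prealloc bool) int {
	var s []int
	if prealloc {
		s = make([]int, 0, n)
	}
	grows := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, i)
		if cap(s) != before {
			grows++
		}
	}
	return grows
}

func main() {
	fmt.Println(fill(100000, false) > 0) // true: many reallocations (and copies)
	fmt.Println(fill(100000, true))      // 0: capacity reserved up front
}
```

Each growth reallocates and copies the whole slice, so the savings compound for large or hot-path slices.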
Medium Effort
These require more investigation but pay off on high-traffic services:
- Serialization format change: JSON → Protobuf, MessagePack
- Streaming instead of loading: Process large files incrementally
- Parallelism tuning: GOMAXPROCS, goroutine pool sizing
- Lock-free structures: Atomic operations, channel optimization
- Memory limits: GOMEMLIMIT, GOGC tuning
Typical impact: 50-200% improvement, hours to implement and test.
High Effort
Architecture changes or custom implementations:
- Custom data structures: Hash table, B-tree optimized for use case
- Algorithmic improvements: O(n) → O(log n), reduce redundant work
- System redesign: Move from request-response to streaming/batch
- SIMD/assembly: Hand-optimized hot loops (rare in Go)
Typical impact: 2-10x improvement, days to weeks of development.
Anti-Patterns to Avoid
Premature Optimization
- Optimize without profiling: wasted effort on wrong paths
- Optimize cold paths: 1% of execution time, zero user impact
- Readability vs speed: Choose readability unless profile proves otherwise
Micro-Benchmarking Without Profiling
- Isolated benchmark: 10x faster
- Real workload: 1.1x faster (other bottlenecks still exist)
- Profile first, benchmark to verify improvements
Optimizing Without Measuring
- Applied sync.Pool: "feels faster"
- Really: 0.1% improvement, added complexity
- Always measure before and after
Focusing on Binary Size Without Load Testing
- "Smaller binary = faster startup"
- Reality: Startup already 50ms, total request 2s; irrelevant
- Measure where time is actually spent
Tools Reference
| Tool | Use Case | When NOT to Use |
|---|---|---|
| go test -bench | Micro-benchmarks, tight loops | Whole-system performance |
| pprof -http | Interactive CPU/memory profiling | Latency spikes (use tracing) |
| go tool trace | Goroutine scheduling, GC events | Steady-state CPU profiling |
| benchstat | Compare before/after benchmarks | Single measurements |
| go tool pprof -base | Diff two profiles over time | Real-time monitoring |
| runtime/metrics | GC stats, allocations | Binary size analysis |
| syscall tracing | System call overhead | Code-level CPU time |
| go-torch (legacy; pprof -http now includes flame graphs) | Flamegraph visualization | Small, isolated benchmarks |
Summary
Real-world optimization follows a pattern:
- Measure first: Use pprof, benchmarks, load tests
- Identify: Find the single biggest bottleneck
- Hypothesize: Form specific theory about root cause
- Optimize: Apply targeted fix (usually one of the case studies above)
- Verify: Measure improvement, check for side effects
- Repeat: Move to next bottleneck
The case studies show that solutions vary by domain:
- API gateways: Serialization + pooling + connection reuse
- Data pipelines: Batching + protocol choice (COPY vs INSERT)
- CLI tools: Startup overhead + parallelism + file traversal
- Message processing: Allocations + GC tuning + batch processing
- Memory-constrained: Caching limits + streaming + GOGC tuning
Most services benefit from applying the low-hanging fruit (preallocation, pooling, buffering) first, then measuring to identify the next bottleneck. Amdahl's law and the 90/10 rule remind us: a small fraction of your code accounts for most of the runtime. Find it. Optimize it. Done.