Benchmarking Best Practices
Write precise benchmarks in Go using testing.B, avoid pitfalls, and compare results with benchstat.
Benchmarking measures code performance objectively. Go's testing package provides excellent built-in benchmarking capabilities, but writing correct benchmarks requires understanding common pitfalls, compiler behavior, dead code elimination, profiling integration, and statistical analysis.
Writing Your First Benchmark
Benchmarks follow the same pattern as unit tests but use testing.B:
package main
import (
"testing"
)
func Reverse(s string) string {
runes := []rune(s)
for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
runes[i], runes[j] = runes[j], runes[i]
}
return string(runes)
}
func BenchmarkReverse(b *testing.B) {
s := "hello world"
b.ResetTimer() // Important: exclude setup time
for i := 0; i < b.N; i++ {
Reverse(s)
}
}
Run benchmarks:
go test -bench=. -benchmem
Output:
BenchmarkReverse-8 12000000 95.0 ns/op 64 B/op 2 allocs/op
This means: 12 million iterations, roughly 95 nanoseconds per operation, and 64 bytes across 2 heap allocations per operation (the rune slice and the result string).
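When each iteration processes a known number of bytes, b.SetBytes makes the framework report throughput as an extra MB/s column. A minimal sketch (the checksum function here is a hypothetical stand-in, not part of the example above):

```go
import "testing"

// SinkByte keeps the result observable so the work is not optimized away.
var SinkByte byte

// checksum is a hypothetical byte-processing function for illustration.
func checksum(data []byte) byte {
	var sum byte
	for _, v := range data {
		sum += v
	}
	return sum
}

func BenchmarkChecksum(b *testing.B) {
	data := make([]byte, 64*1024)
	b.SetBytes(int64(len(data))) // bytes processed per iteration
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		SinkByte = checksum(data)
	}
}
```

Running with go test -bench=Checksum then shows an MB/s column alongside ns/op, which is useful for comparing against memory bandwidth.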
Dead Code Elimination (DCE): The Compiler's Optimization Threat
The Go compiler aggressively optimizes away code with unused results. Benchmarks that compute values that are never used may not measure what you think:
// WRONG: Compiler optimizes away entire computation
func BenchmarkWrongSum(b *testing.B) {
for i := 0; i < b.N; i++ {
sum := 0
for j := 0; j < 1000; j++ {
sum += j
}
// sum is never used - compiler removes entire loop!
}
}
// Actual measured time: ~0.2 ns/op (the unused loop was removed entirely)
The Global Sink Variable Pattern
The correct pattern is to assign results to a package-level variable that the compiler cannot prove is unused:
// Global sink variable - prevents DCE
var Sink interface{}
func BenchmarkCorrectSum(b *testing.B) {
var sum int64
b.ResetTimer()
for i := 0; i < b.N; i++ {
sum = 0
for j := 0; j < 1000; j++ {
sum += int64(j)
}
}
Sink = sum // Prevents optimization - compiler can't prove sum is unused
}
// Actual measured time: ~300 ns/op (real computation)
Why does this work? The compiler sees that:
- sum is assigned in the loop
- sum is stored to the global Sink variable
- Sink is exported (it could be read from another package)
- Therefore, the compiler must preserve the computation
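If you are on Go 1.24 or newer, there is a simpler option: testing.B gained a Loop method that both excludes setup from timing and keeps the parameters and results of function calls in the loop body alive, so no global sink is needed. A sketch (requires Go 1.24+; on older toolchains, stick with the sink pattern above):

```go
import "testing"

// sum1000 performs the same work as the benchmark above.
func sum1000() int64 {
	var sum int64
	for j := 0; j < 1000; j++ {
		sum += int64(j)
	}
	return sum
}

// Requires Go 1.24+: b.Loop replaces the manual b.N loop, and the result
// of sum1000 is kept alive automatically - no Sink assignment required.
func BenchmarkSumLoop(b *testing.B) {
	for b.Loop() {
		sum1000()
	}
}
```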
Different sink types for different use cases:
// For different types
var SinkInt int64
var SinkString string
var SinkBytes []byte
var SinkBool bool
var SinkFloat float64
// Or generic:
var Sink interface{}
// Storing into an interface{} sink boxes the value, adding a small overhead
// (and sometimes an allocation); typed sinks cost essentially nothing
func BenchmarkTypedVsInterface(b *testing.B) {
b.Run("typed", func(b *testing.B) {
for i := 0; i < b.N; i++ {
SinkInt = int64(i * i)
}
})
b.Run("interface", func(b *testing.B) {
for i := 0; i < b.N; i++ {
Sink = int64(i * i)
}
})
}
// Benchmark results:
// BenchmarkTypedVsInterface/typed-8 3000000000 0.34 ns/op
// BenchmarkTypedVsInterface/interface-8 2000000000 0.65 ns/op (boxing ~0.3ns)
Preventing Function Inlining: The //go:noinline Pragma
Sometimes the compiler inlines functions, which distorts benchmarks by eliminating function call overhead:
// With inlining
func add(a, b int) int {
return a + b
}
func BenchmarkInlined(b *testing.B) {
for i := 0; i < b.N; i++ {
SinkInt = int64(add(1, 2))
}
}
// The add function call disappears, compiler optimizes to SinkInt = 3
// With noinline pragma
//go:noinline
func addNoInline(a, b int) int {
return a + b
}
func BenchmarkNoInline(b *testing.B) {
for i := 0; i < b.N; i++ {
SinkInt = int64(addNoInline(1, 2))
}
}
// Benchmark results:
// BenchmarkInlined-8 3000000000 0.34 ns/op (no call overhead)
// BenchmarkNoInline-8 2000000000 0.64 ns/op (includes call overhead ~0.3ns)
runtime.KeepAlive: Preventing GC of Benchmarked Objects
When benchmarking operations that create objects that might be garbage collected:
func BenchmarkSliceAlloc(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
s := make([]byte, 1000)
// Without KeepAlive, compiler might optimize away allocation
runtime.KeepAlive(s)
}
}
// KeepAlive tells the compiler: "this object is live here"
// Prevents premature GC and allocation elimination
Timer Control: ResetTimer, StopTimer, StartTimer
Proper timer management is crucial for accurate measurements:
func BenchmarkWithSetup(b *testing.B) {
// Setup happens outside the timer (not measured)
data := make([]int, 10000)
for i := range data {
data[i] = i
}
b.ResetTimer() // Start measuring from here
for i := 0; i < b.N; i++ {
processData(data)
}
}
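processData is not defined in the original example; a minimal hypothetical stand-in consistent with its use might be:

```go
// SinkInt64 prevents the compiler from discarding the computed sum.
var SinkInt64 int64

// processData is a hypothetical workload: sum the slice and publish the
// result through a sink so the loop cannot be eliminated.
func processData(data []int) {
	var sum int64
	for _, v := range data {
		sum += int64(v)
	}
	SinkInt64 = sum
}
```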
func BenchmarkPausedWork(b *testing.B) {
data := generateData()
for i := 0; i < b.N; i++ {
b.StopTimer()
// Cleanup per iteration (not measured)
cleanupIter := expensivePerIterationSetup()
b.StartTimer()
processWithCleanup(data, cleanupIter)
}
}
Warning about StopTimer/StartTimer: These call runtime.ReadMemStats() internally, which is expensive (~microseconds). Only use for per-iteration setup that truly cannot be pre-computed.
// WRONG: StopTimer/StartTimer called 10,000,000 times
func BenchmarkBadStopStart(b *testing.B) {
for i := 0; i < b.N; i++ {
b.StopTimer()
// Small work
b.StartTimer()
// Small work
}
}
// Benchmark results: Dominated by timer overhead, not actual work!
// RIGHT: Factor out the per-iteration setup
func BenchmarkGoodSetup(b *testing.B) {
setupData := generateAllSetupData(b.N)
b.ResetTimer()
for i := 0; i < b.N; i++ {
process(setupData[i])
}
}
Reporting Allocations with b.ReportAllocs
func BenchmarkAllocations(b *testing.B) {
b.ReportAllocs() // Enable allocation reporting
for i := 0; i < b.N; i++ {
// Each iteration allocates; the sink forces the slice to escape to the heap
// (with _ = s the compiler could keep it on the stack and report 0 allocs)
s := make([]byte, 100)
Sink = s
}
}
// Run with -benchmem or use b.ReportAllocs():
// BenchmarkAllocations-8 10000000 120 ns/op 112 B/op 1 allocs/op
Interpret the output:
- B/op: bytes allocated per operation
- allocs/op: number of heap allocations per operation
Lower is better for both. Even allocations that are freed quickly increase GC pressure and affect latency.
// Example: Identifying allocation inefficiencies
var Sink []byte
func BenchmarkAllocationPatterns(b *testing.B) {
b.Run("Efficient", func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
// Pre-allocate exact capacity
buf := make([]byte, 0, 1000)
for j := 0; j < 1000; j++ {
buf = append(buf, byte(j))
}
Sink = buf
}
})
b.Run("Inefficient", func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
// No pre-allocation, grows dynamically
var buf []byte
for j := 0; j < 1000; j++ {
buf = append(buf, byte(j))
}
Sink = buf
}
})
}
// Representative results:
// BenchmarkAllocationPatterns/Efficient-8 2000000 750 ns/op 1024 B/op 1 allocs/op
// BenchmarkAllocationPatterns/Inefficient-8 1000000 1600 ns/op 2040 B/op 11 allocs/op
// Inefficient version:
// - ~2x slower (repeated reallocation and copying as the slice grows)
// - ~2x more memory (every intermediate backing array counts toward B/op)
// - 11 allocations vs 1 (each capacity doubling allocates a new array)
Sub-Benchmarks with b.Run()
Organize related benchmarks into families:
func BenchmarkStringOps(b *testing.B) {
b.Run("Concat", func(b *testing.B) {
b.ReportAllocs()
s1, s2 := "hello", "world"
for i := 0; i < b.N; i++ {
SinkString = s1 + s2 // "hello" + "world" as constants would be folded at compile time
}
})
b.Run("Builder", func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
var sb strings.Builder
sb.WriteString("hello")
sb.WriteString("world")
_ = sb.String()
}
})
b.Run("Join", func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
_ = strings.Join([]string{"hello", "world"}, "")
}
})
}
// Run specific benchmarks:
// go test -bench=StringOps/Builder
// go test -bench=StringOps
Table-Driven Benchmarks
Benchmark functions with varying inputs:
func BenchmarkParseInt(b *testing.B) {
tests := []struct {
name string
input string
}{
{"Small", "42"},
{"Medium", "123456789"},
{"Large", "9223372036854775807"},
{"Negative", "-9223372036854775808"},
}
for _, tt := range tests {
b.Run(tt.name, func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
_, _ = strconv.ParseInt(tt.input, 10, 64)
}
})
}
}
// Results show performance differences by input size
Benchmark Stability and Gotchas
Gotcha 1: Timer Resolution
Different platforms have different timer resolutions:
// On Linux: ~1 nanosecond resolution
// On Windows/macOS: ~100 nanosecond resolution
// If your benchmark runs in < 100ns on Windows, results are unreliable
// Solution: Run more iterations
func BenchmarkTinyOp(b *testing.B) {
// This is too fast to measure reliably
for i := 0; i < b.N; i++ {
_ = 1 + 1
}
}
// Better: Batch operations
func BenchmarkTinyOpBatched(b *testing.B) {
for i := 0; i < b.N; i++ {
sum := 0
for j := 0; j < 1000; j++ {
sum += 1
}
Sink = sum
}
}
Gotcha 2: Compiler Optimizations Between Iterations
Performance can vary between runs even though the compiled code is identical:
// Run-to-run variance comes from the runtime environment:
// - Branch prediction behavior
// - CPU cache state
// - Power management transitions
// Solution: Use -count flag to run multiple times
// go test -bench=. -count=5 # Run each benchmark 5 times
Gotcha 3: CPU Frequency Scaling
Laptop/server CPU frequency scaling affects results:
# On Linux, check current CPU frequency
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
cat /proc/cpuinfo | grep MHz
# Disable frequency scaling for consistent results (requires root)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Run benchmarks, then restore
echo powersave | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Gotcha 4: Cache Warming - First Iterations vs Steady State
The first few iterations might be slower due to cold caches:
func BenchmarkWithCacheWarming(b *testing.B) {
// Array larger than L3 cache (~20MB on modern CPUs)
data := make([]int, 10_000_000)
for i := range data {
data[i] = i
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
// First iteration: cache miss penalty
// Subsequent iterations: potentially faster due to cache hits
sumSequential(data)
}
}
// To measure cold-cache behavior, use a working set larger than the cache
// or randomize the access pattern so iterations cannot reuse cached lines
Gotcha 5: Power Management
Server CPUs maintain performance states that affect benchmarking:
# Some CPUs have C-states that reduce frequency when idle
# Results vary based on system load and thermal state
# On Linux, check C-states:
cat /sys/devices/system/cpu/cpu0/cpuidle/state0/name
# Disable C-states for benchmarking (requires root):
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state0/disable
Custom Benchmark Metrics with b.ReportMetric
Report custom measurements beyond time and allocations:
func BenchmarkThroughput(b *testing.B) {
b.ReportAllocs()
itemsProcessed := 0
for i := 0; i < b.N; i++ {
// Process 100 items per iteration
itemsProcessed += 100
doWork()
}
// Per-iteration metrics should be divided by b.N and use a "/op"-style unit
b.ReportMetric(float64(itemsProcessed)/float64(b.N), "items/op")
b.ReportMetric(float64(itemsProcessed)/b.Elapsed().Seconds(), "items/sec")
}
// Output:
// BenchmarkThroughput-8 1000 1234567 ns/op 100 items/op 81000 items/sec
CPU Profiling During Benchmarks
Combine benchmarks with CPU profiling:
# Generate CPU profile
go test -bench=BenchmarkExpensive -cpuprofile=cpu.prof
# Analyze with pprof
go tool pprof cpu.prof
# In pprof:
# top10 - Show top 10 functions by CPU time
# list main.Expensive - Show annotated source
# web - Render the call graph in a browser
Example analysis:
func BenchmarkJsonParse(b *testing.B) {
jsonData := []byte(`{"name":"John","age":30,"city":"New York"}`)
b.ResetTimer()
for i := 0; i < b.N; i++ {
var result map[string]interface{}
json.Unmarshal(jsonData, &result)
}
}
// go test -bench=BenchmarkJsonParse -cpuprofile=cpu.prof
// go tool pprof cpu.prof
//
// (pprof) top10
// Showing nodes accounting for 1500ms, 75% of 2000ms total
// flat flat% sum% cum cum%
// 800ms 40.0% 40.0% 800ms 40.0% encoding/json.(*Decoder).readValue
// 400ms 20.0% 60.0% 400ms 20.0% runtime.mallocgc
// 300ms 15.0% 75.0% 700ms 35.0% reflect.Value.Set
Memory Profiling During Benchmarks
Analyze memory allocations:
# Generate memory profile
go test -bench=BenchmarkMemory -memprofile=mem.prof -benchmem
# Analyze allocations
go tool pprof mem.prof
# In pprof:
# top10 -cum - Top 10 by cumulative allocations
# sample_index=alloc_objects - Number of allocations
# sample_index=alloc_space - Total bytes allocated
# sample_index=inuse_objects - Currently live objects
# sample_index=inuse_space - Currently live bytes
Example:
func BenchmarkStringBuilding(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
var sb strings.Builder
for j := 0; j < 100; j++ {
sb.WriteString(fmt.Sprintf("Item %d\n", j))
}
_ = sb.String()
}
}
// go test -bench=BenchmarkStringBuilding -memprofile=mem.prof -benchmem
// go tool pprof mem.prof
//
// (pprof) sample_index=alloc_space
// Showing nodes accounting for 5120MB, 100% of 5120MB total
// flat flat% sum% cum cum%
// 2560MB 50% 50% 2560MB 50% fmt.Sprintf
// 1280MB 25% 75% 1280MB 25% strings.(*Builder).WriteString
benchstat: Statistical Comparison Workflow
Statistical testing with benchstat:
# Run baseline (several samples, so benchstat can do statistics)
go test -bench=. -benchmem -count=10 > old.txt
# Modify code
# Run new version
go test -bench=. -benchmem -count=10 > new.txt
# Compare with statistical rigor
benchstat old.txt new.txt
Install benchstat:
go install golang.org/x/perf/cmd/benchstat@latest
Example output:
name old time/op new time/op delta
BenchmarkStringConcat-8 2.45 µs ± 5% 1.20 µs ± 3% -50.20% (p=0.000 n=10)
name old alloc/op new alloc/op delta
BenchmarkStringConcat-8 16.0 B ± 0% 0.0 B ± 0% -100.00% (p=0.000 n=10)
Understanding benchstat Output
- time/op: Time per operation, with coefficient of variation (±5%)
- delta: Percentage change (negative = improvement)
- p-value: Statistical significance (p=0.000 = highly significant)
- n=10: Number of measurements
p-values interpretation:
- p < 0.05: statistically significant difference
- p > 0.05: no statistically significant difference (likely noise)
Proper Statistical Workflow
# Run minimum 10 times for statistical significance
go test -bench=. -benchmem -count=10 > baseline.txt
# Modify code
go test -bench=. -benchmem -count=10 > optimized.txt
benchstat -delta-test=utest baseline.txt optimized.txt
Options:
- -delta-test=utest: Mann-Whitney U test (the default; nonparametric, robust to outliers)
- -delta-test=ttest: two-sample Welch's t-test (assumes roughly normal samples)
- -delta-test=none: just show percentages (quick exploration)
Fuzz Testing for Performance: Finding Worst-Case Inputs
Use fuzzing to find inputs that trigger slow paths:
func FuzzJsonParse(f *testing.F) {
f.Add([]byte(`{"name":"John"}`))
f.Add([]byte(`[]`))
f.Add([]byte(`null`))
f.Fuzz(func(t *testing.T, input []byte) {
var result interface{}
json.Unmarshal(input, &result)
})
}
// Run with fuzzing to find worst-case inputs:
// go test -fuzz=FuzzJsonParse -fuzztime=30s
//
// Might discover that deeply nested JSON is slow:
// {"a":{"b":{"c":{"d":{"e":{"f":{"g":...}}}}}}}
Then benchmark with discovered worst-case:
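generateDeepNesting is not shown in the original; a possible sketch of the helper the worst-case benchmark relies on:

```go
import "strings"

// generateDeepNesting builds a JSON document nested depth levels deep,
// e.g. depth 3 produces {"a":{"a":{"a":1}}}. Hypothetical helper.
func generateDeepNesting(depth int) []byte {
	var sb strings.Builder
	for i := 0; i < depth; i++ {
		sb.WriteString(`{"a":`)
	}
	sb.WriteString("1")
	sb.WriteString(strings.Repeat("}", depth))
	return []byte(sb.String())
}
```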
func BenchmarkJsonWorstCase(b *testing.B) {
// Deeply nested JSON (slow path)
jsonData := generateDeepNesting(100)
b.ResetTimer()
for i := 0; i < b.N; i++ {
var result interface{}
json.Unmarshal(jsonData, &result)
}
}
Continuous Benchmarking in CI
Using gobenchdata
Track performance over time:
go install go.bobheadxi.dev/gobenchdata@latest
# Pipe benchmark output in to record results as JSON
# (exact flags vary by version - check the project README)
go test -bench=. -benchmem ./... | gobenchdata --json bench.json
# In CI, evaluate the new results against configured checks
gobenchdata checks eval base.json bench.json
Using benchdiff
Quick CI comparison - benchdiff checks out a base git ref, benchmarks both versions, and summarizes the delta with benchstat:
go install github.com/willabides/benchdiff/cmd/benchdiff@latest
# From the working tree containing your changes
# (exact flags vary by version - check the project README)
benchdiff --base-ref=main
Real-World Example: Profiling and Optimizing a JSON Parser
Step-by-step performance investigation:
// Step 1: Write benchmark
func BenchmarkParseJSON(b *testing.B) {
data := []byte(`{"users":[{"id":1,"name":"John"},{"id":2,"name":"Jane"}]}`)
b.ResetTimer()
for i := 0; i < b.N; i++ {
var result interface{}
json.Unmarshal(data, &result)
}
}
// go test -bench=. -benchmem
// BenchmarkParseJSON-8 300000 4567 ns/op 1248 B/op 25 allocs/op
// Step 2: Profile
// go test -bench=. -cpuprofile=cpu.prof -memprofile=mem.prof
// go tool pprof cpu.prof
//
// (pprof) top
// 1500ms 50% encoding/json.(*Decoder).readValue
// 1000ms 33% runtime.mallocgc
// 500ms 17% reflect.Value.Set
// Problem identified: Excessive allocations in reflect operations
// Step 3: Optimize - use custom unmarshaler
type User struct {
ID int
Name string
}
type Response struct {
Users []User
}
func BenchmarkParseJSONOptimized(b *testing.B) {
data := []byte(`{"users":[{"id":1,"name":"John"},{"id":2,"name":"Jane"}]}`)
b.ResetTimer()
for i := 0; i < b.N; i++ {
var result Response
json.Unmarshal(data, &result)
}
}
// go test -bench=. -benchmem
// BenchmarkParseJSONOptimized-8 600000 1890 ns/op 512 B/op 8 allocs/op
// ~59% faster, ~59% less memory, and 68% fewer allocations!
// Step 4: Further optimize - decode directly
func BenchmarkParseJSONDirect(b *testing.B) {
data := []byte(`{"users":[{"id":1,"name":"John"},{"id":2,"name":"Jane"}]}`)
b.ResetTimer()
for i := 0; i < b.N; i++ {
decoder := json.NewDecoder(bytes.NewReader(data))
var result Response
decoder.Decode(&result)
}
}
// BenchmarkParseJSONDirect-8 600000 1950 ns/op 896 B/op 10 allocs/op
// Similar performance, but decoding is more incremental for streams
Common Benchmarking Pitfalls
Pitfall 1: Loop Too Small, N Grows Wildly
// WRONG: N will be huge (compiler optimizes away)
func BenchmarkTinyOp(b *testing.B) {
for i := 0; i < b.N; i++ {
_ = 1 + 1 // Compiler might remove
}
}
// RIGHT: Each iteration takes measurable time
func BenchmarkReasonableOp(b *testing.B) {
for i := 0; i < b.N; i++ {
sum := 0
for j := 0; j < 1000; j++ {
sum += j
}
Sink = sum
}
}
Pitfall 2: Measuring Setup Cost
// WRONG: Includes allocation time
func BenchmarkBadSetup(b *testing.B) {
for i := 0; i < b.N; i++ {
data := make([]byte, 10_000_000)
processData(data)
}
}
// RIGHT: Setup outside timer
func BenchmarkGoodSetup(b *testing.B) {
data := make([]byte, 10_000_000)
b.ResetTimer()
for i := 0; i < b.N; i++ {
processData(data)
}
}
Pitfall 3: Not Accounting for GC
// WRONG: GC pause during benchmark distorts results
func BenchmarkWithoutGCControl(b *testing.B) {
for i := 0; i < b.N; i++ {
createManySmallObjects()
}
}
// RIGHT: Disable GC for deterministic results (if measuring computation only)
func BenchmarkDisableGC(b *testing.B) {
debug.SetGCPercent(-1)
defer debug.SetGCPercent(100)
b.ResetTimer()
for i := 0; i < b.N; i++ {
createManySmallObjects()
}
}
// Or: Run multiple times to amortize GC cost
// go test -bench=. -count=5
Pitfall 4: Benchmark Flakiness from External Factors
// WRONG: Sensitive to system load
func BenchmarkNetworkCall(b *testing.B) {
for i := 0; i < b.N; i++ {
resp, _ := http.Get("https://example.com")
resp.Body.Close()
}
}
// RIGHT: Mock network for deterministic results
func BenchmarkNetworkCallMocked(b *testing.B) {
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
w.Write([]byte(`{"status":"ok"}`))
}))
defer server.Close()
b.ResetTimer()
for i := 0; i < b.N; i++ {
resp, _ := http.Get(server.URL)
resp.Body.Close()
}
}
Advanced: Benchmarking Memory and GC Impact
func BenchmarkGCImpact(b *testing.B) {
b.Run("NoGC", func(b *testing.B) {
debug.SetGCPercent(-1)
defer debug.SetGCPercent(100)
b.ReportAllocs()
for i := 0; i < b.N; i++ {
createManyObjects()
}
})
b.Run("WithGC", func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
createManyObjects()
}
})
b.Run("GCPause", func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
createManyObjects()
if i%10 == 0 {
runtime.GC() // Force GC to measure pause time
}
}
})
}
Benchmark Flags Reference
go test -bench=pattern # Run benchmarks matching pattern
go test -benchmem # Report allocations
go test -benchtime=10s # Change duration (default 1s)
go test -count=5 # Run each benchmark N times
go test -cpu=1,2,4 # Test with different GOMAXPROCS
go test -cpuprofile=cpu.prof # Capture CPU profile
go test -memprofile=mem.prof # Capture memory profile
go test -trace=trace.out # Capture execution trace
Key Takeaways
- Always use sink variables to prevent dead code elimination
- Understand timer control: ResetTimer() after setup; avoid StopTimer/StartTimer in loops
- Report allocations with -benchmem or b.ReportAllocs()
- Use b.Run() for sub-benchmarks to organize related tests
- Run multiple times: go test -count=10 for statistical significance
- Apply benchstat for proper comparison: benchstat old.txt new.txt
- Profile during benchmarks: use -cpuprofile and -memprofile to find bottlenecks
- Control environment: disable frequency scaling, set CPU affinity for reproducible results
- Mock external dependencies: don't benchmark network calls or I/O directly
- Benchmark realistic workloads: small operations are hard to measure accurately
Benchmarking is an art and science. Correct technique, careful interpretation, and statistical rigor lead to actionable performance insights.