Benchmarking Best Practices
Write precise benchmarks in Go using testing.B, avoid pitfalls, and compare results with benchstat.
Benchmarking measures code performance objectively. Go's testing package provides excellent built-in benchmarking capabilities, but writing correct benchmarks requires understanding common pitfalls, compiler behavior, dead code elimination, profiling integration, and statistical analysis.
Writing Your First Benchmark
Benchmarks follow the same pattern as unit tests but use testing.B:
package main
import (
"testing"
)
func Reverse(s string) string {
runes := []rune(s)
for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
runes[i], runes[j] = runes[j], runes[i]
}
return string(runes)
}
func BenchmarkReverse(b *testing.B) {
s := "hello world"
b.ResetTimer() // Important: exclude setup time
for i := 0; i < b.N; i++ {
Reverse(s)
}
}
Run benchmarks:
go test -bench=. -benchmem
Output:
BenchmarkReverse-8 12000000 95.0 ns/op 64 B/op 2 allocs/op
This means: 12 million iterations, roughly 95 nanoseconds per operation, and 64 bytes across 2 heap allocations per operation (the rune slice and the result string).
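When each iteration processes a known number of bytes, b.SetBytes makes the framework report throughput as an extra MB/s column. A minimal sketch (the checksum function here is a hypothetical stand-in, not part of the example above):

```go
import "testing"

// SinkByte keeps the result observable so the work is not optimized away.
var SinkByte byte

// checksum is a hypothetical byte-processing function for illustration.
func checksum(data []byte) byte {
	var sum byte
	for _, v := range data {
		sum += v
	}
	return sum
}

func BenchmarkChecksum(b *testing.B) {
	data := make([]byte, 64*1024)
	b.SetBytes(int64(len(data))) // bytes processed per iteration
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		SinkByte = checksum(data)
	}
}
```

Running with go test -bench=Checksum then shows an MB/s column alongside ns/op, which is useful for comparing against memory bandwidth.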
Dead Code Elimination (DCE): The Compiler's Optimization Threat
The Go compiler aggressively optimizes away code with unused results. Benchmarks that compute values that are never used may not measure what you think:
// WRONG: Compiler optimizes away entire computation
func BenchmarkWrongSum(b *testing.B) {
for i := 0; i < b.N; i++ {
sum := 0
for j := 0; j < 1000; j++ {
sum += j
}
// sum is never used - compiler removes entire loop!
}
}
// Actual measured time: ~0.2 ns/op (the unused loop was removed entirely)
The Global Sink Variable Pattern
The correct pattern is to assign results to a package-level variable that the compiler cannot prove is unused:
// Global sink variable - prevents DCE
var Sink interface{}
func BenchmarkCorrectSum(b *testing.B) {
var sum int64
b.ResetTimer()
for i := 0; i < b.N; i++ {
sum = 0
for j := 0; j < 1000; j++ {
sum += int64(j)
}
}
Sink = sum // Prevents optimization - compiler can't prove sum is unused
}
// Actual measured time: ~300 ns/op (real computation)
Why does this work? The compiler sees that:
- sum is assigned in the loop
- sum is stored to the global Sink variable
- Sink is exported (it could be read from another package)
- Therefore, the compiler must preserve the computation
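If you are on Go 1.24 or newer, there is a simpler option: testing.B gained a Loop method that both excludes setup from timing and keeps the parameters and results of function calls in the loop body alive, so no global sink is needed. A sketch (requires Go 1.24+; on older toolchains, stick with the sink pattern above):

```go
import "testing"

// sum1000 performs the same work as the benchmark above.
func sum1000() int64 {
	var sum int64
	for j := 0; j < 1000; j++ {
		sum += int64(j)
	}
	return sum
}

// Requires Go 1.24+: b.Loop replaces the manual b.N loop, and the result
// of sum1000 is kept alive automatically - no Sink assignment required.
func BenchmarkSumLoop(b *testing.B) {
	for b.Loop() {
		sum1000()
	}
}
```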
Different sink types for different use cases:
// For different types
var SinkInt int64
var SinkString string
var SinkBytes []byte
var SinkBool bool
var SinkFloat float64
// Or generic:
var Sink interface{}
// Storing into an interface{} sink boxes the value, adding a small overhead
// (and sometimes an allocation); typed sinks cost essentially nothing
func BenchmarkTypedVsInterface(b *testing.B) {
b.Run("typed", func(b *testing.B) {
for i := 0; i < b.N; i++ {
SinkInt = int64(i * i)
}
})
b.Run("interface", func(b *testing.B) {
for i := 0; i < b.N; i++ {
Sink = int64(i * i)
}
})
}
// Benchmark results:
// BenchmarkTypedVsInterface/typed-8 3000000000 0.34 ns/op
// BenchmarkTypedVsInterface/interface-8 2000000000 0.65 ns/op (boxing ~0.3ns)
Preventing Function Inlining: The //go:noinline Pragma
Sometimes the compiler inlines functions, which distorts benchmarks by eliminating function call overhead:
// With inlining
func add(a, b int) int {
return a + b
}
func BenchmarkInlined(b *testing.B) {
for i := 0; i < b.N; i++ {
SinkInt = int64(add(1, 2))
}
}
// The add function call disappears, compiler optimizes to SinkInt = 3
// With noinline pragma
//go:noinline
func addNoInline(a, b int) int {
return a + b
}
func BenchmarkNoInline(b *testing.B) {
for i := 0; i < b.N; i++ {
SinkInt = int64(addNoInline(1, 2))
}
}
// Benchmark results:
// BenchmarkInlined-8 3000000000 0.34 ns/op (no call overhead)
// BenchmarkNoInline-8 2000000000 0.64 ns/op (includes call overhead ~0.3ns)
runtime.KeepAlive: Preventing GC of Benchmarked Objects
When benchmarking operations that create objects that might be garbage collected:
func BenchmarkSliceAlloc(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
s := make([]byte, 1000)
// Without KeepAlive, compiler might optimize away allocation
runtime.KeepAlive(s)
}
}
// KeepAlive tells the compiler: "this object is live here"
// Prevents premature GC and allocation elimination
Timer Control: ResetTimer, StopTimer, StartTimer
Proper timer management is crucial for accurate measurements:
func BenchmarkWithSetup(b *testing.B) {
// Setup happens outside the timer (not measured)
data := make([]int, 10000)
for i := range data {
data[i] = i
}
b.ResetTimer() // Start measuring from here
for i := 0; i < b.N; i++ {
processData(data)
}
}
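processData is not defined in the original example; a minimal hypothetical stand-in consistent with its use might be:

```go
// SinkInt64 prevents the compiler from discarding the computed sum.
var SinkInt64 int64

// processData is a hypothetical workload: sum the slice and publish the
// result through a sink so the loop cannot be eliminated.
func processData(data []int) {
	var sum int64
	for _, v := range data {
		sum += int64(v)
	}
	SinkInt64 = sum
}
```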
func BenchmarkPausedWork(b *testing.B) {
data := generateData()
for i := 0; i < b.N; i++ {
b.StopTimer()
// Cleanup per iteration (not measured)
cleanupIter := expensivePerIterationSetup()
b.StartTimer()
processWithCleanup(data, cleanupIter)
}
}
Warning about StopTimer/StartTimer: These call runtime.ReadMemStats() internally, which is expensive (~microseconds). Only use for per-iteration setup that truly cannot be pre-computed.
// WRONG: StopTimer/StartTimer called 10,000,000 times
func BenchmarkBadStopStart(b *testing.B) {
for i := 0; i < b.N; i++ {
b.StopTimer()
// Small work
b.StartTimer()
// Small work
}
}
// Benchmark results: Dominated by timer overhead, not actual work!
// RIGHT: Factor out the per-iteration setup
func BenchmarkGoodSetup(b *testing.B) {
setupData := generateAllSetupData(b.N)
b.ResetTimer()
for i := 0; i < b.N; i++ {
process(setupData[i])
}
}
Reporting Allocations with b.ReportAllocs
func BenchmarkAllocations(b *testing.B) {
b.ReportAllocs() // Enable allocation reporting
for i := 0; i < b.N; i++ {
// Each iteration allocates; the sink forces the slice to escape to the heap
// (with _ = s the compiler could keep it on the stack and report 0 allocs)
s := make([]byte, 100)
Sink = s
}
}
// Run with -benchmem or use b.ReportAllocs():
// BenchmarkAllocations-8 10000000 120 ns/op 112 B/op 1 allocs/op
Interpret the output:
- B/op: bytes allocated per operation
- allocs/op: number of heap allocations per operation
Lower is better for both. Even allocations that are freed quickly increase GC pressure and affect latency.
// Example: Identifying allocation inefficiencies
var Sink []byte
func BenchmarkAllocationPatterns(b *testing.B) {
b.Run("Efficient", func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
// Pre-allocate exact capacity
buf := make([]byte, 0, 1000)
for j := 0; j < 1000; j++ {
buf = append(buf, byte(j))
}
Sink = buf
}
})
b.Run("Inefficient", func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
// No pre-allocation, grows dynamically
var buf []byte
for j := 0; j < 1000; j++ {
buf = append(buf, byte(j))
}
Sink = buf
}
})
}
// Representative results:
// BenchmarkAllocationPatterns/Efficient-8 2000000 750 ns/op 1024 B/op 1 allocs/op
// BenchmarkAllocationPatterns/Inefficient-8 1000000 1600 ns/op 2040 B/op 11 allocs/op
// Inefficient version:
// - ~2x slower (repeated reallocation and copying as the slice grows)
// - ~2x more memory (every intermediate backing array counts toward B/op)
// - 11 allocations vs 1 (each capacity doubling allocates a new array)
Sub-Benchmarks with b.Run()
Organize related benchmarks into families:
func BenchmarkStringOps(b *testing.B) {
b.Run("Concat", func(b *testing.B) {
b.ReportAllocs()
s1, s2 := "hello", "world"
for i := 0; i < b.N; i++ {
SinkString = s1 + s2 // "hello" + "world" as constants would be folded at compile time
}
})
b.Run("Builder", func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
var sb strings.Builder
sb.WriteString("hello")
sb.WriteString("world")
_ = sb.String()
}
})
b.Run("Join", func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
_ = strings.Join([]string{"hello", "world"}, "")
}
})
}
// Run specific benchmarks:
// go test -bench=StringOps/Builder
// go test -bench=StringOps
Table-Driven Benchmarks
Benchmark functions with varying inputs:
func BenchmarkParseInt(b *testing.B) {
tests := []struct {
name string
input string
}{
{"Small", "42"},
{"Medium", "123456789"},
{"Large", "9223372036854775807"},
{"Negative", "-9223372036854775808"},
}
for _, tt := range tests {
b.Run(tt.name, func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
_, _ = strconv.ParseInt(tt.input, 10, 64)
}
})
}
}
// Results show performance differences by input size
Benchmark Stability and Gotchas
Gotcha 1: Timer Resolution
Different platforms have different timer resolutions:
// On Linux: ~1 nanosecond resolution
// On Windows/macOS: ~100 nanosecond resolution
// If your benchmark runs in < 100ns on Windows, results are unreliable
// Solution: Run more iterations
func BenchmarkTinyOp(b *testing.B) {
// This is too fast to measure reliably
for i := 0; i < b.N; i++ {
_ = 1 + 1
}
}
// Better: Batch operations
func BenchmarkTinyOpBatched(b *testing.B) {
for i := 0; i < b.N; i++ {
sum := 0
for j := 0; j < 1000; j++ {
sum += 1
}
Sink = sum
}
}
Gotcha 2: Compiler Optimizations Between Iterations
Performance can vary between runs even though the compiled code is identical:
// Run-to-run variance comes from the runtime environment:
// - Branch prediction behavior
// - CPU cache state
// - Power management transitions
// Solution: Use -count flag to run multiple times
// go test -bench=. -count=5 # Run each benchmark 5 times
Gotcha 3: CPU Frequency Scaling
Laptop/server CPU frequency scaling affects results:
# On Linux, check current CPU frequency
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
cat /proc/cpuinfo | grep MHz
# Disable frequency scaling for consistent results (requires root)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Run benchmarks, then restore
echo powersave | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Gotcha 4: Cache Warming - First Iterations vs Steady State
The first few iterations might be slower due to cold caches:
func BenchmarkWithCacheWarming(b *testing.B) {
// Array larger than L3 cache (~20MB on modern CPUs)
data := make([]int, 10_000_000)
for i := range data {
data[i] = i
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
// First iteration: cache miss penalty
// Subsequent iterations: potentially faster due to cache hits
sumSequential(data)
}
}
// To measure cold-cache behavior, use a working set larger than the cache
// or randomize the access pattern so iterations cannot reuse cached lines
Gotcha 5: Power Management
Server CPUs maintain performance states that affect benchmarking:
# Some CPUs have C-states that reduce frequency when idle
# Results vary based on system load and thermal state
# On Linux, check C-states:
cat /sys/devices/system/cpu/cpu0/cpuidle/state0/name
# Disable C-states for benchmarking (requires root):
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state0/disable
Custom Benchmark Metrics with b.ReportMetric
Report custom measurements beyond time and allocations:
func BenchmarkThroughput(b *testing.B) {
b.ReportAllocs()
itemsProcessed := 0
for i := 0; i < b.N; i++ {
// Process 100 items per iteration
itemsProcessed += 100
doWork()
}
// Per-iteration metrics should be divided by b.N and use a "/op"-style unit
b.ReportMetric(float64(itemsProcessed)/float64(b.N), "items/op")
b.ReportMetric(float64(itemsProcessed)/b.Elapsed().Seconds(), "items/sec")
}
// Output:
// BenchmarkThroughput-8 1000 1234567 ns/op 100 items/op 81000 items/sec
CPU Profiling During Benchmarks
Combine benchmarks with CPU profiling:
# Generate CPU profile
go test -bench=BenchmarkExpensive -cpuprofile=cpu.prof
# Analyze with pprof
go tool pprof cpu.prof
# In pprof:
# top10 - Show top 10 functions by CPU time
# list main.Expensive - Show annotated source
# web - Render the call graph in a browser
Example analysis:
func BenchmarkJsonParse(b *testing.B) {
jsonData := []byte(`{"name":"John","age":30,"city":"New York"}`)
b.ResetTimer()
for i := 0; i < b.N; i++ {
var result map[string]interface{}
json.Unmarshal(jsonData, &result)
}
}
// go test -bench=BenchmarkJsonParse -cpuprofile=cpu.prof
// go tool pprof cpu.prof
//
// (pprof) top10
// Showing nodes accounting for 1500ms, 75% of 2000ms total
// flat flat% sum% cum cum%
// 800ms 40.0% 40.0% 800ms 40.0% encoding/json.(*Decoder).readValue
// 400ms 20.0% 60.0% 400ms 20.0% runtime.mallocgc
// 300ms 15.0% 75.0% 700ms 35.0% reflect.Value.Set
Memory Profiling During Benchmarks
Analyze memory allocations:
# Generate memory profile
go test -bench=BenchmarkMemory -memprofile=mem.prof -benchmem
# Analyze allocations
go tool pprof mem.prof
# In pprof:
# top10 -cum - Top 10 by cumulative allocations
# sample_index=alloc_objects - Number of allocations
# sample_index=alloc_space - Total bytes allocated
# sample_index=inuse_objects - Currently live objects
# sample_index=inuse_space - Currently live bytes
Example:
func BenchmarkStringBuilding(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
var sb strings.Builder
for j := 0; j < 100; j++ {
sb.WriteString(fmt.Sprintf("Item %d\n", j))
}
_ = sb.String()
}
}
// go test -bench=BenchmarkStringBuilding -memprofile=mem.prof -benchmem
// go tool pprof mem.prof
//
// (pprof) sample_index=alloc_space
// Showing nodes accounting for 5120MB, 100% of 5120MB total
// flat flat% sum% cum cum%
// 2560MB 50% 50% 2560MB 50% fmt.Sprintf
// 1280MB 25% 75% 1280MB 25% strings.(*Builder).WriteString
benchstat: Statistical Comparison Workflow
Statistical testing with benchstat:
# Run baseline (several samples, so benchstat can do statistics)
go test -bench=. -benchmem -count=10 > old.txt
# Modify code
# Run new version
go test -bench=. -benchmem -count=10 > new.txt
# Compare with statistical rigor
benchstat old.txt new.txt
Install benchstat:
go install golang.org/x/perf/cmd/benchstat@latest
Example output:
name old time/op new time/op delta
BenchmarkStringConcat-8 2.45 µs ± 5% 1.20 µs ± 3% -50.20% (p=0.000 n=10)
name old alloc/op new alloc/op delta
BenchmarkStringConcat-8 16.0 B ± 0% 0.0 B ± 0% -100.00% (p=0.000 n=10)
Understanding benchstat Output
- time/op: Time per operation, with coefficient of variation (±5%)
- delta: Percentage change (negative = improvement)
- p-value: Statistical significance (p=0.000 = highly significant)
- n=10: Number of measurements
p-values interpretation:
- p < 0.05: statistically significant difference
- p > 0.05: no statistically significant difference (likely noise)
Proper Statistical Workflow
# Run minimum 10 times for statistical significance
go test -bench=. -benchmem -count=10 > baseline.txt
# Modify code
go test -bench=. -benchmem -count=10 > optimized.txt
benchstat -delta-test=utest baseline.txt optimized.txt
Options:
- -delta-test=utest: Mann-Whitney U test (the default; nonparametric, robust to outliers)
- -delta-test=ttest: two-sample Welch's t-test (assumes roughly normal samples)
- -delta-test=none: just show percentages (quick exploration)
Fuzz Testing for Performance: Finding Worst-Case Inputs
Use fuzzing to find inputs that trigger slow paths:
func FuzzJsonParse(f *testing.F) {
f.Add([]byte(`{"name":"John"}`))
f.Add([]byte(`[]`))
f.Add([]byte(`null`))
f.Fuzz(func(t *testing.T, input []byte) {
var result interface{}
json.Unmarshal(input, &result)
})
}
// Run with fuzzing to find worst-case inputs:
// go test -fuzz=FuzzJsonParse -fuzztime=30s
//
// Might discover that deeply nested JSON is slow:
// {"a":{"b":{"c":{"d":{"e":{"f":{"g":...}}}}}}}
Then benchmark with discovered worst-case:
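generateDeepNesting is not shown in the original; a possible sketch of the helper the worst-case benchmark relies on:

```go
import "strings"

// generateDeepNesting builds a JSON document nested depth levels deep,
// e.g. depth 3 produces {"a":{"a":{"a":1}}}. Hypothetical helper.
func generateDeepNesting(depth int) []byte {
	var sb strings.Builder
	for i := 0; i < depth; i++ {
		sb.WriteString(`{"a":`)
	}
	sb.WriteString("1")
	sb.WriteString(strings.Repeat("}", depth))
	return []byte(sb.String())
}
```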
func BenchmarkJsonWorstCase(b *testing.B) {
// Deeply nested JSON (slow path)
jsonData := generateDeepNesting(100)
b.ResetTimer()
for i := 0; i < b.N; i++ {
var result interface{}
json.Unmarshal(jsonData, &result)
}
}
Continuous Benchmarking in CI
Using gobenchdata
Track performance over time:
go install go.bobheadxi.dev/gobenchdata@latest
# Pipe benchmark output in to record results as JSON
# (exact flags vary by version - check the project README)
go test -bench=. -benchmem ./... | gobenchdata --json bench.json
# In CI, evaluate the new results against configured checks
gobenchdata checks eval base.json bench.json
Using benchdiff
Quick CI comparison - benchdiff checks out a base git ref, benchmarks both versions, and summarizes the delta with benchstat:
go install github.com/willabides/benchdiff/cmd/benchdiff@latest
# From the working tree containing your changes
# (exact flags vary by version - check the project README)
benchdiff --base-ref=main
Real-World Example: Profiling and Optimizing a JSON Parser
Step-by-step performance investigation:
// Step 1: Write benchmark
func BenchmarkParseJSON(b *testing.B) {
data := []byte(`{"users":[{"id":1,"name":"John"},{"id":2,"name":"Jane"}]}`)
b.ResetTimer()
for i := 0; i < b.N; i++ {
var result interface{}
json.Unmarshal(data, &result)
}
}
// go test -bench=. -benchmem
// BenchmarkParseJSON-8 300000 4567 ns/op 1248 B/op 25 allocs/op
// Step 2: Profile
// go test -bench=. -cpuprofile=cpu.prof -memprofile=mem.prof
// go tool pprof cpu.prof
//
// (pprof) top
// 1500ms 50% encoding/json.(*Decoder).readValue
// 1000ms 33% runtime.mallocgc
// 500ms 17% reflect.Value.Set
// Problem identified: Excessive allocations in reflect operations
// Step 3: Optimize - use custom unmarshaler
type User struct {
ID int
Name string
}
type Response struct {
Users []User
}
func BenchmarkParseJSONOptimized(b *testing.B) {
data := []byte(`{"users":[{"id":1,"name":"John"},{"id":2,"name":"Jane"}]}`)
b.ResetTimer()
for i := 0; i < b.N; i++ {
var result Response
json.Unmarshal(data, &result)
}
}
// go test -bench=. -benchmem
// BenchmarkParseJSONOptimized-8 600000 1890 ns/op 512 B/op 8 allocs/op
// ~59% faster, ~59% less memory, and 68% fewer allocations!
// Step 4: Further optimize - decode directly
func BenchmarkParseJSONDirect(b *testing.B) {
data := []byte(`{"users":[{"id":1,"name":"John"},{"id":2,"name":"Jane"}]}`)
b.ResetTimer()
for i := 0; i < b.N; i++ {
decoder := json.NewDecoder(bytes.NewReader(data))
var result Response
decoder.Decode(&result)
}
}
// BenchmarkParseJSONDirect-8 600000 1950 ns/op 896 B/op 10 allocs/op
// Similar performance, but decoding is more incremental for streams
Common Benchmarking Pitfalls
Pitfall 1: Loop Too Small, N Grows Wildly
// WRONG: N will be huge (compiler optimizes away)
func BenchmarkTinyOp(b *testing.B) {
for i := 0; i < b.N; i++ {
_ = 1 + 1 // Compiler might remove
}
}
// RIGHT: Each iteration takes measurable time
func BenchmarkReasonableOp(b *testing.B) {
for i := 0; i < b.N; i++ {
sum := 0
for j := 0; j < 1000; j++ {
sum += j
}
Sink = sum
}
}
Pitfall 2: Measuring Setup Cost
// WRONG: Includes allocation time
func BenchmarkBadSetup(b *testing.B) {
for i := 0; i < b.N; i++ {
data := make([]byte, 10_000_000)
processData(data)
}
}
// RIGHT: Setup outside timer
func BenchmarkGoodSetup(b *testing.B) {
data := make([]byte, 10_000_000)
b.ResetTimer()
for i := 0; i < b.N; i++ {
processData(data)
}
}
Pitfall 3: Not Accounting for GC
// WRONG: GC pause during benchmark distorts results
func BenchmarkWithoutGCControl(b *testing.B) {
for i := 0; i < b.N; i++ {
createManySmallObjects()
}
}
// RIGHT: Disable GC for deterministic results (if measuring computation only)
func BenchmarkDisableGC(b *testing.B) {
debug.SetGCPercent(-1)
defer debug.SetGCPercent(100)
b.ResetTimer()
for i := 0; i < b.N; i++ {
createManySmallObjects()
}
}
// Or: Run multiple times to amortize GC cost
// go test -bench=. -count=5
Pitfall 4: Benchmark Flakiness from External Factors
// WRONG: Sensitive to system load
func BenchmarkNetworkCall(b *testing.B) {
for i := 0; i < b.N; i++ {
resp, _ := http.Get("https://example.com")
resp.Body.Close()
}
}
// RIGHT: Mock network for deterministic results
func BenchmarkNetworkCallMocked(b *testing.B) {
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
w.Write([]byte(`{"status":"ok"}`))
}))
defer server.Close()
b.ResetTimer()
for i := 0; i < b.N; i++ {
resp, _ := http.Get(server.URL)
resp.Body.Close()
}
}
Advanced: Benchmarking Memory and GC Impact
func BenchmarkGCImpact(b *testing.B) {
b.Run("NoGC", func(b *testing.B) {
debug.SetGCPercent(-1)
defer debug.SetGCPercent(100)
b.ReportAllocs()
for i := 0; i < b.N; i++ {
createManyObjects()
}
})
b.Run("WithGC", func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
createManyObjects()
}
})
b.Run("GCPause", func(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
createManyObjects()
if i%10 == 0 {
runtime.GC() // Force GC to measure pause time
}
}
})
}
Benchmark Flags Reference
go test -bench=pattern # Run benchmarks matching pattern
go test -benchmem # Report allocations
go test -benchtime=10s # Change duration (default 1s)
go test -count=5 # Run each benchmark N times
go test -cpu=1,2,4 # Test with different GOMAXPROCS
go test -cpuprofile=cpu.prof # Capture CPU profile
go test -memprofile=mem.prof # Capture memory profile
go test -trace=trace.out # Capture execution trace
Key Takeaways
- Always use sink variables to prevent dead code elimination
- Understand timer control: ResetTimer() after setup; avoid StopTimer/StartTimer in loops
- Report allocations with -benchmem or b.ReportAllocs()
- Use b.Run() for sub-benchmarks to organize related tests
- Run multiple times: go test -count=10 for statistical significance
- Apply benchstat for proper comparison: benchstat old.txt new.txt
- Profile during benchmarks: use -cpuprofile and -memprofile to find bottlenecks
- Control environment: disable frequency scaling, set CPU affinity for reproducible results
- Mock external dependencies: don't benchmark network calls or I/O directly
- Benchmark realistic workloads: small operations are hard to measure accurately
Benchmarking is an art and science. Correct technique, careful interpretation, and statistical rigor lead to actionable performance insights.