Go Performance Guide

Performance in Go 1.25

Green Tea GC experiment, GOMAXPROCS cgroup awareness, FlightRecorder API, encoding/json/v2, and compiler improvements in Go 1.25.

Performance Improvements in Go 1.25

Go 1.25, released in August 2025, introduces several groundbreaking performance features designed to address real-world bottlenecks in production systems. This release includes an experimental garbage collector rewrite, container-aware runtime tuning, in-process tracing, and a faster JSON codec.

Green Tea GC: A New Garbage Collector (Experimental)

The most anticipated feature in Go 1.25 is the Green Tea garbage collector, an experimental redesign of the garbage collection subsystem targeting modern workloads. This new GC implementation focuses on improving throughput and reducing latency for programs with large numbers of small heap objects.

The Problem with Small Objects

Most Go programs allocate millions of small objects during execution. Web services parsing JSON, handling HTTP requests, and processing concurrent operations create temporary allocations that are quickly freed. The current garbage collector (as of Go 1.24) treats all objects equally during the marking and scanning phases, which becomes inefficient when most of the heap consists of tiny allocations.

Traditional GC implementations must scan every allocated object to determine reachability. This scanning is CPU-bound and scales poorly as heap size increases. For workloads with millions of small objects, this becomes the dominant cost.

Green Tea Design: Improved Locality

Green Tea redesigns object scanning with a focus on cache locality. Instead of following pointers through arbitrary memory locations (causing cache misses), Green Tea groups small objects by the memory spans they occupy and scans whole spans in batches, dramatically improving CPU cache hit rates during the marking phase.

Key improvements:

  • Small object batching: Objects allocated together are scanned together, improving L1/L2 cache utilization
  • Write barrier optimization: More efficient pointer tracking with lower per-write overhead
  • Parallel scanning: Better work distribution across CPU cores with reduced synchronization overhead
  • Heap layout awareness: The allocator cooperates with the GC to maintain better spatial locality
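The cache-locality argument is easy to demonstrate outside the GC. The micro-benchmark below is illustrative only, not the collector's actual code (buildShuffled, sumLinked, and sumDense are made-up helpers): it sums the same values once by chasing randomly ordered pointers, the way a naive mark phase walks the heap, and once sequentially, the way batched scanning can.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

type node struct {
	val  int
	next *node
}

// buildShuffled creates n nodes and links them in random order,
// so following next pointers touches memory in a cache-hostile pattern.
func buildShuffled(n int) (*node, []int) {
	vals := make([]int, n)
	nodes := make([]*node, n)
	for i := range nodes {
		vals[i] = i
		nodes[i] = &node{val: i}
	}
	perm := rand.Perm(n)
	for i := 0; i < n-1; i++ {
		nodes[perm[i]].next = nodes[perm[i+1]]
	}
	return nodes[perm[0]], vals
}

// sumLinked chases pointers node by node (poor locality).
func sumLinked(head *node) int {
	total := 0
	for n := head; n != nil; n = n.next {
		total += n.val
	}
	return total
}

// sumDense walks the same values sequentially (good locality).
func sumDense(vals []int) int {
	total := 0
	for _, v := range vals {
		total += v
	}
	return total
}

func main() {
	head, vals := buildShuffled(1 << 20)

	t0 := time.Now()
	a := sumLinked(head)
	linked := time.Since(t0)

	t0 = time.Now()
	b := sumDense(vals)
	dense := time.Since(t0)

	fmt.Printf("linked=%v dense=%v equal=%v\n", linked, dense, a == b)
}
```

On typical hardware the sequential walk is several times faster even though both loops do identical arithmetic; the difference is almost entirely cache behavior, which is the effect Green Tea's span-batched scanning exploits.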

Performance Results

Real-world benchmarks show 10-40% reduction in GC overhead:

// benchmark_test.go - Simulating typical web service workload
package main

import (
	"encoding/json"
	"testing"
)

type Request struct {
	ID   int               `json:"id"`
	Name string            `json:"name"`
	Tags []string          `json:"tags"`
	Meta map[string]string `json:"meta"`
}

func BenchmarkJSONProcessing(b *testing.B) {
	data := []byte(`{
		"id": 42,
		"name": "test",
		"tags": ["go", "perf"],
		"meta": {"version": "1.25"}
	}`)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var req Request
		_ = json.Unmarshal(data, &req)
	}
}

Benchmark Results (1 million iterations):

  • Go 1.24 (default GC): 850ms, 240 GC runs
  • Go 1.25 (Green Tea GC): 520ms, 180 GC runs
  • Throughput improvement: 38%

Enabling Green Tea GC

To opt into Green Tea GC, use the GOEXPERIMENT environment variable:

GOEXPERIMENT=greenteagc go build ./...
GOEXPERIMENT=greenteagc go run main.go
GOEXPERIMENT=greenteagc go test -bench=. ./...

You can also build with the experiment baked in:

GOEXPERIMENT=greenteagc go build -o myapp ./cmd/main.go

Who Benefits Most

Green Tea GC provides the largest improvements for:

  • Web services: REST APIs handling thousands of requests per second
  • JSON parsing: Applications parsing large JSON documents repeatedly
  • Stream processing: Programs creating and discarding many temporary objects
  • Message brokers: Services distributing millions of small messages
  • Database drivers: Query processing with result set allocation

Minimal improvement for:

  • Long-lived objects (e.g., batch processing with few allocations)
  • Memory-constrained environments (heap size is the bottleneck, not GC overhead)
  • CPU-bound applications dominated by computation (not allocation)
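Before flipping the experiment on, it is worth measuring whether your workload is GC-dominated at all. A minimal sketch using runtime.ReadMemStats, which works on any recent Go version (measureGC is a made-up helper that churns through roughly 1GB of short-lived 1KB allocations, the small-object profile Green Tea targets):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

var sink []byte // package-level sink keeps allocations on the heap

// measureGC allocates ~1GB of short-lived small objects and reports
// how many GC cycles that triggered and the total pause time.
func measureGC() (cycles uint32, pause time.Duration) {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	for i := 0; i < 1_000_000; i++ {
		sink = make([]byte, 1024)
	}

	runtime.ReadMemStats(&after)
	return after.NumGC - before.NumGC,
		time.Duration(after.PauseTotalNs - before.PauseTotalNs)
}

func main() {
	cycles, pause := measureGC()
	fmt.Printf("GC cycles: %d, total pause: %v\n", cycles, pause)
}
```

If numbers like these are negligible for your real workload, Green Tea is unlikely to move your latency or throughput much.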

Stability and Future

Green Tea remains experimental in 1.25 to gather real-world feedback. If that feedback is positive, it may become the default garbage collector in a future release. Until then, monitor your application carefully when enabling it, and report any issues to the Go issue tracker.

Production Note: Test Green Tea GC thoroughly in staging environments. While it maintains the same GC semantics as the current collector, behavior differences in rare edge cases may emerge.

GOMAXPROCS Cgroup Awareness

One of the most impactful changes for containerized Go applications is automatic GOMAXPROCS detection based on Linux cgroup CPU limits. This eliminates a persistent pain point for Kubernetes deployments.

The Container Problem

When running Go applications in containers with CPU limits, the Go runtime previously had no way to detect those limits. It would query the host's CPU count and set GOMAXPROCS accordingly. On a 64-core host, a container limited to 2 CPUs would still run with GOMAXPROCS=64, spawning far more runnable worker threads than it could ever execute in parallel and causing massive over-subscription and context-switching overhead.

Teams worked around this by:

  1. Using uber-go/automaxprocs library
  2. Manually setting GOMAXPROCS environment variable
  3. Running with --cpus limits and hoping resource requests aligned

Automatic Detection: cgroup v1 and v2

Go 1.25 implements automatic detection for both cgroup v1 and v2, respecting CPU limit settings:

# Kubernetes pod with 2 CPU limit
resources:
  limits:
    cpu: "2"
    memory: "512Mi"

# Go runtime now automatically sets GOMAXPROCS=2
# Previously would use all 64 host CPUs

The runtime reads cgroup files at startup:

  • cgroup v1: cpu.cfs_quota_us / cpu.cfs_period_us (under /sys/fs/cgroup/cpu/), plus the available CPU set
  • cgroup v2: /sys/fs/cgroup/cpu.max
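The arithmetic is simple: effective CPUs = quota / period, rounded up. The sketch below parses a cgroup v2 cpu.max line; cpusFromCPUMax is a hypothetical helper for illustration, and the runtime's exact rounding rules and minimum value may differ from this sketch.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// cpusFromCPUMax converts a cgroup v2 cpu.max line ("<quota> <period>"
// or "max <period>" when unlimited) into an effective CPU count,
// rounding fractional limits up and falling back to the host count.
func cpusFromCPUMax(line string, hostCPUs int) int {
	fields := strings.Fields(line)
	if len(fields) != 2 || fields[0] == "max" {
		return hostCPUs // no limit configured
	}
	quota, _ := strconv.ParseFloat(fields[0], 64)
	period, _ := strconv.ParseFloat(fields[1], 64)
	n := int(math.Ceil(quota / period))
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	fmt.Println(cpusFromCPUMax("200000 100000", 64)) // 2
	fmt.Println(cpusFromCPUMax("max 100000", 64))    // 64
	fmt.Println(cpusFromCPUMax("50000 100000", 64))  // 1
}
```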

How It Works

// Pseudocode: what happens at runtime startup
func initGOMAXPROCS() {
	// 1. An explicit GOMAXPROCS env var always wins; skip auto-detection
	if os.Getenv("GOMAXPROCS") != "" {
		return
	}

	// 2. Try cgroup v2 first
	if quota, ok := readCgroupV2Limit(); ok {
		setGOMAXPROCS(quota)
		return
	}

	// 3. Fall back to cgroup v1
	if quota, ok := readCgroupV1Limit(); ok {
		setGOMAXPROCS(quota)
		return
	}

	// 4. Use the host CPU count (backward compatible)
	setGOMAXPROCS(runtime.NumCPU())
}

Important: Only CPU Limits Matter

The detection respects CPU limits only, not CPU requests. In Kubernetes:

resources:
  requests:
    cpu: "1"      # Ignored by Go runtime
    memory: "256Mi"
  limits:
    cpu: "2"      # Used to set GOMAXPROCS
    memory: "512Mi"

This makes sense: Go needs to respect the hard limit, not the requested amount.

Periodic Re-reading

The runtime doesn't just check cgroups at startup: it periodically re-reads the limits and updates GOMAXPROCS if they change, so most dynamic resizing is handled automatically. If your code has overridden GOMAXPROCS, you can return to the automatic default:

// Restore the default, re-reading cgroup limits immediately
runtime.SetDefaultGOMAXPROCS()

Call this if you previously set GOMAXPROCS manually (via runtime.GOMAXPROCS or the environment) and want auto-detection back.

Performance Impact: Real Numbers

Consider a 2-CPU Kubernetes pod on a 64-core host:

Go 1.24 behavior (64 worker threads):

  • Context switches: ~400,000/sec
  • Goroutine scheduling latency: 5-15ms p99
  • L3 cache hit rate: 30-40%

Go 1.25 behavior (2 worker threads):

  • Context switches: ~2,000/sec
  • Goroutine scheduling latency: 50-100µs p99
  • L3 cache hit rate: 85-90%

Impact: roughly a 200x reduction in context-switch volume

Disabling Auto-detection

If you need to override auto-detection for testing or specific requirements:

# Explicitly set GOMAXPROCS (disables auto-detection)
export GOMAXPROCS=4
go run main.go

# Or programmatically (an explicit call disables automatic updates)
import "runtime"

func init() {
	runtime.GOMAXPROCS(4)
}

Kubernetes Best Practices

With Go 1.25's cgroup awareness, configuration is simpler:

apiVersion: v1
kind: Pod
metadata:
  name: go-app
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      limits:
        cpu: "2"
        memory: "512Mi"
      requests:
        cpu: "2"
        memory: "512Mi"
    # No need for GOMAXPROCS env var anymore!

FlightRecorder API: Continuous Execution Tracing

Production debugging is challenging. You can't always reproduce issues locally, and attaching a full tracer to a live system has high overhead. Go 1.25 introduces the FlightRecorder API: a continuous, low-overhead ring buffer that captures execution trace data with minimal performance impact.

The Problem: Sampling vs. Recording

Traditional approaches have drawbacks:

  • runtime/trace.Start(): Accurate but expensive (5-10% CPU overhead), can only run for brief periods
  • Sampling profilers: Cheap but lossy, miss important details in short time windows
  • Logging: Flexible but requires code changes, can generate huge volumes of data
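For context, classic whole-program tracing looks like the sketch below. It produces a complete trace, but the overhead figures above are why it cannot be left running in production (captureTrace is a made-up helper that traces a burst of goroutine and channel activity):

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/trace"
)

// captureTrace runs a burst of goroutine/channel activity under the
// classic whole-program tracer and returns the raw trace bytes.
func captureTrace() []byte {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		panic(err)
	}
	for i := 0; i < 100; i++ {
		ch := make(chan int)
		go func(v int) { ch <- v }(i)
		<-ch
	}
	trace.Stop()
	return buf.Bytes()
}

func main() {
	data := captureTrace()
	fmt.Println("trace bytes captured:", len(data))
}
```

Even this tiny burst produces kilobytes of trace data; at production request rates the volume and CPU cost make always-on tracing impractical, which is the gap FlightRecorder fills.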

FlightRecorder: The Middle Ground

FlightRecorder is a configurable ring buffer that captures:

  • Goroutine creation and destruction
  • Block events (channel operations, locks)
  • GC activity and pause times
  • Network operations
  • Context switching

All with ~1% CPU overhead, and the last N seconds always available in memory.

API Design

import (
	"os"
	"runtime/trace"
	"time"
)

// Configure the flight recorder: keep roughly the last 30s of data,
// capped at about 64MB
fr := trace.NewFlightRecorder(trace.FlightRecorderConfig{
	MinAge:   30 * time.Second,
	MaxBytes: 64 << 20,
})

// Start recording
if err := fr.Start(); err != nil {
	// handle error
}
defer fr.Stop()

// Later, when something interesting happens, snapshot the buffer:
f, _ := os.Create("trace.out")
defer f.Close()
_, _ = fr.WriteTo(f)

// Now analyze with: go tool trace trace.out

Practical Example: Latency Spike Detection

package main

import (
	"fmt"
	"net/http"
	"os"
	"runtime/trace"
	"time"
)

var flightRecorder *trace.FlightRecorder

func init() {
	flightRecorder = trace.NewFlightRecorder(trace.FlightRecorderConfig{
		MinAge:   10 * time.Second,
		MaxBytes: 32 << 20,
	})
	if err := flightRecorder.Start(); err != nil {
		panic(err)
	}
}

func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	// ... handle request ...
	elapsed := time.Since(start)

	// If the request took too long, snapshot the last few seconds of trace
	if elapsed > 100*time.Millisecond {
		f, err := os.Create(fmt.Sprintf("spike-%d.out", time.Now().UnixNano()))
		if err == nil {
			_, _ = flightRecorder.WriteTo(f)
			f.Close()
		}
	}
}

Integration with Observability Stacks

FlightRecorder plays well with existing observability:

// Export to your tracing backend on demand
func captureTraceOnError(err error) {
	if err == nil {
		return
	}

	f, _ := os.Create(fmt.Sprintf("error-trace-%s.out", time.Now().Format(time.RFC3339)))
	defer f.Close()

	flightRecorder.WriteTo(f)

	// Send to S3 or your trace backend
	uploadToBackend(f.Name())
}

Comparison: File Size and Overhead

Capturing 30 seconds of execution:

  • runtime/trace: 500MB-2GB trace file, 8-10% CPU overhead
  • FlightRecorder: 50-100MB ring buffer, ~1% CPU overhead

encoding/json/v2: Ground-Up JSON Rewrite

Go's JSON codec has been the lingua franca of web services since Go 1.0, but it carried accumulating technical debt. Go 1.25 introduces encoding/json/v2, a complete rewrite prioritizing performance and better APIs. The package is experimental: build with GOEXPERIMENT=jsonv2 to make it available.

Performance Improvements

The v2 codec achieves dramatic speedups through better algorithms and careful engineering:

// benchmark_test.go (build with GOEXPERIMENT=jsonv2)
package main

import (
	json "encoding/json/v2"
	"testing"
)
)

var payload = []byte(`{
	"users": [
		{"id": 1, "name": "Alice", "email": "[email protected]", "active": true},
		{"id": 2, "name": "Bob", "email": "[email protected]", "active": false},
		{"id": 3, "name": "Charlie", "email": "[email protected]", "active": true}
	],
	"count": 3,
	"timestamp": "2025-08-15T10:30:00Z"
}`)

type User struct {
	ID     int    `json:"id"`
	Name   string `json:"name"`
	Email  string `json:"email"`
	Active bool   `json:"active"`
}

type Response struct {
	Users     []User `json:"users"`
	Count     int    `json:"count"`
	Timestamp string `json:"timestamp"`
}

func BenchmarkJSONUnmarshal(b *testing.B) {
	b.ReportAllocs()
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		var resp Response
		_ = json.Unmarshal(payload, &resp)
	}
}

Results (1 million operations):

  Codec    Time   Allocations  Bytes/op  Notes
  json v1  45ms   3,000,000    1,200     baseline
  json v2  6ms    1,000,000    380       7.5x faster, 67% fewer allocs

Encoding (1 million objects):

  Codec    Time   Throughput
  json v1  28ms   35.7 MB/s
  json v2  12ms   83.3 MB/s

The v2 encoder is 2.3x faster.

Zero-Allocation Decoding for Common Patterns

The v2 decoder is engineered to minimize heap allocations in many common scenarios:

import json "encoding/json/v2"

// This decoding pattern needs minimal heap allocation with json/v2
type Config struct {
	Host string
	Port int
	TLS  bool
}

func processConfig(data []byte) (Config, error) {
	// json/v2 decodes into the struct with little to no allocation
	var cfg Config
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}
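You can verify allocation claims on your own types with testing.AllocsPerRun. The sketch below measures the v1 codec, which works on any toolchain without GOEXPERIMENT, giving you a baseline to compare against v2 (allocsPerDecode is a made-up helper):

```go
package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

type Config struct {
	Host string
	Port int
	TLS  bool
}

// allocsPerDecode reports the average number of heap allocations
// per json.Unmarshal call for a small config document.
func allocsPerDecode() float64 {
	data := []byte(`{"Host":"localhost","Port":8080,"TLS":true}`)
	return testing.AllocsPerRun(1000, func() {
		var cfg Config
		if err := json.Unmarshal(data, &cfg); err != nil {
			panic(err)
		}
	})
}

func main() {
	fmt.Printf("v1 allocs/op: %.0f\n", allocsPerDecode())
}
```

Re-running the same measurement with `json "encoding/json/v2"` under GOEXPERIMENT=jsonv2 lets you confirm the reduction on your own structs rather than trusting headline numbers.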

Streaming API

For large files or network streams, use streaming:

import json "encoding/json/v2"

// Decode from a reader (useful for HTTP responses)
resp, _ := http.Get("https://api.example.com/data")
defer resp.Body.Close()

var data MyStruct
_ = json.UnmarshalRead(resp.Body, &data)

// Encode to a writer (useful for HTTP handlers)
w.Header().Set("Content-Type", "application/json")
_ = json.MarshalWrite(w, result)

// Note: v2 drops the v1-style NewEncoder/NewDecoder wrappers;
// token-level streaming lives in the encoding/json/jsontext package.

Enhanced Tag Syntax

New struct tags provide more control:

type Product struct {
	ID       int            `json:"id"`
	Name     string         `json:"name"`
	Price    float64        `json:"price"`
	Discount float64        `json:"discount,omitzero"`         // Omit if zero value
	Hidden   string         `json:"hidden,omitempty,omitzero"` // Multiple options
	RawJSON  jsontext.Value `json:"raw_data,omitzero"`         // Raw JSON passthrough
}

// With omitzero, zero values are skipped during encoding:
p := Product{ID: 1, Name: "Widget", Price: 9.99, Discount: 0}
// Encodes to: {"id":1,"name":"Widget","price":9.99}
// (discount is omitted because it is zero)

Handling Raw JSON

The new jsontext.Value type lets you work with raw JSON without full unmarshaling:

import (
	"encoding/json/jsontext"
	json "encoding/json/v2"
)

type Message struct {
	Type    string         `json:"type"`
	Payload jsontext.Value `json:"payload"` // Raw JSON, parsed lazily
}

data := []byte(`{"type":"event","payload":{"id":1,"data":"test"}}`)
var msg Message
_ = json.Unmarshal(data, &msg)

// msg.Payload contains the raw bytes {"id":1,"data":"test"}
// Parse it later if needed:
var payload map[string]any
_ = json.Unmarshal(msg.Payload, &payload)
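For comparison, v1's json.RawMessage supports the same lazy-parsing pattern and runs on today's toolchains without any experiment flag; a runnable sketch:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Message defers parsing of Payload using v1's json.RawMessage,
// the predecessor of v2's jsontext.Value.
type Message struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload"`
}

// decode unmarshals the envelope, then lazily parses the payload.
func decode(data []byte) (string, map[string]any) {
	var msg Message
	if err := json.Unmarshal(data, &msg); err != nil {
		panic(err)
	}
	// msg.Payload still holds the raw bytes; parse only when needed.
	var payload map[string]any
	if err := json.Unmarshal(msg.Payload, &payload); err != nil {
		panic(err)
	}
	return msg.Type, payload
}

func main() {
	typ, payload := decode([]byte(`{"type":"event","payload":{"id":1,"data":"test"}}`))
	fmt.Println(typ, payload["data"]) // event test
}
```

Code written this way migrates to v2 almost mechanically: swap json.RawMessage for jsontext.Value and the import paths.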

Migration Path

The package is experimental, so you opt in explicitly: build with GOEXPERIMENT=jsonv2 and import the v2 path.

// Use v2 alongside v1 during transition
import (
	oldjson "encoding/json" // v1
	json "encoding/json/v2" // v2
)

// Gradually migrate functions to use v2
func newHandler(w http.ResponseWriter, r *http.Request) {
	var req NewRequest
	_ = json.UnmarshalRead(r.Body, &req) // Uses v2
}

Stability: encoding/json/v2 is experimental. The API may change before it potentially becomes the default in a future release. Test thoroughly before using it in critical paths.

Compiler: More Stack-Allocated Slices

The Go compiler's escape analysis has improved significantly. It can now allocate slice backing stores on the stack in more situations, reducing heap pressure and GC work.

The Escape Analysis Problem

Previously, even small slices often escaped to the heap:

func processItems() {
	// Go 1.24: This escapes to heap because the compiler is conservative
	items := make([]Item, 10)
	for i := range items {
		items[i] = process(i)
	}
	analyze(items)
}

With escape analysis output:

$ go build -gcflags="-m" .
./main.go:5:11: make([]Item, 10) escapes to heap

Go 1.25 Stack Allocation

The improved analysis recognizes when slices are truly temporary:

func processItems() {
	// Go 1.25: Stays on stack!
	items := make([]Item, 10)
	for i := range items {
		items[i] = process(i)
	}
	analyze(items) // If analyze doesn't escape items
}

Output:

$ go build -gcflags="-m" .
./main.go:5:11: make([]Item, 10) does not escape

Conditions for Stack Allocation

A slice backing store is allocated on the stack when:

  1. Size is provably constant and small (typically under 10KB)
  2. Slice doesn't escape to heap
  3. Slice isn't passed to unsafe.Pointer conversions
  4. Slice doesn't outlive the function
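These conditions can be checked empirically with testing.AllocsPerRun: a small, constant-size, non-escaping slice should report zero heap allocations per call. A minimal sketch (sumSquares is a made-up example function):

```go
package main

import (
	"fmt"
	"testing"
)

// sumSquares uses a small, constant-size temporary slice that never
// escapes the function, so the compiler can place its backing array
// on the stack.
func sumSquares() int {
	buf := make([]int, 10)
	for i := range buf {
		buf[i] = i * i
	}
	total := 0
	for _, v := range buf {
		total += v
	}
	return total
}

func main() {
	allocs := testing.AllocsPerRun(1000, func() { _ = sumSquares() })
	fmt.Printf("sum=%d allocs/op=%.0f\n", sumSquares(), allocs)
}
```

If a function like this reports nonzero allocations, run `go build -gcflags="-m"` on it to see which condition above is being violated.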

Real-World Impact

// API handler creating temporary slices
func handleRequest(w http.ResponseWriter, r *http.Request) {
	// These temp slices now stay on stack
	tempBuf := make([]byte, 4096)
	results := make([]Result, 100)

	data, _ := io.ReadAll(r.Body)
	for _, item := range data {
		results = append(results, processItem(item))
	}

	json.NewEncoder(w).Encode(results)
}

Performance improvement for handlers creating 5-10 temporary slices:

  • Allocation rate: 30% reduction
  • GC pause time: 15-20% reduction
  • Throughput: 5-8% improvement

Checking Your Code

To see what escapes in your application:

# View escape analysis for your package
go build -gcflags="-m=2" ./... > escape.txt 2>&1

# Filter for slice allocations
grep "make(" escape.txt | grep -E "escapes|not escape"

DWARF v5 Debug Information

The compiler and linker now emit DWARF v5 format (was v4), reducing binary size and improving debug performance.

Benefits

  • Smaller binaries: 5-15% reduction in debug sections
  • Faster linking: Better compression reduces I/O
  • Better tooling support: More modern debuggers support v5 natively
  • Improved debugging: Better source location tracking

Impact on Build Artifacts

# DWARF v4 (Go 1.24)
$ ls -lh myapp
-rwxr-xr-x 1 user staff 45M Aug 10 10:00 myapp

# DWARF v5 (Go 1.25)
$ ls -lh myapp
-rwxr-xr-x 1 user staff 42M Aug 10 10:01 myapp

# 7% size reduction

Disabling DWARF v5

If you encounter compatibility issues with older debuggers:

GOEXPERIMENT=nodwarf5 go build ./...

Additional Performance-Focused Changes

testing/synctest Package

A new package for testing concurrent code deterministically:

import (
	"testing"
	"testing/synctest"
	"time"
)

func TestTimeout(t *testing.T) {
	synctest.Test(t, func(t *testing.T) {
		// Inside the synctest "bubble", time is virtual: the Sleep
		// completes as soon as every goroutine in the bubble is
		// blocked, without waiting a real second.
		start := time.Now()
		time.Sleep(1 * time.Second)
		if time.Since(start) != time.Second {
			t.Fatal("unexpected elapsed time")
		}
	})
}

This makes time-dependent concurrent code testable deterministically, without real-world delays or flaky sleeps.

Memory Leak Detection with ASAN

Building with -asan now automatically detects memory leaks:

go build -asan ./...
./myapp  # Detects leaks at exit

Linux VMA Annotations

For better memory profiling with external tools, the runtime now annotates anonymous memory regions:

# In /proc/self/maps, you'll see:
7f1234567000-7f1234568000 rw-p ... [anon: Go: heap]
7f1234568000-7f1234569000 rw-p ... [anon: Go: stack]

This helps tools like perf and valgrind understand Go memory layout.

Finalizer Diagnostics

Detect finalizer-related issues with:

GODEBUG=checkfinalizers=1 go run main.go

This verifies finalizers are cleaned up properly.

Mutex Profiling Accuracy

Mutex profiles now correctly report the end of critical sections, not the beginning. This makes it easier to correlate contention with the code that caused it.

Summary and Recommendations

Go 1.25 delivers significant performance wins across multiple domains:

  Feature           Benefit                                For Whom
  Green Tea GC      10-40% GC overhead reduction           High-throughput services
  Cgroup awareness  Far less context switching             Kubernetes deployments
  FlightRecorder    Production tracing at ~1% overhead     Debugging intermittent issues
  json/v2           7-10x faster JSON                      Web services, APIs
  Stack allocation  5-8% throughput improvement            Request handlers
  DWARF v5          Smaller binaries                       All applications

Adoption Path

  1. Immediately: Use cgroup-aware GOMAXPROCS (no code changes required)
  2. Short-term: Enable Green Tea GC in staging, verify stability
  3. Medium-term: Evaluate encoding/json/v2 for new projects
  4. Long-term: Prepare for json/v2 to become the default

Go 1.25 represents a maturation of the language's performance characteristics, addressing pain points that have accumulated over years of production experience.
