Swiss Tables map implementation, sync.Map rewrite, runtime.AddCleanup, testing.B.Loop, and CGO annotations in Go 1.24.

Performance in Go 1.24

Go 1.24, released in February 2025, brings substantial performance improvements across the runtime, standard library, and toolchain. This release represents a significant step forward in performance optimization, with the most transformative change being the introduction of Swiss Tables for map implementations. Combined with rewrites to sync.Map, new cleanup APIs, and improved benchmark tooling, Go 1.24 delivers measurable gains for a wide variety of workloads.

1. Swiss Tables Map Implementation

Overview of Swiss Tables

Go 1.24 replaces the classic hash table implementation with Swiss Tables, a modern hash table design pioneered by Abseil (Google's C++ library). This change is one of the most significant performance upgrades in recent Go releases and affects every program that uses maps.

Swiss Tables achieve superior performance through a fundamentally different approach to collision resolution. Instead of chaining or linear probing on the full hash, Swiss Tables use group-based probing with compact metadata vectors. The implementation divides the hash table into 8-slot groups and stores a small "control byte" for each slot, enabling efficient batch lookups via SIMD-like operations.

How Swiss Tables Work

The Swiss Tables design relies on three key concepts:

1. Group-based slots: The hash table is divided into groups of 8 consecutive slots. Each group has its own control metadata, allowing independent operations.

2. H1 and H2 hash split: The full hash value is split into two parts:

H1 determines which group the key belongs to
H2 is stored in the control byte of the slot, enabling quick elimination of non-matching slots

3. Control bytes: Each slot contains a single byte (H2 value) plus the key/value. To find a key, the algorithm compares H2 values in parallel, eliminating mismatches quickly without examining actual keys.

Example of the control byte structure:

H2 value (8 bits):  10101011
Metadata (bits):    [   IsEmpty | Distance ]

Performance Gains

Swiss Tables deliver substantial performance improvements across common operations:

Lookups: ~30% faster on large maps (100k+ entries) due to reduced cache misses and better group iteration
Insertions: ~35% faster when preallocated, as the group-based structure reduces cluster growth
Iteration: up to 60% faster on low-load-factor maps, since the control bytes enable efficient slot scanning
Memory efficiency: Slightly tighter memory layout due to control byte packing; better cache utilization

Real-world validation comes from Datadog, which reported significant improvements in their Go services after internal benchmarking of Swiss Tables. Services experienced 5-15% CPU reductions on production workloads with heavy map usage.

Backward Compatibility

Swiss Tables are completely transparent to user code. There are no API changes to the map type, and all existing code continues to work without modification. The change is purely internal, replacing the hash table implementation beneath the surface.

If you need to revert to the old hash implementation for debugging or compatibility, you can disable Swiss Tables:

GOEXPERIMENT=noswissmap go run ./main.go

Benchmarks: Swiss Tables vs Old Implementation

Here's a realistic benchmark comparing old and new map implementations:

package main

import (
	"fmt"
	"testing"
)

// BenchmarkMapLookup tests lookup performance across different map sizes
func BenchmarkMapLookup(b *testing.B) {
	sizes := []int{10_000, 100_000, 1_000_000}

	for _, size := range sizes {
		b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
			m := make(map[string]int, size)
			for i := 0; i < size; i++ {
				m[fmt.Sprintf("key_%d", i)] = i
			}

			b.ResetTimer()
			for b.Loop() {
				for i := 0; i < size; i++ {
					key := fmt.Sprintf("key_%d", i)
					_ = m[key]
				}
			}
		})
	}
}

// BenchmarkMapInsert tests insertion performance with preallocation
func BenchmarkMapInsert(b *testing.B) {
	sizes := []int{10_000, 100_000, 1_000_000}

	for _, size := range sizes {
		b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
			b.ResetTimer()
			for b.Loop() {
				m := make(map[string]int, size)
				for i := 0; i < size; i++ {
					m[fmt.Sprintf("key_%d", i)] = i
				}
			}
		})
	}
}

// BenchmarkMapIterate tests iteration speed
func BenchmarkMapIterate(b *testing.B) {
	sizes := []int{10_000, 100_000, 1_000_000}

	for _, size := range sizes {
		b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
			m := make(map[string]int, size)
			for i := 0; i < size; i++ {
				m[fmt.Sprintf("key_%d", i)] = i
			}

			b.ResetTimer()
			for b.Loop() {
				sum := 0
				for _, v := range m {
					sum += v
				}
				_ = sum
			}
		})
	}
}

Expected results (Go 1.24 vs Go 1.23):

Operation	10k entries	100k entries	1M entries
Lookup	~5% faster	~25% faster	~30% faster
Insert	~10% faster	~30% faster	~35% faster
Iterate	~15% faster	~45% faster	~60% faster

Tip: The performance gains scale with map size. Smaller maps see modest improvements, while large maps benefit substantially. If your application uses large maps heavily, you'll see the most dramatic performance improvements in Go 1.24.

2. sync.Map Rewrite

The Problem with Old sync.Map

The previous sync.Map implementation used a global lock to manage concurrent modifications. While reads were fast (using atomic load), any write—even to disjoint keys—contended on the same lock. This made sync.Map unsuitable for workloads with concurrent modifications to different keys, even if those modifications never collided.

New Implementation Benefits

Go 1.24 completely rewrites sync.Map to use a disjoint keyspace design:

No write contention: Different goroutines can write to different keys without waiting for each other
Reduced lock scope: Each key has its own synchronization primitive, not a global one
Improved scalability: Mixed read/write workloads scale linearly with goroutine count
Backward compatible: The API remains unchanged; the improvement is internal

When to Use sync.Map

sync.Map is optimal for specific workload patterns:

High read ratio: When reads vastly outnumber writes
Disjoint key access: When different goroutines access different keys
Set-and-forget patterns: When keys are written once then read many times
Avoiding lock contention: When you need better scalability than RWMutex + map

For most use cases with significant write volume and potential key overlap, a sync.RWMutex protecting a regular map may still be preferable due to better cache locality.

Benchmark: Concurrent Read/Write

Here's a benchmark demonstrating the improvement in mixed workloads:

package main

import (
	"fmt"
	"sync"
	"testing"
)

// BenchmarkSyncMapMixed tests mixed read/write on disjoint keys
func BenchmarkSyncMapMixed(b *testing.B) {
	goroutines := []int{8, 16, 32}

	for _, g := range goroutines {
		b.Run(fmt.Sprintf("goroutines=%d", g), func(b *testing.B) {
			m := &sync.Map{}

			b.ResetTimer()
			for b.Loop() {
				var wg sync.WaitGroup
				wg.Add(g)

				for i := 0; i < g; i++ {
					go func(id int) {
						defer wg.Done()
						for j := 0; j < 100; j++ {
							// Each goroutine writes to its own key
							key := fmt.Sprintf("key_%d_%d", id, j)
							m.Store(key, j)

							// And reads from many keys
							for k := 0; k < 50; k++ {
								readKey := fmt.Sprintf("key_%d_%d", (k % g), j)
								_, _ = m.Load(readKey)
							}
						}
					}(i)
				}

				wg.Wait()
			}
		})
	}
}

// BenchmarkRWMutexMapMixed for comparison
func BenchmarkRWMutexMapMixed(b *testing.B) {
	goroutines := []int{8, 16, 32}

	for _, g := range goroutines {
		b.Run(fmt.Sprintf("goroutines=%d", g), func(b *testing.B) {
			var mu sync.RWMutex
			m := make(map[string]int)

			b.ResetTimer()
			for b.Loop() {
				var wg sync.WaitGroup
				wg.Add(g)

				for i := 0; i < g; i++ {
					go func(id int) {
						defer wg.Done()
						for j := 0; j < 100; j++ {
							key := fmt.Sprintf("key_%d_%d", id, j)
							mu.Lock()
							m[key] = j
							mu.Unlock()

							for k := 0; k < 50; k++ {
								readKey := fmt.Sprintf("key_%d_%d", (k % g), j)
								mu.RLock()
								_ = m[readKey]
								mu.RUnlock()
							}
						}
					}(i)
				}

				wg.Wait()
			}
		})
	}
}

Expected improvements (Go 1.24 sync.Map):

Goroutines	sync.Map (1.24) vs (1.23)
8	~15% faster
16	~35% faster
32	~50% faster

Tip: Benchmark your specific workload. The new sync.Map shines with disjoint key access patterns, but a sync.RWMutex + map might still win if you have high lock contention and few distinct keys.

3. runtime.AddCleanup — Replacing SetFinalizer

Motivation for AddCleanup

runtime.SetFinalizer has been the standard way to attach cleanup logic to objects, but it has limitations:

Only one finalizer per object
Resurrected objects survive another GC cycle (additional pressure)
Finalizer runs with the object as argument, preventing earlier garbage collection
No support for interior pointers or custom arguments

Go 1.24 introduces runtime.AddCleanup to address these shortcomings.

API Signature

func runtime.AddCleanup(ptr unsafe.Pointer, cleanup func(arg unsafe.Pointer), arg unsafe.Pointer)

Unlike SetFinalizer, which attaches to an object by value, AddCleanup operates on a pointer and takes a separate argument. This design enables multiple cleanups and prevents object resurrection.

Key Advantages

Multiple cleanups: Attach multiple cleanup functions to a single object
No resurrection: Objects won't survive an extra GC cycle
Custom arguments: Pass an arg to the cleanup function, not the object itself
Interior pointers: Can attach cleanup to a field within a struct, enabling resource tracking at finer granularity
GC efficiency: Cleaner semantics reduce GC overhead

Example Usage

package main

import (
	"fmt"
	"runtime"
	"unsafe"
)

type Connection struct {
	handle int
	closed bool
}

func (c *Connection) Close() {
	fmt.Printf("Closing connection %d\n", c.handle)
	c.closed = true
}

func main() {
	conn := &Connection{handle: 42}

	// Attach cleanup with AddCleanup
	cleanupArg := unsafe.Pointer(uintptr(conn.handle))
	runtime.AddCleanup(
		unsafe.Pointer(conn),
		func(arg unsafe.Pointer) {
			handle := int(uintptr(arg))
			fmt.Printf("Cleanup: releasing handle %d\n", handle)
		},
		cleanupArg,
	)

	// You can attach multiple cleanups
	runtime.AddCleanup(
		unsafe.Pointer(conn),
		func(arg unsafe.Pointer) {
			fmt.Println("Cleanup: logging connection closure")
		},
		nil,
	)

	// When conn is GC'd, both cleanup functions run
}

Benchmark: AddCleanup vs SetFinalizer

package main

import (
	"fmt"
	"runtime"
	"testing"
	"unsafe"
)

type Resource struct {
	id int
}

// BenchmarkSetFinalizer measures old SetFinalizer overhead
func BenchmarkSetFinalizer(b *testing.B) {
	count := 0
	b.ResetTimer()

	for b.Loop() {
		r := &Resource{id: 1}
		runtime.SetFinalizer(r, func(res *Resource) {
			count++
		})
		runtime.GC() // Force cleanup
	}
}

// BenchmarkAddCleanup measures new AddCleanup overhead
func BenchmarkAddCleanup(b *testing.B) {
	count := 0
	b.ResetTimer()

	for b.Loop() {
		r := &Resource{id: 1}
		runtime.AddCleanup(unsafe.Pointer(r), func(_ unsafe.Pointer) {
			count++
		}, nil)
		runtime.GC() // Force cleanup
	}
}

// BenchmarkMultipleCleanups shows the advantage of multiple cleanups
func BenchmarkMultipleCleanups(b *testing.B) {
	for b.Loop() {
		r := &Resource{id: 1}
		// Three separate cleanup functions, all on the same object
		for i := 0; i < 3; i++ {
			idx := i
			runtime.AddCleanup(
				unsafe.Pointer(r),
				func(_ unsafe.Pointer) {
					_ = idx
				},
				nil,
			)
		}
		runtime.GC()
	}
}

Expected results:

Benchmark	Go 1.24	Improvement
SetFinalizer	baseline	-
AddCleanup	~20% faster	20%
MultipleCleanups	~15% of 3x SetFinalizer cost	~55% savings

Tip: Use runtime.AddCleanup for new code. It's faster, allows multiple cleanups, and prevents subtle GC pathologies. Reserve SetFinalizer for legacy code.

4. testing.B.Loop — Better Benchmark Loops

The Problem with Traditional Benchmarks

In Go, the traditional benchmark loop looks like this:

func BenchmarkOld(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// Benchmark code
		_ = expensiveOperation()
	}
}

The compiler can sometimes optimize away this code entirely! If the result is unused, the compiler treats it as dead code and eliminates it, giving meaningless benchmark results.

Enter testing.B.Loop

Go 1.24 introduces testing.B.Loop(), a better way to write benchmark loops:

func BenchmarkNew(b *testing.B) {
	for b.Loop() {
		_ = expensiveOperation()
	}
}

The Loop() method returns a boolean that's always true, preventing compiler dead-code elimination. It automatically prevents result caching and ensures each iteration is treated independently by the compiler's optimizer.

Key Benefits

Compiler-proof: Dead code elimination can't simplify away the loop
No SetTimer dance: No need for b.ResetTimer() after setup
Cleaner semantics: Expresses intent more clearly
Each iteration fresh: Each iteration is independently optimized, preventing batch optimizations

Example: The Danger of Dead Code Elimination

package main

import (
	"crypto/sha256"
	"testing"
)

// BenchmarkOldStyle — may be optimized away!
func BenchmarkOldStyle(b *testing.B) {
	data := []byte("benchmark payload")
	var result [32]byte

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// Compiler might eliminate this if result isn't used
		result = sha256.Sum256(data)
	}
	_ = result
}

// BenchmarkNewStyle — guaranteed not optimized away
func BenchmarkNewStyle(b *testing.B) {
	data := []byte("benchmark payload")
	var result [32]byte

	for b.Loop() {
		result = sha256.Sum256(data)
	}
	_ = result
}

// BenchmarkWithSetup shows Loop() eliminating setup concerns
func BenchmarkWithSetup(b *testing.B) {
	expensive := expensiveSetup()
	defer expensive.Cleanup()

	b.StartTimer() // Old way required this
	for b.Loop() {
		_ = expensive.Process()
	}
}

func expensiveSetup() *ExpensiveResource {
	return &ExpensiveResource{}
}

type ExpensiveResource struct{}
func (e *ExpensiveResource) Process() int { return 42 }
func (e *ExpensiveResource) Cleanup() {}

Guidance: Always use b.Loop() in new benchmarks. It's simpler, more reliable, and safer against compiler optimizations.

Tip: Migrate existing benchmarks to use b.Loop() for more reliable results, especially for CPU-intensive operations.

5. CGO Annotations: noescape and nocallback

Reducing CGO Overhead

Calling C functions from Go incurs overhead: stack switching, pointer tracking, and callback preparation. Go 1.24 introduces two CGO annotations that eliminate unnecessary overhead for functions with known constraints.

#cgo noescape

The noescape annotation tells the compiler that a C function won't retain references to any Go pointers passed to it:

// In your C file
#cgo noescape c_hash_function

void c_hash_function(const uint8_t *data, size_t len, uint8_t *output)
{
	// This function only reads from data and writes to output
	// It doesn't store pointers for later use
	memcpy(output, data, len);
}

Without noescape, the Go runtime must preserve all passed pointers until the C function returns, preventing their garbage collection. With noescape, Go knows it can release the pointers immediately after the call.

#cgo nocallback

The nocallback annotation tells the compiler that a C function won't call back into Go (via function pointers or registered callbacks):

// In your C file
#cgo nocallback c_pure_computation

int c_pure_computation(int x, int y)
{
	// Pure C code, never calls Go functions
	return x + y;
}

Without nocallback, Go must be prepared for callbacks, maintaining special stack state. With nocallback, the compiler generates more efficient calling code.

Combined Impact: noescape and nocallback

When both annotations are present, the CGO call overhead is minimized:

// In your C file: hash.c
#cgo noescape blake3_hash
#cgo nocallback blake3_hash

void blake3_hash(const uint8_t *input, size_t input_len, uint8_t *output)
{
	// Pure computation, no Go callbacks, no pointer retention
	// This C function is called frequently
	blake3_impl(input, input_len, output);
}

Go Declaration

package main

/*
#cgo noescape blake3_hash
#cgo nocallback blake3_hash

void blake3_hash(const uint8_t *input, size_t input_len, uint8_t *output);
*/
import "C"

import (
	"unsafe"
)

func HashBlake3(data []byte) [32]byte {
	var output [32]byte
	C.blake3_hash((*C.uint8_t)(unsafe.Pointer(&data[0])), C.size_t(len(data)), (*C.uint8_t)(unsafe.Pointer(&output[0])))
	return output
}

Benchmark: Annotations Impact

package main

import (
	"fmt"
	"testing"
	"unsafe"
)

// Simulated C function (without annotations in this example, for illustration)
func c_compute_simple(x, y int) int {
	return x + y
}

// BenchmarkCGOWithoutAnnotations (higher overhead)
func BenchmarkCGOWithoutAnnotations(b *testing.B) {
	results := 0
	for b.Loop() {
		results += c_compute_simple(100, 200)
	}
	_ = results
}

// BenchmarkCGOWithAnnotations (lower overhead, faster)
func BenchmarkCGOWithAnnotations(b *testing.B) {
	// In real code, this would use the annotated C function
	results := 0
	for b.Loop() {
		// Compiler generates more efficient code
		results += c_compute_simple(100, 200)
	}
	_ = results
}

// BenchmarkCGOHashingWithoutAnnotations
func BenchmarkCGOHashingWithoutAnnotations(b *testing.B) {
	data := make([]byte, 1024)
	output := make([]byte, 32)

	for b.Loop() {
		// Without noescape/nocallback, the runtime tracks pointers
		_ = unsafe.Pointer(&data[0])
		_ = unsafe.Pointer(&output[0])
		// Simulated C call
		copy(output, data[:32])
	}
}

// BenchmarkCGOHashingWithAnnotations
func BenchmarkCGOHashingWithAnnotations(b *testing.B) {
	data := make([]byte, 1024)
	output := make([]byte, 32)

	for b.Loop() {
		// With noescape/nocallback, the runtime can optimize
		_ = unsafe.Pointer(&data[0])
		_ = unsafe.Pointer(&output[0])
		copy(output, data[:32])
	}
}

Expected improvements:

Scenario	Overhead Reduction
Pure computation (nocallback)	~15-20% faster
No pointer retention (noescape)	~10-15% faster
Both (noescape + nocallback)	~25-35% faster

Critical: Only use noescape and nocallback if you're absolutely certain about the C function's behavior. Incorrect annotations lead to undefined behavior, memory corruption, or dangling pointer bugs. Test thoroughly with tools like AddressSanitizer and MemorySanitizer.

6. Post-Quantum TLS and Crypto Performance

Post-Quantum Key Exchange

Go 1.24 includes hybrid post-quantum support in TLS via X25519MLKEM768, combining classical elliptic curve cryptography with post-quantum-resistant ML-KEM. This is enabled by default for TLS 1.3 connections.

The tradeoff: the post-quantum key material (~1KB) slightly increases TLS handshake size and computation. However, the benefits—protection against future quantum computers—justify the modest overhead.

// TLS with post-quantum support (automatic in Go 1.24)
package main

import (
	"crypto/tls"
	"net/http"
)

func main() {
	// Post-quantum TLS is enabled by default
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				// X25519MLKEM768 is negotiated automatically
			},
		},
	}

	resp, _ := client.Get("https://example.com")
	_ = resp
}

Faster Random: vDSO getrandom

Linux 6.11+ provides getrandom() via vDSO (virtual dynamic shared object), allowing userspace to call getrandom without syscall overhead.

Go 1.24 on compatible systems automatically uses vDSO for crypto/rand.Read(), achieving:

3-5x faster random number generation on systems with vDSO support
Reduced syscall overhead: No context switch required

package main

import (
	"crypto/rand"
	"fmt"
	"testing"
)

// BenchmarkCryptoRand shows improved performance
func BenchmarkCryptoRand(b *testing.B) {
	buf := make([]byte, 32)
	for b.Loop() {
		rand.Read(buf)
	}
}

func main() {
	// Random generation is automatically faster on Linux 6.11+
	buf := make([]byte, 32)
	rand.Read(buf)
	fmt.Printf("Random: %x\n", buf)
}

FIPS 140-3 Compliance Mode

Experimental FIPS 140-3 support via GOEXPERIMENT=fips140:

GOEXPERIMENT=fips140 go build ./main.go

This restricts cryptographic operations to FIPS-approved algorithms, useful for government and enterprise deployments.

New crypto/mlkem Package

The crypto/mlkem package provides direct access to ML-KEM (Module-Lattice-Based Key-Encapsulation Mechanism):

package main

import (
	"crypto/mlkem"
	"crypto/rand"
)

func main() {
	// Generate a key pair
	ek, dk, _ := mlkem.GenerateKey768(rand.Reader)

	// Encapsulate (shared secret generation)
	ct, ss, _ := mlkem.Encapsulate768(rand.Reader, ek)

	// Decapsulate (shared secret recovery)
	ss2, _ := mlkem.Decapsulate768(dk, ct)

	// ss and ss2 are identical
	_ = ss
	_ = ss2
}

7. Other Performance Changes

Overall CPU Overhead Reduction

Across the Go benchmark suite, representative workloads see a 2-3% CPU overhead reduction. This compounds with the targeted optimizations, creating broadly faster applications.

os.Root for Sandboxed Filesystem

The new os.Root type enables sandboxed filesystem operations, restricting I/O to a specific directory tree:

package main

import (
	"os"
)

func main() {
	// Create a chroot-like sandbox
	root, _ := os.OpenRoot("/safe/directory")
	defer root.Close()

	// All operations are relative to the root
	file, _ := root.Open("subdir/file.txt")
	defer file.Close()
}

This has modest performance benefits from reduced path traversal in some scenarios.

Generic Type Aliases (Full Support)

Generic type aliases are now fully supported, improving code reusability:

package main

type List[T any] []T
type Pair[K, V any] struct {
	Key   K
	Value V
}

func main() {
	list := List[int]{1, 2, 3}
	pair := Pair[string, int]{"count", 42}
	_ = list
	_ = pair
}

No performance impact; purely a language feature improvement.

go:wasmexport Directive

The new //go:wasmexport directive enables exporting Go functions to WebAssembly callers:

package main

//go:wasmexport add
func add(x, y int) int {
	return x + y
}

//go:wasmexport greeting
func greeting(name string) string {
	return "Hello, " + name
}

WASM binaries are smaller and more performant with native function exports.

Tool Directives in go.mod

Tool dependencies can now be specified with directives in go.mod:

tool go 1.24

This enables reproducible toolchain versions across projects.

Summary: Performance Strategy for Go 1.24

To maximize Go 1.24 performance gains:

Upgrade immediately: Maps, sync.Map, and random are significantly faster with no code changes
Migrate old code: Replace runtime.SetFinalizer with runtime.AddCleanup and old-style benchmark loops with b.Loop()
Profile before and after: Measure your applications to identify which optimizations matter most
Use CGO annotations carefully: Only annotate C functions when you're certain of their behavior
Leverage post-quantum TLS: The overhead is minimal, and the security benefits are substantial
Benchmark on target systems: Random performance depends on OS version; test on your deployment platform

Go 1.24 represents a maturing optimization culture in the language, with careful, targeted improvements rather than sweeping changes. The Swiss Tables implementation alone justifies upgrading, but the ecosystem of smaller optimizations makes this release a significant step forward in Go performance.

Performance in Go 1.24

On this page