Go Performance Guide
I/O & Data Handling

Zero-Copy Techniques

Eliminate unnecessary memory copies using sendfile, mmap, unsafe operations, and io.Reader composition for maximum throughput.

Traditional I/O and the Cost of Copying

Traditional I/O involves multiple data copies, each consuming CPU cycles and memory bandwidth:

  1. DMA Read: Kernel reads data from disk into kernel buffer (hardware, no CPU cost)
  2. CPU Copy 1: Kernel copies data to user-space buffer (expensive: memory bandwidth + CPU cycles)
  3. Application Processing: Your code reads from user buffer
  4. CPU Copy 2: Application copies data to write buffer (another expensive copy)
  5. DMA Write: NIC pulls data from the kernel buffer via DMA (hardware, no CPU cost)

For a typical 1MB file transfer over 1Gbps network:

  • Transfer time: ~8ms (limited by network)
  • Copy overhead: 3-5ms (memory bandwidth limited)
  • Total with traditional I/O: 11-13ms
  • With zero-copy: 8ms (network limited only)

The bottleneck is memory bandwidth: modern CPUs can read/write ~50-100GB/sec, but each copy competes for this limited resource.

Traditional I/O Path Analysis

import (
	"io"
	"os"
)

// Traditional I/O: Multiple copies and context switches
func traditionalFileCopy(src, dst string) error {
	source, err := os.Open(src)
	if err != nil {
		return err
	}
	defer source.Close()

	dest, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer dest.Close()

	// This involves:
	// 1. Read syscall: kernel → user buffer (context switch + copy)
	// 2. Write syscall: user buffer → kernel buffer (context switch + copy)
	// Repeated for each block
	buffer := make([]byte, 32*1024) // 32KB buffer
	_, err = io.CopyBuffer(dest, source, buffer)
	return err
}

// Memory bandwidth calculation:
// Modern server memory: ~100GB/sec bandwidth
// 1GB file transfer with one copy:
// - Copy time: 1GB / 100GB/sec = 10ms
// With two copies:
// - Copy time: 2GB / 100GB/sec = 20ms (doubled!)
// With sendfile (no user-space copy):
// - Copy time: 1GB / 100GB/sec = 10ms

// Illustrative measurements (1GB file, 10Gbps network; exact numbers vary by hardware):
// os.Read + syscall.Write: 1.2 seconds (bottleneck: copies + syscalls)
// io.Copy: 1.1 seconds (uses sendfile when available)
// mmap: 0.9 seconds (eliminates read syscalls)
// sendfile: 0.8 seconds (kernel zero-copy, network limited)

The sendfile Syscall: Zero-Copy at Kernel Level

The sendfile(2) syscall transfers data directly from file to socket without user-space copies:

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
	"testing"
)

// Go's io.Copy automatically detects sendfile-capable types
func serveFileOptimized(conn net.Conn, filePath string) error {
	file, err := os.Open(filePath)
	if err != nil {
		return err
	}
	defer file.Close()

	// io.Copy detects: file (regular file) + socket (network connection)
	// Automatically uses sendfile(2) syscall
	// Result: Direct kernel transfer, no user-space copies
	_, err := io.Copy(conn, file)
	return err
}

// HTTP server using sendfile automatically
func serveHTTPFile() {
	http.HandleFunc("/file", func(w http.ResponseWriter, r *http.Request) {
		// http.ServeFile uses sendfile internally for regular files
		http.ServeFile(w, r, "/path/to/large/file.bin")
	})
}

// Sendfile conditions (must ALL be true):
// 1. Source is regular file (not pipe, not buffered reader)
// 2. Destination is socket (TCP, Unix socket)
// 3. Not encrypted (TLS connections can't use sendfile)
// 4. OS support (Linux 2.4+, macOS, FreeBSD; Windows uses the analogous TransmitFile)

// Benchmark: io.Copy throughput with different file sizes
func BenchmarkSendfileVsRead(b *testing.B) {
	// Create test file
	file, _ := os.CreateTemp("", "test-*.bin")
	defer os.Remove(file.Name())

	fileSizes := []int64{1024 * 1024, 100 * 1024 * 1024, 1024 * 1024 * 1024} // 1MB, 100MB, 1GB

	for _, fileSize := range fileSizes {
		file.Seek(0, 0)
		file.Truncate(fileSize)

		b.Run(fmt.Sprintf("FileSize=%dMB", fileSize/1024/1024), func(b *testing.B) {
			// Benchmark actual file transfer
			listener, _ := net.Listen("tcp", "localhost:0")
			defer listener.Close()

			// Server: receives data
			go func() {
				conn, _ := listener.Accept()
				defer conn.Close()

				buf := make([]byte, 32*1024)
				for {
					n, err := conn.Read(buf)
					if n == 0 || err != nil {
						break
					}
				}
			}()

			// Client: sends file
			conn, _ := net.Dial("tcp", listener.Addr().String())
			defer conn.Close()

			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				file.Seek(0, 0)
				io.Copy(conn, file)
			}
		})
	}
}

// Expected throughput over a real 1Gbps link (~125MB/sec), sendfile used:
// - 1MB file: ~125 ops/sec
// - 100MB file: ~1.25 ops/sec
// - 1GB file: ~0.125 ops/sec
// (network bandwidth limited; on loopback, as in the benchmark above,
// throughput is memory-bandwidth limited instead and runs far higher)

// vs manual Read+Write (without sendfile):
// the extra user-space copies and syscalls add roughly 10-20% CPU overhead,
// which lowers throughput once the network is no longer the bottleneck

Memory-Mapped Files with mmap

Memory mapping a file makes it accessible as a byte slice without read syscalls:

import (
	"bytes"
	"fmt"
	"os"
	"syscall"
	"testing"

	"golang.org/x/exp/mmap"
)

// Open file with memory mapping
func processMmappedFile(path string) error {
	r, err := mmap.Open(path)
	if err != nil {
		return err
	}
	defer r.Close()

	// Access file contents as if in memory
	// Kernel handles page faults lazily
	data := make([]byte, 100)
	n, _ := r.ReadAt(data, 0)
	fmt.Printf("Read %d bytes\n", n)
	return nil
}

// Zero-copy byte slice via mmap (Linux/macOS; uses the raw syscall because
// golang.org/x/exp/mmap does not expose its underlying mapping)
func processMmappedFileZeroCopy(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		return err
	}

	// Map the file read-only into the process address space
	// DANGEROUS: touching the slice after Munmap segfaults, and
	// truncation of the file by another process raises SIGBUS
	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		return err
	}
	defer syscall.Munmap(data)

	// Now process entire file as contiguous byte slice, no read syscalls
	idx := bytes.Index(data, []byte("search_term"))
	fmt.Printf("Found at: %d\n", idx)
	return nil
}

// mmap Advantages:
// - Eliminates read() syscalls after initial mapping
// - Kernel handles paging automatically (TLB management)
// - Sequential access: very fast (prefetching by kernel)
// - Random access: moderate speed (kernel page cache)
// - Large files: virtual address space mapped, not physical RAM

// mmap Disadvantages:
// - Entire file mapped into virtual address space
// - On 32-bit systems: 4GB address space limits file size
// - Page faults on access (one per 4KB page)
// - TLB shootdown on large files (expensive multi-core sync)
// - Complex error handling (SIGBUS on file truncation/corruption)
// - Not ideal for small files (overhead > benefit)

// Decision tree:
// File size < 1MB: Use os.Read (simpler, less overhead)
// File size 1-100MB, sequential access: Use mmap (faster)
// File size > 100MB: Consider mmap trade-offs carefully
// Shared memory between processes: Use mmap (only option)
// Random access to specific records: Use mmap
// Sequential streaming: Use io.Copy with sendfile

// Benchmark: mmap vs Read for different access patterns
func BenchmarkMmapVsRead(b *testing.B) {
	// Create test file
	file, _ := os.CreateTemp("", "test-*.bin")
	defer os.Remove(file.Name())

	// Write 100MB of test data (in 1MB chunks)
	chunk := bytes.Repeat([]byte("test line\n"), 100*1024)
	for i := 0; i < 100; i++ {
		file.Write(chunk)
	}
	file.Close()

	b.Run("Read_Sequential", func(b *testing.B) {
		f, _ := os.Open(file.Name())
		defer f.Close()

		b.ResetTimer()
		buf := make([]byte, 8192)
		for i := 0; i < b.N; i++ {
			f.Seek(0, 0)
			for {
				n, err := f.Read(buf)
				if n == 0 || err != nil {
					break
				}
			}
		}
		// Result: ~100MB/sec range (one read syscall per 8KB block)
	})

	b.Run("Mmap_Sequential", func(b *testing.B) {
		r, _ := mmap.Open(file.Name())
		defer r.Close()

		b.ResetTimer()
		buf := make([]byte, 8192)
		for i := 0; i < b.N; i++ {
			for off := int64(0); off < int64(r.Len()); off += int64(len(buf)) {
				r.ReadAt(buf, off) // no syscall after the initial mapping
			}
		}
		// Result: several times faster than Read (kernel prefetches pages)
	})

	b.Run("Mmap_Random", func(b *testing.B) {
		r, _ := mmap.Open(file.Name())
		defer r.Close()

		size := int64(r.Len())
		data := make([]byte, 100)
		b.ResetTimer()

		for i := 0; i < b.N; i++ {
			offset := (int64(i) * 4096) % size
			r.ReadAt(data, offset)
		}
		// Result: very fast once pages are cached; first touch of each page faults
	})
}

unsafe.String and unsafe.Slice: Unsafe Conversions

import (
	"strconv"
	"testing"
	"unsafe"
)

// unsafe.String (Go 1.20+): string from []byte without copying
func stringFromBytesUnsafe() {
	data := []byte("hello world")

	// Safe way (copies data)
	str1 := string(data)

	// Unsafe way (no copy, but dangerous)
	// Go 1.20+ provides unsafe.String for this pattern
	str2 := unsafe.String(unsafe.SliceData(data), len(data))

	// Both str1 and str2 contain "hello world"
	// str2 was created without copying, but with risks:
	_ = []interface{}{str1, str2}
}

// CRITICAL RULES for safe unsafe.String usage:
// 1. Source []byte must not be modified while string exists
// 2. Source []byte must not be garbage collected
// 3. String must not escape the function containing the conversion
// 4. Violating any rule causes undefined behavior (memory corruption)

// unsafe.Slice (Go 1.17+) with unsafe.StringData (Go 1.20+): []byte from string without copying
func bytesFromStringUnsafe() {
	str := "immutable string"

	// Safe way (copies data)
	data1 := []byte(str)

	// Unsafe way (no copy, read-only view)
	data2 := unsafe.Slice(unsafe.StringData(str), len(str))

	// data2 is a read-only view of str's data
	// MUST NOT modify data2 (Go strings are immutable)
	// Modifying data2 causes undefined behavior

	_ = []interface{}{data1, data2}
}

// Realistic use case: hot path feeding []byte to a string-only API
// (for APIs that accept []byte directly, e.g. json.Unmarshal,
// skip the conversion entirely and pass the []byte)
func atoiNoCopy(data []byte) (int, error) {
	// Avoid copying data into a fresh string just for this call.
	// ONLY valid because strconv.Atoi doesn't retain its argument,
	// so the string view never outlives the []byte backing it.
	str := unsafe.String(unsafe.SliceData(data), len(data))
	return strconv.Atoi(str)
}

// Safe patterns using unsafe:
// 1. Temporary conversion within function scope
// 2. Data source remains valid (e.g., global buffer)
// 3. Only reading the data (no modifications)
// 4. Performance-critical path (profiling confirmed benefit)

// Benchmark: Cost of string/[]byte conversions
func BenchmarkConversions(b *testing.B) {
	testString := "the quick brown fox jumps over the lazy dog"

	b.Run("StringToBytes_Safe", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			_ = []byte(testString)
		}
		// Result: 1 alloc per iteration, ~100ns
	})

	b.Run("StringToBytes_Unsafe", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			_ = unsafe.Slice(unsafe.StringData(testString), len(testString))
		}
		// Result: 0 allocs, ~5ns (20x faster)
	})

	b.Run("BytesToString_Safe", func(b *testing.B) {
		data := []byte(testString)
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			_ = string(data)
		}
		// Result: 1 alloc per iteration, ~100ns
	})

	b.Run("BytesToString_Unsafe", func(b *testing.B) {
		data := []byte(testString)
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			_ = unsafe.String(unsafe.SliceData(data), len(data))
		}
		// Result: 0 allocs, ~5ns (20x faster)
	})
}

// Use unsafe conversions ONLY in:
// - Hot paths (confirmed by profiling: >1% of CPU time)
// - Temporary conversions (don't escape scope)
// - Performance-critical systems (sub-millisecond latency)
// Avoid unsafe for:
// - Library code (easy to misuse)
// - Storing converted values
// - Complex workflows (hard to verify safety)

io.Reader Composition and Pipeline Patterns

import (
	"io"
	"strings"
)

// Avoid intermediate buffers with io.Pipe
func pipeComposition() {
	// Anti-pattern: Materialize an intermediate buffer, then consume it
	r := strings.NewReader("input data")
	var buf1 strings.Builder
	io.Copy(&buf1, r)
	_ = buf1.String() // extra copy held in memory

	// Better: Chain with io.Pipe (synchronous, no internal buffer;
	// data moves directly from writer to reader)
	r = strings.NewReader("input data")
	pr, pw := io.Pipe()

	go func() {
		io.Copy(pw, r)
		pw.Close()
	}()

	var buf2 strings.Builder
	io.Copy(&buf2, pr)
	// io.Pipe copies directly between the Write and Read calls,
	// with no intermediate buffer of its own
}

// Zero-copy fanout with io.TeeReader
func teeReaderPattern() {
	r := strings.NewReader("data")

	// TeeReader writes to w while reading from r
	// No intermediate buffer needed
	var out1, out2 strings.Builder
	tee := io.TeeReader(r, &out1)
	io.Copy(&out2, tee)

	// out1 and out2 contain same data, shared underlying reads
	// Single pass through source data
}

// MultiReader: Concatenate sources without buffering
func multiReaderPattern() {
	r1 := strings.NewReader("part1")
	r2 := strings.NewReader("part2")
	r3 := strings.NewReader("part3")

	// MultiReader appears as single reader
	// No intermediate buffer, reads sequentially from each
	combined := io.MultiReader(r1, r2, r3)

	var buf strings.Builder
	io.Copy(&buf, combined) // Single copy into buf
}

// Custom reader: Minimal copying
type FilteredReader struct {
	source io.Reader
}

func (f *FilteredReader) Read(p []byte) (int, error) {
	// Process data in-place in p
	n, err := f.source.Read(p)
	// Transform p[0:n] in-place
	for i := 0; i < n; i++ {
		p[i] = filterByte(p[i])
	}
	return n, err
}

func filterByte(b byte) byte {
	if b >= 'a' && b <= 'z' {
		return b - 32 // Convert to uppercase
	}
	return b
}

bytes.Buffer vs strings.Builder Performance

import (
	"bytes"
	"strings"
	"testing"
)

// bytes.Buffer: Read/write, internal array
// strings.Builder: Write-only, optimized for string building

func BenchmarkBufferVsBuilder(b *testing.B) {
	b.Run("BytesBuffer", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			var buf bytes.Buffer
			buf.WriteString("hello")
			buf.WriteString(" ")
			buf.WriteString("world")
			_ = buf.String() // Allocates new string
		}
		// Result: several allocs per iteration (buffer growth + the String() copy)
	})

	b.Run("StringsBuilder", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			var sb strings.Builder
			sb.WriteString("hello")
			sb.WriteString(" ")
			sb.WriteString("world")
			_ = sb.String() // No allocation (implements String() efficiently)
		}
		// Result: fewer allocs (buffer growth only; Builder.String() doesn't copy)
	})

	b.Run("BytesBufferRepeated", func(b *testing.B) {
		b.ReportAllocs()
		var buf bytes.Buffer
		for i := 0; i < b.N; i++ {
			buf.Reset()
			buf.WriteString("test")
			_ = buf.Bytes()
		}
		// Result: ~0 allocs amortized (Buffer.Reset retains the underlying array)
	})

	b.Run("StringsBuilderRepeated", func(b *testing.B) {
		b.ReportAllocs()
		var sb strings.Builder
		for i := 0; i < b.N; i++ {
			sb.Reset()
			sb.WriteString("test")
			_ = sb.String()
		}
		// Result: 1 alloc per iteration (Builder.Reset discards its buffer)
	})
}

// Use strings.Builder for:
// - Building strings from parts
// - Single output string needed
// - No Read operations

// Use bytes.Buffer for:
// - Read + Write operations
// - Intermediate processing
// - When []byte output needed
// - Reuse across iterations (Buffer.Reset retains capacity; Builder's Reset does not)

Buffer Pooling for I/O-Heavy Code

import (
	"io"
	"sync"
	"testing"
)

// sync.Pool for buffer reuse
// (storing a []byte in an interface allocates a small header on each Put;
// pooling *[]byte or a wrapper struct avoids even that, per staticcheck SA6002)
func createBufferPool() *sync.Pool {
	return &sync.Pool{
		New: func() interface{} {
			return make([]byte, 32*1024) // 32KB buffer
		},
	}
}

// Efficient pooled file copy
var bufferPool = createBufferPool()

func copyWithPool(dst io.Writer, src io.Reader) error {
	buf := bufferPool.Get().([]byte)
	defer bufferPool.Put(buf)

	_, err := io.CopyBuffer(dst, src, buf)
	return err
}

// Benchmark: Pooled vs non-pooled buffer allocation
func BenchmarkBufferPooling(b *testing.B) {
	b.Run("WithoutPool", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			_ = make([]byte, 32*1024)
		}
		// Result: N allocs (allocate each time)
	})

	b.Run("WithPool", func(b *testing.B) {
		pool := createBufferPool()
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			buf := pool.Get().([]byte)
			pool.Put(buf)
		}
		// Result: no 32KB buffer allocs (only the small interface header per Put)
	})
}

// Custom reusable buffer type
type ReusableBuffer struct {
	data []byte
	pos  int
}

func NewReusableBuffer(capacity int) *ReusableBuffer {
	return &ReusableBuffer{
		data: make([]byte, capacity),
		pos:  0,
	}
}

func (rb *ReusableBuffer) Write(p []byte) (int, error) {
	if rb.pos+len(p) > cap(rb.data) {
		return 0, io.ErrShortBuffer
	}
	copy(rb.data[rb.pos:], p)
	rb.pos += len(p)
	return len(p), nil
}

func (rb *ReusableBuffer) Reset() {
	rb.pos = 0
}

func (rb *ReusableBuffer) Bytes() []byte {
	return rb.data[:rb.pos]
}

Avoiding Repeated Conversions

import (
	"strings"
	"testing"
)

// ANTI-PATTERN: Repeated conversions
func processStringBad(input string) {
	for i := 0; i < 1000; i++ {
		buf := []byte(input) // Copies 1000 times!
		_ = processBytes(buf)
	}
}

// PATTERN: Convert once, reuse
func processStringGood(input string) {
	buf := []byte(input) // Copy once
	for i := 0; i < 1000; i++ {
		_ = processBytes(buf)
	}
}

// Using strings instead of repeated conversions
func processWithStrings(input string) {
	for i := 0; i < 1000; i++ {
		_ = processString(input) // No copies
	}
}

func processBytes(b []byte) int {
	return len(b)
}

func processString(s string) int {
	return len(s)
}

// Benchmark: Impact of repeated conversions
func BenchmarkRepeatedConversions(b *testing.B) {
	input := strings.Repeat("test", 100)

	b.Run("Bad_RepeatedConversion", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			for j := 0; j < 1000; j++ {
				buf := []byte(input)
				_ = buf
			}
		}
		// Result: 1000 allocs per iteration, ~400KB allocated (input is 400 bytes)
	})

	b.Run("Good_SingleConversion", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			buf := []byte(input)
			for j := 0; j < 1000; j++ {
				_ = buf
			}
		}
		// Result: 1 alloc per iteration, ~400 bytes allocated
	})

	b.Run("Best_NoConversion", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			for j := 0; j < 1000; j++ {
				_ = input
			}
		}
		// Result: 0 allocs
	})
}

Real-World Example: High-Throughput File Server

import (
	"compress/gzip"
	"io"
	"net/http"
	"os"
	"sync"
)

type FileServer struct {
	bufferPool *sync.Pool
}

func NewFileServer() *FileServer {
	return &FileServer{
		bufferPool: &sync.Pool{
			New: func() interface{} {
				return make([]byte, 64*1024) // 64KB buffer
			},
		},
	}
}

func (fs *FileServer) ServeFile(w http.ResponseWriter, r *http.Request, path string) {
	file, err := os.Open(path)
	if err != nil {
		http.Error(w, "Not found", http.StatusNotFound)
		return
	}
	defer file.Close()

	// Pick ONE of the techniques below: running more than one
	// would write the response body multiple times.

	// Technique 1: Let http.ServeFile handle it (uses sendfile):
	//   http.ServeFile(w, r, path)

	// Technique 2: Manual io.Copy (uses sendfile via the response
	// writer when the connection allows it):
	//   w.Header().Set("Content-Type", "application/octet-stream")
	//   io.Copy(w, file)

	// Technique 3: Pooled buffer (good for custom headers/processing)
	w.Header().Set("Content-Type", "application/octet-stream")
	buf := fs.bufferPool.Get().([]byte)
	defer fs.bufferPool.Put(buf)
	io.CopyBuffer(w, file, buf)
}

// Benchmark results (1GB file, 1Gbps network):
// ServeFile (automatic sendfile): 8.5 seconds
// io.Copy (detected sendfile): 8.5 seconds
// io.CopyBuffer (pooled): 8.5 seconds
// All achieve same performance via sendfile
// Difference: Code clarity and flexibility

// For added processing (compression, encryption):
// ServeFile can't be used
// Must use io.Copy or io.CopyBuffer with processing layer
// Performance depends on processing algorithm, not I/O

func ServeCompressedFile(w http.ResponseWriter, r *http.Request, path string) {
	file, err := os.Open(path)
	if err != nil {
		http.Error(w, "Not found", http.StatusNotFound)
		return
	}
	defer file.Close()

	// Can't use sendfile with compression
	// Must buffer in user space
	w.Header().Set("Content-Encoding", "gzip")
	gzipWriter := gzip.NewWriter(w)
	defer gzipWriter.Close()

	io.Copy(gzipWriter, file)
	// Throughput limited by compression CPU, not I/O
}

Summary

Zero-copy techniques eliminate expensive memory copies that waste CPU cycles and bandwidth. The sendfile syscall achieves true zero-copy for file-to-socket transfers (use io.Copy with a file and socket; Go detects and uses sendfile automatically), providing 20-30% throughput improvement for large files. Memory-mapped files with mmap eliminate read syscalls for sequential access, achieving 4-5x throughput over buffered reads for large files (>100MB), but introduce complexity and address space pressure.

Unsafe operations like unsafe.String and unsafe.Slice provide 20x faster type conversions with zero allocations, but are dangerous and must follow strict rules (temporary scope, read-only, source stays valid). Composition patterns using io.Pipe, io.MultiReader, and io.TeeReader avoid intermediate buffers while maintaining composability. Use sync.Pool for buffer reuse in I/O-heavy code to eliminate allocation overhead, and avoid repeated type conversions (convert once at the boundary, reuse throughout). For file serving, io.Copy automatically uses sendfile when conditions permit (regular file + socket), achieving line-rate throughput (network limited).

Measure actual throughput impact before optimizing; network and I/O latency often dominate, making copy optimization less impactful than perceived. Reserve zero-copy techniques for confirmed bottlenecks in high-throughput services processing 1GB+ per second or handling millions of small requests.
