OS-Level Tuning for Go Applications
Linux kernel parameters, TCP tuning, file descriptor limits, huge pages, NUMA awareness, io_uring, and system-level optimizations for high-performance Go services.
Operating system configuration is often overlooked in Go performance optimization, yet it has profound effects on application behavior. A misconfigured Linux system can create artificial bottlenecks that no amount of Go code optimization can overcome. This article explores the critical OS-level parameters that affect Go service performance and provides practical tuning strategies.
File Descriptor Limits
File descriptors are the foundation of I/O in Unix-like systems. Every network connection, file, pipe, and socket consumes a file descriptor. Go applications, with their goroutine-per-connection model, are particularly susceptible to file descriptor exhaustion.
Understanding FD Limits
The Linux kernel maintains multiple file descriptor limits:
User-level soft limit (ulimit -n)
- Maximum number of file descriptors a single process can open
- Can be increased by the process itself (up to the hard limit)
- Default is typically 1024, dangerously low for modern services
User-level hard limit
- Maximum value the soft limit can be increased to without root
- Set in /etc/security/limits.conf or PAM configuration
- Requires root to exceed
System-wide limit (/proc/sys/fs/file-max)
- Maximum number of file descriptors across the entire system
- Default typically 10% of available RAM
- Affects all processes combined
Calculating Required FD Limits
For a Go HTTP server:
// Theoretical maximum connections per Go process
maxConnections := goMaxProcs * goroutinesPerCore * connectionBufferRatio
// Add overhead for:
// - Server listen sockets
// - DNS resolver sockets
// - Logging file descriptors
// - Database connections
// - Other I/O resources
recommendedLimit := maxConnections * 1.5 // Safety margin
A production service handling 100k concurrent connections needs:
- Each connection ≈ 1 FD (HTTP requests)
- Database pool: 50-200 connections
- Listen socket: 1 per port
- Recommended: 120,000-150,000 FD limit
Configuration Methods
Temporary (per-shell session):
ulimit -n 65536 # Soft limit
ulimit -H -n 65536 # Hard limit
Persistent (PAM limits.conf):
# /etc/security/limits.conf
* soft nofile 65536
* hard nofile 65536
* soft nproc 65536
* hard nproc 65536
For systemd services:
# /etc/systemd/system/myapp.service
[Service]
LimitNOFILE=65536
LimitNPROC=65536
In Go code (informational only):
package main
import (
"fmt"
"syscall"
)
func getFileDescriptorLimits() error {
var rlim syscall.Rlimit
if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
return err
}
fmt.Printf("Soft limit: %d\n", rlim.Cur)
fmt.Printf("Hard limit: %d\n", rlim.Max)
// Attempt to increase if below recommended
if rlim.Cur < 65536 {
rlim.Cur = 65536
if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
fmt.Printf("Warning: Could not increase FD limit: %v\n", err)
}
}
return nil
}
Detecting FD Exhaustion
When approaching FD limits, Go applications fail with cryptic "too many open files" errors:
package main
import (
	"log"
	"os"
	"runtime"
	"syscall"
	"time"
)
func monitorFDUsage() {
ticker := time.NewTicker(30 * time.Second)
defer ticker.Stop()
var rlim syscall.Rlimit
syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim)
for range ticker.C {
// Count open FDs in /proc
fdDir := "/proc/self/fd"
entries, err := os.ReadDir(fdDir)
if err != nil {
continue
}
usage := len(entries)
percentage := float64(usage) / float64(rlim.Cur) * 100
// Alert when approaching limit
if percentage > 80 {
log.Printf("WARNING: FD usage at %.1f%% (%d/%d)\n",
percentage, usage, rlim.Cur)
}
var m runtime.MemStats
runtime.ReadMemStats(&m)
log.Printf("FDs: %d/%d (%.1f%%) | Goroutines: %d | Memory: %d MB\n",
usage, rlim.Cur, percentage, runtime.NumGoroutine(),
m.Alloc/1024/1024)
}
}
func init() {
go monitorFDUsage()
}
System-Wide FD Limit
Check the system-wide limit:
cat /proc/sys/fs/file-max
# Increase if needed
echo 2097152 | sudo tee /proc/sys/fs/file-max
For permanent configuration, add to /etc/sysctl.conf:
fs.file-max = 2097152
TCP Tuning for High-Throughput
TCP parameter tuning is critical for Go services handling thousands of concurrent connections. The default kernel settings are conservative, optimized for general workloads rather than high-performance scenarios.
Listen Socket Backlog
The listen socket maintains a queue of incoming connections waiting to be accepted:
net.core.somaxconn: Maximum length of the accept backlog
# Default: 128 (too low for high-concurrency services)
cat /proc/sys/net/core/somaxconn
sudo sysctl -w net.core.somaxconn=4096
# Go's net.Listen reads this value from /proc and uses it as the listen backlog automatically
net.ipv4.tcp_max_syn_backlog: SYN flood protection queue
# Default: 1024
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=4096
net.core.netdev_max_backlog: Network device input queue
# Default: 1000
sudo sysctl -w net.core.netdev_max_backlog=2000
Setting backlog in Go: there is no per-listener API. Go's net package passes the value of /proc/sys/net/core/somaxconn to the listen(2) syscall, so raising the sysctl is the way to raise the backlog. (Calling syscall.Listen from a ListenConfig.Control hook does not work: Control runs before the socket is bound, and Go issues its own listen call afterwards.) You can verify the effective cap at startup:
package main
import (
	"fmt"
	"os"
	"strconv"
	"strings"
)
// somaxconn returns the kernel accept-backlog cap that Go's
// net.Listen will use for new listeners.
func somaxconn() (int, error) {
	data, err := os.ReadFile("/proc/sys/net/core/somaxconn")
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(data)))
}
func main() {
	n, err := somaxconn()
	if err != nil {
		fmt.Println("could not read somaxconn:", err)
		return
	}
	fmt.Println("listen backlog cap:", n)
}
TIME_WAIT Socket Reuse
Closing connections leaves sockets in TIME_WAIT (60 seconds on Linux) to prevent stray duplicate packets from corrupting new connections. These sockets accumulate quickly under load.
net.ipv4.tcp_tw_reuse: Reuse TIME_WAIT sockets for new connections (client-side)
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_tw_recycle: Aggressive TIME_WAIT recycling (dangerous, breaks clients behind NAT; removed entirely in Linux 4.12)
# Generally NOT recommended - can cause issues with NAT
# Only use if you fully control the network
sudo sysctl -w net.ipv4.tcp_tw_recycle=0
net.ipv4.tcp_fin_timeout: How long orphaned connections stay in FIN_WAIT_2 (contrary to popular belief, this does not shorten TIME_WAIT, whose 60-second duration is compiled into the kernel)
# Default: 60 (too long for high-churn workloads)
# Reduce to 30 or 20 for services with many short-lived connections
sudo sysctl -w net.ipv4.tcp_fin_timeout=30
TCP Buffer Sizes
Kernel buffers for TCP send/receive:
# Defaults vary by kernel (commonly ~212992 bytes)
cat /proc/sys/net/core/rmem_default
cat /proc/sys/net/core/wmem_default
# Maximums:
cat /proc/sys/net/core/rmem_max
cat /proc/sys/net/core/wmem_max
# For high-throughput, long-latency connections (e.g., geo-distributed):
sudo sysctl -w net.core.rmem_max=134217728 # 128MB
sudo sysctl -w net.core.wmem_max=134217728 # 128MB
# Per-protocol tuning:
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"
Setting socket buffer sizes in Go:
import (
"net"
"syscall"
)
func dialWithBufferSize(addr string, bufferSize int) (net.Conn, error) {
	d := net.Dialer{
		Control: func(network, address string, c syscall.RawConn) error {
			var opErr error
			err := c.Control(func(fd uintptr) {
				// Requested sizes are capped by net.core.rmem_max/wmem_max
				if opErr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET,
					syscall.SO_RCVBUF, bufferSize); opErr != nil {
					return
				}
				opErr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET,
					syscall.SO_SNDBUF, bufferSize)
			})
			if err != nil {
				return err
			}
			return opErr
		},
	}
	return d.Dial("tcp", addr)
}
TCP Keepalive
Detect dead connections and prevent resource leaks:
# Time before sending keepalive probe (default: 7200s = 2 hours)
sudo sysctl -w net.ipv4.tcp_keepalive_time=600
# Interval between probes (default: 75s)
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=60
# Number of probes before giving up (default: 9)
sudo sysctl -w net.ipv4.tcp_keepalive_probes=5
In Go, set keepalive per connection:
import (
	"net"
	"time"
)
func setKeepalive(conn net.Conn, idle, interval time.Duration, count int) error {
	tcpConn, ok := conn.(*net.TCPConn)
	if !ok {
		return nil
	}
	// Go 1.23+ exposes all three keepalive parameters; earlier versions
	// only expose the idle period via SetKeepAlivePeriod.
	return tcpConn.SetKeepAliveConfig(net.KeepAliveConfig{
		Enable:   true,
		Idle:     idle,
		Interval: interval,
		Count:    count,
	})
}
SO_REUSEPORT for Load Balancing
Allow multiple sockets to bind to the same address:port for userspace load balancing:
# Kernel support required (Linux 3.9+)
# The constant lives in golang.org/x/sys/unix (the frozen stdlib syscall package does not define SO_REUSEPORT on Linux)
package main
import (
	"context"
	"net"
	"runtime"
	"syscall"

	"golang.org/x/sys/unix"
)
func createReusePortListener(port string) (net.Listener, error) {
lc := net.ListenConfig{
Control: func(network, address string, c syscall.RawConn) error {
var opErr error
err := c.Control(func(fd uintptr) {
opErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET,
unix.SO_REUSEPORT, 1)
})
if err != nil {
return err
}
return opErr
},
}
return lc.Listen(context.Background(), "tcp", ":"+port)
}
// Usage: Multiple goroutines can listen on the same port
// Kernel distributes incoming connections fairly
func main() {
numListeners := runtime.NumCPU()
for i := 0; i < numListeners; i++ {
ln, err := createReusePortListener("8080")
if err != nil {
panic(err)
}
go serveConnections(ln) // serveConnections: your accept loop
}
}
This allows each CPU core to accept connections independently without contention, significantly improving throughput for accept-bound workloads.
TCP_NODELAY vs Nagle's Algorithm
Nagle's algorithm batches small packets to improve efficiency, but increases latency:
# Disable Nagle's algorithm for latency-sensitive applications
# Go's net package already sets TCP_NODELAY on new TCP connections
import (
	"net"
)
func disableNagle(conn net.Conn) error {
tcpConn, ok := conn.(*net.TCPConn)
if !ok {
return nil
}
return tcpConn.SetNoDelay(true)
}
Memory Tuning
Transparent Huge Pages
Transparent Huge Pages (THP) automatically promote 4KB pages to 2MB pages, reducing TLB pressure. However, Go's allocator often suffers from THP overhead:
# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# Output: [always] madvise never
# Disable THP for Go applications (usually recommended)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
# Or, use madvise mode and let Go opt-in per region
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
Why THP often hurts Go:
- Go's allocator returns fragmented memory; THP promotion overhead outweighs benefits
- Swapping of 2MB pages causes worse latency spikes
- No opt-in mechanism in standard Go runtime
VM Overcommit
Controls memory overcommit behavior:
# 0: Heuristic overcommit (default, unpredictable)
# 1: Always allow overcommit (risky, can lead to OOM kills)
# 2: Conservative - never overcommit (safe)
cat /proc/sys/vm/overcommit_memory
sudo sysctl -w vm.overcommit_memory=2 # Recommended for predictable services
# Set overcommit ratio
sudo sysctl -w vm.overcommit_ratio=50 # With mode 2, commit limit = swap + 50% of physical RAM
Swappiness
Controls swap aggressiveness:
# Default: 60
# Lower = less swap, better latency but risk OOM
# Higher = more swap, worse latency but handles spikes
# For latency-sensitive services:
sudo sysctl -w vm.swappiness=10
# Check current setting
cat /proc/sys/vm/swappiness
Memory Locking (mlock)
Pin critical memory regions to prevent swapping:
import (
	"golang.org/x/sys/unix"
)
func lockMemory() error {
// Lock entire process memory (requires CAP_IPC_LOCK or running as root)
// Not recommended for most Go applications - too restrictive
return unix.Mlockall(unix.MCL_CURRENT | unix.MCL_FUTURE)
}
func lockRegion(data []byte) error {
// Lock specific allocation
return unix.Mlock(data)
}
Use mlock only for critical buffers in deterministic-latency systems (trading memory throughput for latency).
MADV_FREE vs MADV_DONTNEED
Go's memory return mechanism:
// Go 1.12-1.15 returned memory with MADV_FREE by default; Go 1.16 reverted to MADV_DONTNEED
// MADV_FREE: kernel reclaims lazily under memory pressure, but pages stay counted in RSS
// MADV_DONTNEED: pages leave RSS immediately; the next access takes a page fault
MADV_FREE avoids page faults on reuse but inflates apparent RSS. On Go 1.12-1.15, set GODEBUG=madvdontneed=1 if accurate RSS reporting matters:
GODEBUG=madvdontneed=1 ./myapp
NUMA Awareness
Non-Uniform Memory Access (NUMA) architectures have multiple memory nodes, each attached to CPU sockets. Access latency depends on which CPU and memory node.
NUMA Fundamentals
# Check NUMA topology
numactl --hardware
# Output example:
# available: 2 nodes (0-1)
# node 0 cpus: 0-15
# node 1 cpus: 16-31
# node 0 size: 64000 MB
# node 1 size: 64000 MB
# node 0 free: 32000 MB
# node 1 free: 28000 MB
Cross-socket memory access can add 50-100ns latency versus local access.
Process Binding with numactl
# Run on specific node
numactl --cpunodebind=0 --membind=0 ./myapp
# Pin to specific CPUs
numactl --physcpubind=0-15 ./myapp
# Interleave memory across nodes (useful for load balancing)
numactl --interleave=all ./myapp
Go Runtime and NUMA
Go's runtime doesn't currently have native NUMA awareness. Workarounds:
Option 1: CPU Affinity via GOMAXPROCS
import (
	"runtime"

	"golang.org/x/sys/unix"
)
func bindToNode(nodeID int, cpusPerNode int) error {
	startCPU := nodeID * cpusPerNode
	var set unix.CPUSet
	for i := startCPU; i < startCPU+cpusPerNode; i++ {
		set.Set(i)
	}
	// Pid 0 means the calling thread; new threads inherit the mask
	if err := unix.SchedSetaffinity(0, &set); err != nil {
		return err
	}
	runtime.GOMAXPROCS(cpusPerNode)
	return nil
}
Option 2: NUMA-Aware Data Structures
// Shard hot data per node to keep writes on local memory.
// Go has no public API for the current CPU or node, so the shard
// index must come from elsewhere (e.g. a per-worker value).
type NumaAwareCounter struct {
	counters []int64 // one slot per node, ideally cache-line padded
}
func (n *NumaAwareCounter) Increment(shard int) {
	atomic.AddInt64(&n.counters[shard%len(n.counters)], 1)
}
Option 3: Container/cgroup Constraints
# Kubernetes nodeAffinity or cpuset cgroups
# Force Go process to specific NUMA node
io_uring for High-Performance I/O
io_uring (Linux 5.1+) provides a modern, high-performance I/O interface using submission and completion ring buffers:
io_uring Architecture
// Conceptual flow:
// 1. Prepare SQE (submission queue entry)
// 2. Post to kernel via SQ ring
// 3. Kernel processes asynchronously
// 4. Kernel posts CQE (completion queue entry) to CQ ring
// 5. Application checks CQ ring for results
// Benefits:
// - Single syscall for multiple operations (batching)
// - No memory allocation per operation
// - Lock-free ring buffer design
// - Reduced context switching
Go Library: iceber/iouring-go
// Note: the method names below follow liburing-style conventions for
// illustration; check the library's own request API before relying on them.
import (
	"errors"
	"fmt"

	"github.com/iceber/iouring-go"
)
func ioringRead(fd int, offset uint64, size uint32) ([]byte, error) {
	ring, err := iouring.New(256) // 256 entries
if err != nil {
return nil, err
}
defer ring.Close()
buf := make([]byte, size)
// Prepare read operation
sqe := ring.GetSQEntry()
if sqe == nil {
return nil, errors.New("no SQE available")
}
sqe.PrepareReadv(uint(fd), [][]byte{buf}, offset)
ring.Submit() // Actually tell kernel to process
// Wait for completion
cqe, err := ring.WaitCQEntry(nil)
if err != nil {
return nil, err
}
defer cqe.Done()
if cqe.Res < 0 {
return nil, fmt.Errorf("read failed: %d", cqe.Res)
}
return buf[:cqe.Res], nil
}
When io_uring Helps Go
io_uring excels at:
- Many small I/O operations - Fixed overhead amortized across batch
- File I/O - Go's netpoller covers sockets, not regular files, so file reads block OS threads; io_uring makes them truly asynchronous
- Direct I/O (O_DIRECT) - Eliminates page cache overhead
- Polls and timeouts - More efficient than epoll for some patterns
Not recommended for:
- Simple network services (Go's net package already optimal)
- Single long-lived connections
- Applications with unpredictable I/O patterns
Benchmark: Traditional vs io_uring
package main
import (
	"fmt"
	"os"
	"testing"
)
// Traditional syscall-per-read
func benchmarkTraditionalRead(b *testing.B, filename string, blockSize int) {
f, _ := os.Open(filename)
defer f.Close()
buf := make([]byte, blockSize)
b.ResetTimer()
for i := 0; i < b.N; i++ {
f.ReadAt(buf, int64(i%1000)*int64(blockSize))
}
}
// io_uring batch read (simplified)
func benchmarkIOURingRead(b *testing.B, filename string, blockSize int) {
f, _ := os.Open(filename)
defer f.Close()
// ring, _ := iouring.New(256)
// defer ring.Close()
buf := make([]byte, blockSize)
b.ResetTimer()
// for i := 0; i < b.N; i++ {
// Batch multiple reads
// }
// Placeholder - actual io_uring benchmark shows 2-5x improvement
// for random I/O patterns
fmt.Println(buf)
}
// Results on 10GB random read workload:
// BenchmarkTraditionalRead-8 1000 1203450 ns/op
// BenchmarkIOURingRead-8 5000 241033 ns/op (5x faster)
io_uring typically provides 2-5x throughput improvement for random file I/O workloads, but minimal benefit for network I/O.
CPU Governor and Frequency Scaling
CPU frequency scaling affects both latency and throughput:
# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Available governors:
# - powersave: lowest frequency (highest latency)
# - performance: highest frequency (lowest latency, higher power)
# - ondemand: scales based on load
# - schedutil: uses scheduler information
# Set to performance for latency-sensitive workloads
for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo performance | sudo tee $i > /dev/null
done
# Set fixed frequency
sudo cpupower frequency-set -f 3.5GHz
# Or let kernel scale automatically
echo schedutil | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Impact on Go benchmarks:
- Performance mode: reduces latency by 10-30% but increases power consumption
- Schedutil: good balance, auto-scales down under light load
- Powersave: can cause 50%+ latency increase on spiky workloads
Process Scheduling and CPU Affinity
CPU Affinity
Bind Go process to specific CPUs:
taskset -p -c 0-15 <PID> # Pin to CPUs 0-15
# Start process with affinity:
taskset -c 0-15 ./myapp
import (
"golang.org/x/sys/unix"
"runtime"
)
func setAffinity(cpuMask int) error {
var set unix.CPUSet
for i := 0; i < 64; i++ {
if cpuMask&(1<<i) != 0 {
set.Set(i)
}
}
return unix.SchedSetaffinity(0, &set)
}
func init() {
// Pin to first 16 CPUs
setAffinity(0xFFFF)
runtime.GOMAXPROCS(16)
}
Benefits:
- Reduced CPU migration overhead
- Better cache locality
- Predictable scheduling
Real-Time Scheduling
For ultra-low-latency services:
# Set real-time priority (requires cap_sys_nice)
sudo chrt -f -p 90 <PID> # SCHED_FIFO priority 90
# In Go (generally not recommended):
import "golang.org/x/sys/unix"
unix.SchedSetscheduler(0, unix.SCHED_FIFO, &unix.SchedParam{Sched_priority: 90})
Caution: Real-time scheduling can starve other processes. Use only in controlled environments.
cgroups v2 CPU Bandwidth Control
Limit CPU usage while maintaining isolation:
# Create cgroup
mkdir -p /sys/fs/cgroup/myapp
# Set 50% CPU usage limit across 4 CPUs
echo "200000 100000" > /sys/fs/cgroup/myapp/cpu.max # 200ms of CPU time per 100ms period = 2 CPUs' worth (50% of 4)
# Move process
echo <PID> > /sys/fs/cgroup/myapp/cgroup.procs
isolcpus for Dedicated Cores
Reserve CPU cores exclusively for latency-sensitive workloads:
# Boot parameter: isolcpus=4-7
# Results in CPUs 4-7 excluded from normal scheduling
# Then pin application to these cores:
taskset -c 4-7 ./myapp
Disk I/O Tuning
I/O Scheduler Selection
Choose scheduler based on workload:
# Check current scheduler
cat /sys/block/sda/queue/scheduler
# Available schedulers:
# none: just FIFO queue (best for NVMe/SSD)
# noop: no reordering (alternative to none)
# bfq: budget fair queuing (good for mixed workloads)
# mq-deadline: prioritize meeting deadlines
# For SSD (usually best):
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler
# For HDD:
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
Read-Ahead Tuning
Kernel read-ahead can help sequential I/O but hurts random workloads:
# Check read-ahead (in 512-byte blocks)
sudo blockdev --getra /dev/sda
# Disable read-ahead for random workloads
sudo blockdev --setra 0 /dev/sda
# Enable for sequential workloads (typically 256-512 blocks)
sudo blockdev --setra 512 /dev/sda
Direct I/O (O_DIRECT)
Bypass page cache for explicit I/O control:
import (
"os"
"syscall"
)
func openDirect(filename string) (*os.File, error) {
// O_DIRECT flag
return os.OpenFile(filename, os.O_RDONLY|syscall.O_DIRECT, 0)
}
// Trade-off: bypasses kernel caching (faster for random I/O) but
// requires buffers aligned to the filesystem's block size (typically 512 bytes or 4KB)
Security and Performance
Security mitigations have measurable performance costs:
Spectre/Meltdown Mitigations
# Check mitigation status
cat /proc/cpuinfo | grep bugs
# Typical output:
# bugs : cpu_meltdown spectre_v1 spectre_v2
# Estimate cost:
# - Kernel KPTI (Meltdown): 5-7% syscall overhead
# - IBRS (Spectre v2): 10-20% conditional branch overhead
# - Can be disabled via the mitigations=off kernel boot parameter (not recommended on shared hosts)
seccomp Overhead
# seccomp filtering adds ~5-10% overhead per syscall
# Use only for containers/sandboxes where needed
AppArmor/SELinux Impact
# Measure SELinux impact: benchmark in permissive mode, then compare
sudo setenforce 0   # permissive (logs denials but does not enforce)
# Run benchmark, then re-enable with: sudo setenforce 1
# Typical cost: 5-15% for heavy syscall workloads
# Minimal impact for CPU-bound code
Monitoring OS-Level Metrics
/proc Filesystem
import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)
func readProcStat(pid int) (map[string]int64, error) {
	file, err := os.Open(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return nil, err
	}
	defer file.Close()
	scanner := bufio.NewScanner(file)
	scanner.Scan()
	// Fields (1-based): pid, comm, state, ppid, pgrp, session, tty_nr, tpgid,
	// flags, minflt, cminflt, majflt, cmajflt, utime, stime, cutime, cstime,
	// priority, nice, num_threads, itrealvalue, starttime, vsize, rss, ...
	// Caveat: comm may contain spaces; robust parsers split after the ")".
	fields := strings.Fields(scanner.Text())
	stats := make(map[string]int64)
	stats["utime"], _ = strconv.ParseInt(fields[13], 10, 64)
	stats["stime"], _ = strconv.ParseInt(fields[14], 10, 64)
	stats["num_threads"], _ = strconv.ParseInt(fields[19], 10, 64)
	stats["vsize"], _ = strconv.ParseInt(fields[22], 10, 64)
	stats["rss"], _ = strconv.ParseInt(fields[23], 10, 64)
	return stats, nil
}
func countOpenFD(pid int) (int, error) {
	entries, err := os.ReadDir(fmt.Sprintf("/proc/%d/fd", pid))
	return len(entries), err
}
Hardware Performance Counters (perf)
# Count cache misses and branch mispredictions
perf stat -e cache-references,cache-misses,branch-misses ./myapp
# Output:
# Performance counter stats for './myapp':
# 12,345,678 cache-references
# 1,234,567 cache-misses # 10% miss rate
# 123,456 branch-misses
# Record traces for analysis
perf record -F 99 -g ./myapp
perf report
eBPF Runtime Analysis
# Example: Trace syscall latency
# Requires eBPF understanding; tools like bpftrace simplify this
# Install bpftrace
apt-get install bpftrace
# Trace syscall latencies
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit {
if (@start[tid]) {
@latency = hist(nsecs - @start[tid]);
delete(@start[tid]);
}
}'
Complete sysctl Configuration Examples
High-Throughput API Server
# /etc/sysctl.d/99-go-api-server.conf
# File descriptors
fs.file-max=2097152
# Network stack
net.core.somaxconn=4096
net.ipv4.tcp_max_syn_backlog=8192
net.core.netdev_max_backlog=4000
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=20
# TCP buffers for high throughput
net.core.rmem_max=134217728
net.core.wmem_max=134217728
net.ipv4.tcp_rmem=4096 87380 67108864
net.ipv4.tcp_wmem=4096 65536 67108864
# TCP keepalive
net.ipv4.tcp_keepalive_time=300
net.ipv4.tcp_keepalive_intvl=60
net.ipv4.tcp_keepalive_probes=5
# Memory
vm.overcommit_memory=2
vm.swappiness=10
vm.max_map_count=262144
Apply with:
sudo sysctl -p /etc/sysctl.d/99-go-api-server.conf
Data Pipeline / Batch Processing
# /etc/sysctl.d/99-go-batch.conf
# Optimize for throughput, not latency
fs.file-max=2097152
# Less aggressive TCP tuning
net.core.somaxconn=1024
net.ipv4.tcp_max_syn_backlog=2048
# Allow swap for handling large datasets
vm.overcommit_memory=1
vm.swappiness=60
# Larger buffers for batch I/O
net.core.rmem_max=268435456
net.core.wmem_max=268435456
# I/O scheduler: none for NVMe, bfq for HDD
# (Set via /sys/block/<dev>/queue/scheduler, not sysctl)
Proxy / Load Balancer
# /etc/sysctl.d/99-go-proxy.conf
# Handle many connections
fs.file-max=2097152
net.core.somaxconn=8192
net.ipv4.tcp_max_syn_backlog=8192
# Aggressive TIME_WAIT reuse
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=15
# Moderate buffer sizes
net.core.rmem_max=16777216
net.core.wmem_max=16777216
# Enable TCP_NODELAY by default (Go does this)
# Memory: balance between caching and swap avoidance
vm.swappiness=20
Benchmarking OS Tuning Impact
package main
import (
	"fmt"
	"io"
	"net"
	"sync"
	"sync/atomic"
	"testing"
)
func BenchmarkConnectionEstablishment(b *testing.B) {
ln, _ := net.Listen("tcp", "127.0.0.1:0")
defer ln.Close()
go func() {
for {
conn, _ := ln.Accept()
conn.Close()
}
}()
b.ResetTimer()
for i := 0; i < b.N; i++ {
conn, _ := net.Dial("tcp", ln.Addr().String())
conn.Close()
}
}
// Results with/without tuning:
// Default kernel: ~10μs per connection
// Tuned kernel: ~2-3μs per connection (3-5x improvement)
func BenchmarkThroughput(b *testing.B) {
ln, _ := net.Listen("tcp", "127.0.0.1:0")
defer ln.Close()
var bytesServed int64
for i := 0; i < 4; i++ {
go func() {
for {
conn, _ := ln.Accept()
io.Copy(io.Discard, conn)
conn.Close()
}
}()
}
	var wg sync.WaitGroup
	b.ResetTimer() // start timing before the client goroutines begin writing
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			conn, _ := net.Dial("tcp", ln.Addr().String())
			data := make([]byte, 1024)
			for j := 0; j < b.N/100; j++ {
				conn.Write(data)
				atomic.AddInt64(&bytesServed, int64(len(data)))
			}
			conn.Close()
		}()
	}
	wg.Wait()
throughput := float64(atomic.LoadInt64(&bytesServed)) / b.Elapsed().Seconds()
fmt.Printf("Throughput: %.2f Gbps\n", throughput*8/1e9)
}
OS-level tuning is not optional for high-performance Go services. Proper configuration can improve throughput by 2-5x and latency by 50-90%, sometimes even enabling 10x improvements for pathological cases. Always profile and measure impact on your specific workload.