Go Performance Guide
Go Internals

Syscall and OS Integration

How Go handles system calls, the difference between syscall and rawSyscall, M parking during blocking calls, the netpoller for async I/O, and CGO threading implications.

Introduction

The Go runtime bridges the gap between your high-level goroutines and the operating system through a sophisticated syscall integration layer. Understanding how Go makes system calls, how it prevents thread exhaustion, and how it implements asynchronous I/O is critical for writing performant concurrent programs.

In this article, we'll dissect:

  • Direct syscall mechanisms (avoiding libc)
  • The distinction between syscall.Syscall and syscall.RawSyscall
  • The "M handoff" mechanism for non-blocking goroutine scheduling
  • The netpoller architecture for async network I/O
  • CGO threading implications
  • Performance considerations and optimization strategies

How Go Makes System Calls

Unlike languages like Python or Ruby that route syscalls through libc, Go typically makes system calls directly using syscall instructions, bypassing the C standard library entirely (except on macOS/iOS where dynamic linking requirements force libc usage).

Direct vs. Libc Syscalls

Direct syscalls (most platforms):

  • Go emits syscall instructions (e.g., SYSCALL on x86-64, SVC on ARM64)
  • Bypasses libc overhead
  • Syscalls are functions like syscall.Open(), syscall.Read(), syscall.Write()
  • Architecture-specific code in runtime/sys_*.go and syscall/zsyscall_*.go

Example architecture breakdown:

User Code

syscall.Open() [Go implementation]

func open(path string, mode int, perm uint32) (fd int, err error) {
    // syscall.Syscall brackets the trap with entersyscall/exitsyscall
    r0, _, e1 := syscall.Syscall(syscall.SYS_OPEN, uintptr(unsafe.Pointer(...)), uintptr(mode), uintptr(perm))
    if e1 != 0 {
        err = e1
    }
    return int(r0), err
}

src/syscall/asm_linux_amd64.s [assembly]
    MOVQ    trap+0(FP), AX  // syscall number into AX
    SYSCALL                 // CPU instruction traps into the kernel

OS Kernel

File system layer

libc syscalls (macOS/iOS):

User Code

syscall.Open()

runtime libcCall trampoline → libc open() → kernel trap

Why Not Always Use Libc?

  1. Overhead: Function call, PLT relocation, library initialization
  2. Compatibility: Direct syscalls don't depend on glibc version
  3. Performance: Measurable difference in tight loops with many syscalls
  4. Control: Go can implement custom error handling and path strategies

Syscall vs. RawSyscall: The Critical Difference

The syscall package exposes two fundamental syscall entry points, and the difference is crucial for Go's scheduling model.

syscall.Syscall (with scheduler notification)

// From src/syscall/syscall.go
func Syscall(trap, a1, a2, a3 uintptr) (r1, r2 uintptr, err Errno)

// Pseudo-code behavior:
// 1. Call runtime.entersyscall()      [notify scheduler]
// 2. Execute the actual syscall
// 3. Call runtime.exitsyscall()       [scheduler recovery logic]

What entersyscall does:

entersyscall() {
    g := getg()                // Get current goroutine
    g.m.locks++                // Prevent preemption
    g.syscallsp = getcallersp()
    g.syscallpc = getcallerpc()

    atomic.Store(&g.atomicstatus, Gsyscall)  // Mark G as in syscall

    if atomic.Load(&sched.gcwaiting) != 0 {
        // GC is waiting; let's give up our P
        atomic.Xchg(&g.m.p.ptr().status, Psyscall)
        handoffp(releasep())     // Hand P off
    }

    g.m.syscalltick = g.m.p.ptr().syscalltick
    g.m.locks--
}

Key implications:

  • The P (processor) is detached from the M (OS thread)
  • The P can be given to another M to run other goroutines
  • The current M can now block without stalling other goroutines

What exitsyscall does:

exitsyscall() {
    g := getg()

    oldp := g.m.oldp.ptr()
    if oldp != nil && atomic.Load(&oldp.status) == Psyscall &&
       atomic.Cas(&oldp.status, Psyscall, Prunning) {
        // Successfully reacquired P
        g.m.p.set(oldp)
        atomic.Store(&g.atomicstatus, Grunning)
        g.syscallsp = 0
        return
    }

    // Could not reacquire P; goroutine must wait
    mcall(exitsyscallSlow)  // Park M, put G in global queue
}

exitsyscallSlow() {
    // Put G in global run queue
    // Park M in idle queue
    // Schedule() will pick up work when M is unparked
}

syscall.RawSyscall (no scheduler notification)

// From src/syscall/syscall.go
func RawSyscall(trap, a1, a2, a3 uintptr) (r1, r2 uintptr, err Errno)

// Pseudo-code behavior:
// 1. Execute the syscall directly
// 2. NO entersyscall/exitsyscall

When to use RawSyscall:

  • Very fast syscalls that are guaranteed never to block (e.g., getpid(), gettimeofday())
  • When the entersyscall/exitsyscall bookkeeping of syscall.Syscall would cost more than the syscall itself

Never use RawSyscall for blocking operations like read(), write(), or accept() — this blocks the entire OS thread and prevents the scheduler from running other goroutines.

Benchmark: Syscall vs. RawSyscall

package main

import (
    "syscall"
    "testing"
    "unsafe"
)

// Benchmark reading from /dev/null (fast, non-blocking path)
func BenchmarkSyscall(b *testing.B) {
    fd, _ := syscall.Open("/dev/null", syscall.O_RDONLY, 0)
    defer syscall.Close(fd)

    buf := make([]byte, 1)
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        syscall.Read(fd, buf)
    }
}

// Raw syscall (unsafe if read blocks, but fast for /dev/null)
func BenchmarkRawSyscall(b *testing.B) {
    path, _ := syscall.BytePtrFromString("/dev/null")
    fd, _, _ := syscall.RawSyscall(
        syscall.SYS_OPEN,
        uintptr(unsafe.Pointer(path)),
        uintptr(syscall.O_RDONLY),
        0,
    )
    defer syscall.RawSyscall(syscall.SYS_CLOSE, fd, 0, 0)

    buf := make([]byte, 1)
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        syscall.RawSyscall(
            syscall.SYS_READ,
            fd,
            uintptr(unsafe.Pointer(&buf[0])),
            1,
        )
    }
}

// Illustrative results on Linux x86-64 (exact numbers vary by kernel and CPU):
// BenchmarkSyscall-12          2000000    500 ns/op
// BenchmarkRawSyscall-12       3000000    350 ns/op

The M Handoff Mechanism

The cornerstone of Go's scalability is the M (OS thread) handoff: when an M blocks in a syscall, its P is handed to another M (or a new one is created), allowing other goroutines to run.

Visual Overview

Before Syscall:
┌────────────────┬────────────────┬────────────────┐
│    M1 (busy)   │    M2 (idle)   │    M3 (idle)   │
├────────────────┼────────────────┼────────────────┤
│    P1          │    idle        │    idle        │
├────────────────┼────────────────┼────────────────┤
│ G1, G2, G3     │      -         │      -         │
└────────────────┴────────────────┴────────────────┘

M1 runs entersyscall() for G1:
┌────────────────┬────────────────┬────────────────┐
│  M1 (blocked)  │    M2 (busy)   │    M3 (idle)   │
├────────────────┼────────────────┼────────────────┤
│     -          │    P1          │    idle        │
├────────────────┼────────────────┼────────────────┤
│ G1 (Gsyscall)  │     G2, G3     │      -         │
└────────────────┴────────────────┴────────────────┘

  [blocked on syscall for G1]

  [P1 handed off to M2]

  [M2 runs other goroutines]

Handoff Algorithm

// Simplified from runtime/proc.go

func entersyscall_handoff(gp *g) {
    // Check if we should hand off P
    if atomic.Load(&sched.gcwaiting) != 0 {
        // GC needs us to give up P
        mp := acquirem()  // Acquire current M
        mp.p.ptr().status = Psyscall
        handoffp(releasep())  // Hand off P to another M
        releasem(mp)
    }
}

func handoffp(pp *p) {
    // If there is runnable work, this P must keep running:
    // wake an idle M, or create a new one if none is parked.
    if pp.runqsize != 0 || sched.runqsize != 0 {
        startm(pp, false)  // Reuses an idle M, or calls newm()
        return             // (creation is capped by SetMaxThreads)
    }

    // No runnable work for this P; put it on the idle list.
    pidleput(pp)
}

M Limit: debug.SetMaxThreads

Go caps the number of OS threads the runtime may create via runtime/debug.SetMaxThreads(); if the program would exceed the limit, the runtime crashes it rather than exhaust the machine:

// From runtime/debug (default limit: 10,000 OS threads)
func SetMaxThreads(threads int) int {
    return setMaxThreads(threads)
}

Why this limit?

  • Per-thread overhead: each OS thread reserves stack memory (several MB of virtual address space by default on Linux)
  • Context-switch overhead scales with thread count
  • Runaway thread creation → resource exhaustion
  • Under heavy blocking-syscall load, the runtime would otherwise keep spawning threads

Note: This is independent of GOMAXPROCS, which controls the P count (logical CPUs).


The Netpoller: Asynchronous I/O

While blocking syscalls like read() and write() on regular files must block an M, Go provides asynchronous I/O for network operations through the netpoller.

Why Netpoller?

Network I/O is fundamentally different from file I/O:

  • Network: Packets arrive asynchronously; multiplexing is natural (epoll/kqueue)
  • Files: Sequential access model; kernel doesn't provide efficient async I/O (AIO is complex)

The netpoller allows thousands of goroutines to wait on I/O without consuming OS threads.

Architecture: epoll on Linux

Go User Code:

conn.Read(buf)  [net.Conn interface]

(*netFD).Read()  [Internal file descriptor wrapper]

pollDesc.waitRead()  [Register for read interest]

netpoll() [Check for ready I/O, park goroutine if needed]

epoll_wait(epfd, events, maxevents, timeout)  [Linux syscall]

Kernel [epoll multiplexer]

[When packet arrives, event generated]

Ready events returned to netpoll()

Goroutines unparked, resumed

pollDesc Structure

// Simplified from src/runtime/netpoll.go
type pollDesc struct {
    link *pollDesc           // Linked list of poll descriptors
    fd   uintptr             // OS file descriptor

    // Goroutines waiting on this fd
    rg   uintptr             // G waiting for read; 0 if none
    wg   uintptr             // G waiting for write; 0 if none

    // Deadline handling
    rt   timer               // Read deadline timer
    wt   timer               // Write deadline timer

    // Poll state
    user  uint32             // User-settable data (opaque)
    rseq  uint32             // Read sequence number
    wseq  uint32             // Write sequence number
}

// Illustrative event shape; the real code reads the platform's
// native epollevent/kevent structures directly
type netpollEvent struct {
    fd    uintptr            // Which fd became ready
    pd    *pollDesc          // Descriptor for that fd
    mask  uint32             // POLLIN | POLLOUT
}

netpoll() Integration with Scheduler

The netpoller is invoked during findrunnable() when the scheduler has no work:

// From src/runtime/proc.go, simplified

func findrunnable() (gp *g, inheritTime bool) {
    // Try to find work locally
    if gp := runqget(pp); gp != nil {
        return gp, false
    }

    // Check the netpoller for goroutines whose network I/O is ready
    if list := netpoll(0); !list.empty() {
        gp := list.pop()
        injectglist(&list)  // Remaining Gs go to the run queue
        return gp, false
    }

    // Check work-stealing from other Ps
    // ...

    // If still nothing, park M in idle queue
    // ...
}

func netpoll(delay int64) gList {
    // Wait up to `delay` for events (0 = non-blocking poll)
    n := epollwait(epfd, &events[0], int32(len(events)), waitms)

    var toRun gList
    for i := int32(0); i < n; i++ {
        pd := events[i].pollDesc

        // Unpark the goroutine waiting to read on this fd
        if events[i].mask&(POLLIN|POLLHUP|POLLERR) != 0 {
            if rg := netpollunblock(pd, 'r'); rg != nil {
                toRun.push(rg)  // Now Grunnable; the scheduler runs it
            }
        }
        // Similarly for write readiness
    }
    return toRun
}

Deadline/Timeout Mechanism

Go timers and context deadlines integrate with netpoller:

// User code with a timeout: derive a deadline from the context
// and apply it to the connection explicitly
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

if deadline, ok := ctx.Deadline(); ok {
    conn.SetReadDeadline(deadline)
}
conn.Read(buf)

Internally:

func (fd *netFD) SetReadDeadline(t time.Time) error {
    // Arms a runtime timer that will interrupt a pending read
    return fd.pfd.SetReadDeadline(t)
}

// Simplified: the runtime side arms a timer on the pollDesc
func runtime_pollSetDeadline(pd *pollDesc, d int64, mode int) {
    // When the timer fires, it unparks the goroutine waiting on pd.
    // The resumed Read/Write returns os.ErrDeadlineExceeded,
    // a net.Error whose Timeout() method reports true.
}

Netpoller Platforms

Platform          Multiplexer                    Scalability
──────────────────────────────────────────────────────────────
Linux             epoll                          effectively unlimited FDs
macOS/BSD         kqueue                         effectively unlimited FDs
Windows           IOCP (I/O Completion Ports)    effectively unlimited
Solaris/illumos   event ports                    effectively unlimited
AIX               poll                           degrades with FD count

File I/O: Why It's Blocking

File I/O is NOT async — there's no netpoller for regular files. This is a fundamental OS limitation:

  • POSIX aio_read() / aio_write() exist but are poorly supported and rarely used
  • Linux io_uring is newer (the Go runtime does not use it as of this writing)
  • Most syscalls on regular files complete quickly anyway, since data is often in the page cache

When you call os.Open() or os.ReadFile():

func ReadFile(filename string) ([]byte, error) {
    f, err := os.Open(filename)  // open(2) syscall, blocks the M
    if err != nil {
        return nil, err
    }
    defer f.Close()              // close(2), also blocks the M

    return io.ReadAll(f)         // read(2) syscalls, each blocks the M
}

Each read() syscall blocks an OS thread. If thousands of goroutines do file I/O concurrently, the runtime may spin up thousands of OS threads, bounded only by the SetMaxThreads limit.

Workaround: Goroutine Pool for File I/O

package main

import (
    "os"
    "sync"
)

type FileIOPool struct {
    workers int
    tasks   chan FileTask
    wg      sync.WaitGroup
}

type FileTask struct {
    path   string
    result chan []byte
    err    chan error
}

func NewFileIOPool(workers int) *FileIOPool {
    p := &FileIOPool{
        workers: workers,
        tasks:   make(chan FileTask, 100),
    }

    for i := 0; i < workers; i++ {
        p.wg.Add(1)
        go p.worker()
    }

    return p
}

func (p *FileIOPool) worker() {
    defer p.wg.Done()

    for task := range p.tasks {
        data, err := os.ReadFile(task.path)
        if err != nil {
            task.err <- err
        } else {
            task.result <- data
        }
    }
}

func (p *FileIOPool) ReadFile(path string) ([]byte, error) {
    result := make(chan []byte)
    errChan := make(chan error)

    p.tasks <- FileTask{path, result, errChan}

    select {
    case data := <-result:
        return data, nil
    case err := <-errChan:
        return nil, err
    }
}

// Usage:
// pool := NewFileIOPool(10)  // At most 10 Ms blocked in file I/O at once
// data, _ := pool.ReadFile("/path/to/file")

This pattern bounds the number of OS threads doing file I/O.


CGO and Threading Implications

When you call C code from Go via cgo, the call is treated similarly to a syscall:

entersyscall for CGO

/*
static int some_c_function(int x) { return x * 2; }
*/
import "C"

func main() {
    // Each cgo call marshals its arguments, then roughly:
    // 1. entersyscall()  [P is released for other goroutines]
    // 2. the C function runs on the M's system stack
    // 3. exitsyscall()   [P reacquired, or the M parks]
    result := C.some_c_function(42)
    _ = result
}

Thread Affinity: LockOSThread

Some C libraries require calls on a specific OS thread (e.g., OpenGL, some database drivers). Use runtime.LockOSThread():

package main

import "runtime"

func InitOpenGL() {
    // Must be called from the goroutine that will make the GL calls
    runtime.LockOSThread()
    // C.glInit()  // Must run on this same thread
    // All subsequent GL calls must come from this goroutine
}

func main() {
    go InitOpenGL()
    // The InitOpenGL goroutine is now permanently bound to its M
}

Cost: That M can never run other goroutines. For N thread-affine goroutines, N OS threads are needed.

C Calling Back to Go

When C calls back into Go, the runtime creates a new M/G:

/*
extern int go_handler(int x);  // implemented in Go below

// The C function that invokes the callback lives in a separate .c
// file, because files containing //export may only declare (not
// define) C functions in their preamble:
//
//   void c_library_init(void) { int result = go_handler(42); }
void c_library_init(void);
*/
import "C"

//export go_handler
func go_handler(x int) int {
    return x * 2
}

func main() {
    C.c_library_init()
}

When the callback is invoked from C:

  1. If the calling thread is not a Go-created M, the runtime attaches an M to it (needm)
  2. A goroutine is set up on that M to execute the Go code
  3. go_handler runs to completion
  4. The M is released back to the runtime (dropm) when the callback returns

Signal Handling

Go manages OS signals through a dedicated signal stack and sigtramp trampoline:

User Code

Signal arrives (SIGTERM, SIGINT, etc.)

Kernel [Signal handling]

sigtramp (assembly) [runtime/sys_linux_amd64.s]

Signal handler (Go function)

signal.Notify channels

User signal receivers

Key Points

// Register signal handler
sigs := make(chan os.Signal, 1)
signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

go func() {
    sig := <-sigs  // Blocks until signal
    fmt.Println("Received signal:", sig)
}()

Signals are asynchronous but delivered through goroutines, maintaining Go's concurrency model.


Performance Implications and Optimization Tips

1. Many Blocking Syscalls = Thread Explosion

// Risky: unbounded concurrency; blocking phases such as cgo DNS
// resolution can each tie up an OS thread
func FetchManyUrls(urls []string) {
    var wg sync.WaitGroup
    for _, url := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            resp, err := http.Get(u)
            if err != nil {
                return
            }
            resp.Body.Close()
        }(url)
    }
    wg.Wait()
}

// Better: reuse connections and bound in-flight requests
client := &http.Client{
    Transport: &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 10,
    },
}
// Pair this with a worker pool or semaphore so only a bounded
// number of requests are in flight at once

2. Prefer net Package Over Raw Sockets

// Avoid (raw fds: every Read blocks an OS thread)
fd, _ := syscall.Socket(syscall.AF_INET, syscall.SOCK_STREAM, 0)
syscall.Connect(fd, sa)  // sa is a syscall.Sockaddr
syscall.Read(fd, buf)

// Prefer (integrated with netpoller)
conn, _ := net.Dial("tcp", "example.com:80")
conn.Read(buf)

The net package wraps syscalls with netpoller integration.

3. Batch Syscalls

// Bad: one stat(2) syscall per file
for _, path := range paths {
    os.Stat(path)
}

// Better: batch where the kernel allows it; for example, a single
// os.ReadDir call fetches directory entries in batches via getdents64(2)

4. Be Cautious with CGO in Hot Paths

// Avoid in tight loops
func ProcessMany(items []Item) {
    for _, item := range items {
        C.process_item(item)  // entersyscall per item
    }
}

// Better: Batch items
func ProcessBatch(items []Item) {
    // Copy items once, process in C
    C.process_many_items(...)
}

5. Monitor Thread Count

package main

import (
    "fmt"
    "runtime"
    "runtime/pprof"
    "time"
)

func main() {
    ticker := time.NewTicker(1 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        fmt.Printf("Goroutines: %d, OS threads created: %d\n",
            runtime.NumGoroutine(),
            pprof.Lookup("threadcreate").Count(), // threads created so far
        )
    }
}

Benchmarks: Syscall Overhead

package main

import (
    "io"
    "net"
    "os"
    "testing"
)
// Benchmark: Writing to network socket (with netpoller)
func BenchmarkNetWrite(b *testing.B) {
    ln, _ := net.Listen("tcp", "127.0.0.1:0")
    defer ln.Close()

    go func() {
        for {
            conn, err := ln.Accept()
            if err != nil {
                return
            }
            go io.Copy(io.Discard, conn)
        }
    }()

    conn, _ := net.Dial("tcp", ln.Addr().String())
    defer conn.Close()

    data := []byte("Hello")
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        conn.Write(data)
    }
}

// Illustrative: on the order of hundreds of ns/op (varies by system)

// Benchmark: Writing to file (blocks M)
func BenchmarkFileWrite(b *testing.B) {
    f, _ := os.CreateTemp(b.TempDir(), "bench")
    defer f.Close()

    data := []byte("Hello")
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        f.Write(data)
    }
}

// Illustrative: typically slower, since each write(2) can block the M

Summary

Go's syscall integration is a marvel of engineering:

  1. Direct syscalls bypass libc overhead
  2. entersyscall/exitsyscall allows goroutines to block without stalling others via M handoff
  3. The netpoller enables thousands of concurrent network connections on a few OS threads
  4. File I/O remains blocking (use worker pools for heavy concurrent file access)
  5. CGO requires careful threading consideration (LockOSThread for thread-affine code)
  6. Signals are delivered asynchronously through goroutines

Understanding these mechanisms helps you write scalable, efficient concurrent Go programs.
