CGO Performance
Understand CGO overhead, when to use C libraries, and strategies for minimizing the performance cost of crossing language boundaries.
What is CGO?
CGO is Go's foreign function interface (FFI) for calling C code and vice versa. It enables integrating high-performance C libraries directly into Go applications.
```go
package main

/*
// C code in this comment block is compiled by cgo.
int addFromC(int a, int b) { return a + b; }
*/
import "C"

import "fmt"

func main() {
	result := C.addFromC(5, 3)
	fmt.Println(result)
}
```

(Go functions can also be exported for C to call via an //export comment.) While CGO is powerful, it carries a significant performance cost that must be understood when designing your application's architecture.
The Overhead Problem: CGO Call Cost
A CGO call is dramatically slower than a native Go function call.
Typical costs:
- Pure Go function call: 1-5 nanoseconds
- CGO call: 100-200 nanoseconds
This is a 20-200x slowdown for the call itself, before any computation happens.
Why is CGO So Slow?
The overhead comes from several sources:
1. Goroutine to OS Thread Transition
Go's goroutines are lightweight user-space threads, while C functions require OS threads. CGO must transition from a goroutine to an OS thread:
```
Goroutine → Lock M (machine) →
Transition to OS thread → Execute C code →
Transition back to Goroutine → Unlock M
```

This involves context switching and synchronization overhead.
2. Stack Switching
Go goroutines use small, growable stacks (the runtime resizes them on demand), while C expects a conventional fixed-size stack. CGO must switch between them:
```go
func callC() {
	// Save goroutine stack state
	// Switch to the OS thread's C stack (fixed size)
	// Call the C function
	// Switch back to the goroutine stack
	// Restore goroutine state
}
```

3. Signal Mask Manipulation
Go uses signals for runtime functions (GC, goroutine scheduling). Before calling C code, Go must:
- Save the signal mask
- Reset to a safe mask
- Restore after C returns
4. Thread Pinning and Mutex Operations
The calling goroutine is pinned to an OS thread during the CGO call. This involves lock acquisition and release, adding synchronization overhead.
5. Type Conversion and Memory Marshaling
Arguments and return values must be converted from Go types to C types:
```go
// Each of these causes memory allocation and copying
s := "hello"
cStr := C.CString(s) // Allocates C memory, copies the string
defer C.free(unsafe.Pointer(cStr))

// Passing a Go slice to C requires conversion
goSlice := []int64{1, 2, 3}
cArray := (*C.longlong)(unsafe.Pointer(&goSlice[0])) // Unsafe conversion
```

Benchmark: CGO Overhead
Let's measure the actual cost:
```go
package main

/*
int add(int a, int b) { return a + b; }
*/
import "C"

import (
	"testing"
	"unsafe"
)

func addGo(a, b int) int {
	return a + b
}

func BenchmarkCGOCall(b *testing.B) {
	for i := 0; i < b.N; i++ {
		C.add(C.int(i), C.int(i+1))
	}
}

func BenchmarkGoCall(b *testing.B) {
	for i := 0; i < b.N; i++ {
		addGo(i, i+1)
	}
}

// More realistic: converting strings
func BenchmarkCGOStringConversion(b *testing.B) {
	input := "hello world"
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		cStr := C.CString(input) // allocate + copy stands in for a real C call
		C.free(unsafe.Pointer(cStr))
	}
}

func BenchmarkGoStringUsage(b *testing.B) {
	input := "hello world"
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// Native Go string operations
		_ = len(input)
		_ = input[0]
	}
}
```

(Note: cgo is not supported in _test.go files, so in a real project the import "C" block and its preamble must live in a regular package file.)

Expected Results
```
BenchmarkCGOCall-8                10000000     125 ns/op    0 B/op   0 allocs/op
BenchmarkGoCall-8               1000000000     0.5 ns/op    0 B/op   0 allocs/op
BenchmarkCGOStringConversion-8     2000000     650 ns/op   16 B/op   1 allocs/op
BenchmarkGoStringUsage-8        1000000000     0.8 ns/op    0 B/op   0 allocs/op
```

The CGO overhead dominates: the pure call cost is 250x, and with memory conversion it is roughly 800x.
When CGO is Worth the Cost
Despite the overhead, CGO is valuable in specific scenarios:
1. Leveraging Highly Optimized C Libraries
Some libraries are so efficient that the CGO overhead is negligible compared to the computation:
OpenSSL for Cryptography:
```go
func encryptAES(key, plaintext []byte) ([]byte, error) {
	// OpenSSL call takes 1-5 microseconds
	// CGO overhead: ~0.125 microseconds
	// Total: a few percent overhead
	// (Note: Go's crypto/aes also uses hardware AES instructions on
	// common platforms, so benchmark before assuming the C library wins.)
	return nil, nil // illustrative sketch
}
```

When the C function takes 1,000+ nanoseconds, the 125 ns CGO overhead is only about 10% of execution time.
2. SQLite Database Engine
SQLite is a compact, well-tested database. Using it via CGO is often faster than pure Go implementations:
```go
// Using CGO sqlite3
import _ "github.com/mattn/go-sqlite3"

db, _ := sql.Open("sqlite3", ":memory:")
db.Exec("INSERT INTO users VALUES (?)", 1) // ~10 microseconds

// Using pure Go sqlite (modernc.org/sqlite)
import _ "modernc.org/sqlite"

db, _ := sql.Open("sqlite", "file:memdb")
db.Exec("INSERT INTO users VALUES (?)", 1) // ~50 microseconds
```

The CGO overhead is about 1 microsecond out of 10 total, but the C library is fundamentally faster.
3. Image Processing and Codecs
libjpeg for JPEG decoding:
```
// C libjpeg:    10-50 milliseconds for a large image
// Pure Go JPEG: 50-200 milliseconds
// CGO overhead: ~0.000125 milliseconds (125 ns - a rounding error)
```

The computation time dominates, making CGO overhead negligible.
4. Signal Processing and SIMD
C libraries with SIMD optimizations can provide 4-8x speedup:
```
// C library with AVX-512: 100 ns per operation
// Pure Go:                400 ns per operation
// With 125 ns CGO overhead: 225 ns total - still ~1.8x faster overall
```

Note how the overhead eats into the raw 4x compute advantage: the net win shrinks to under 2x per call, which is another argument for batching.

When to Avoid CGO
1. Thin Wrappers Around Simple Operations
```go
// ✗ Bad - CGO overhead exceeds the computation
/*
int c_min(int a, int b) { return a < b ? a : b; }
*/
import "C"

func minViaC(a, b int) int {
	return int(C.c_min(C.int(a), C.int(b)))
}

// Calling this 1 billion times:
// 125 ns CGO × 1B = 125 seconds spent just on call overhead!
// Pure Go: 0.5 ns × 1B = 0.5 seconds
```

2. Calling CGO in Hot Loops
```go
// ✗ Bad
func processMillions(data []int) {
	for _, v := range data {
		result := C.ProcessValue(C.int(v)) // one CGO call per element!
		_ = result
	}
}

// ✓ Good
func processMillions(data []int) {
	values := make([]C.int, len(data))
	for i, v := range data {
		values[i] = C.int(v)
	}
	C.ProcessArray((*C.int)(unsafe.Pointer(&values[0])), C.int(len(values)))
}
```

3. Type Conversions in Critical Paths
```go
// ✗ Bad - a string allocation and a CGO call per iteration
for i := 0; i < 1000000; i++ {
	s := fmt.Sprintf("item_%d", i)
	cStr := C.CString(s)
	C.ProcessString(cStr) // string allocation + CGO call!
	C.free(unsafe.Pointer(cStr))
}

// ✓ Better - convert everything, then make one CGO call
cStrs := make([]*C.char, 1000000)
for i := 0; i < 1000000; i++ {
	s := fmt.Sprintf("item_%d", i)
	cStrs[i] = C.CString(s)
}
C.ProcessStringArray((**C.char)(unsafe.Pointer(&cStrs[0])), C.int(len(cStrs)))
for _, cStr := range cStrs {
	C.free(unsafe.Pointer(cStr))
}
```

Strategy 1: Batching CGO Calls
The key to minimizing CGO overhead is amortizing the cost across multiple operations.
Example: Processing Data in Batches
```go
// ✗ Poor - one CGO call per element
for _, item := range items {
	cStr := C.CString(item)
	C.ProcessItem(cStr)
	C.free(unsafe.Pointer(cStr))
}

// ✓ Better - one CGO call for all elements
batch := make([]*C.char, 0, len(items))
for _, item := range items {
	batch = append(batch, C.CString(item))
}
defer func() {
	for _, p := range batch {
		C.free(unsafe.Pointer(p))
	}
}()
C.ProcessBatch((**C.char)(unsafe.Pointer(&batch[0])), C.int(len(batch)))
```

Benchmark: Batching vs Per-Item
```go
func BenchmarkUnbatched(b *testing.B) {
	items := make([]string, 1000)
	for i := range items {
		items[i] = fmt.Sprintf("item_%d", i)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for _, item := range items {
			cStr := C.CString(item)
			C.ProcessString(cStr)
			C.free(unsafe.Pointer(cStr)) // don't leak the C copies
		}
	}
}

func BenchmarkBatched(b *testing.B) {
	items := make([]string, 1000)
	for i := range items {
		items[i] = fmt.Sprintf("item_%d", i)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		batch := make([]*C.char, len(items))
		for j, item := range items {
			batch[j] = C.CString(item)
		}
		C.ProcessBatch((**C.char)(unsafe.Pointer(&batch[0])), C.int(len(items)))
		for _, p := range batch {
			C.free(unsafe.Pointer(p))
		}
	}
}
```

Results show batching reduces the per-item cost from roughly 2000 ns to 200 ns (a 10x improvement).
Strategy 2: Memory Management in CGO
Proper memory management is critical for both safety and performance.
Allocating Memory
```go
// Go memory - managed by the garbage collector
goMem := make([]byte, 1024)

// C memory - must be freed manually
cMem := C.malloc(1024)
defer C.free(cMem)

// Passing Go memory to C requires careful handling:
// Go slices are managed by the GC, C memory is not.
```

The Cgo Pointer Rules
Go enforces strict pointer rules to prevent memory corruption:
- Go code may pass a Go pointer to C, provided the memory it points to contains no Go pointers
- C code may not keep a Go pointer after the function returns
- C code may not store Go pointers in C memory
```go
// ✓ Safe - C uses the pointer only for the duration of the call
var goArray [100]int
C.ProcessArray((*C.int)(unsafe.Pointer(&goArray[0])))

// ✗ Unsafe - if the C side retains this pointer past the call,
// it violates the cgo pointer rules; the runtime's cgo pointer
// checker panics on the violations it can detect
var goValue int
C.StorePointer(unsafe.Pointer(&goValue))
```

Avoiding Unnecessary Allocations
```go
// ✗ Allocates and frees a new C string on every call
func process(s string) error {
	cStr := C.CString(s)
	defer C.free(unsafe.Pointer(cStr))
	return checkError(C.DoSomething(cStr))
}

// ✓ Reuse one C buffer across calls (assumes every string fits in maxLen)
func processMany(strings []string, maxLen int) error {
	buf := (*C.char)(C.malloc(C.size_t(maxLen + 1)))
	defer C.free(unsafe.Pointer(buf))
	dst := unsafe.Slice((*byte)(unsafe.Pointer(buf)), maxLen+1)
	for _, s := range strings {
		n := copy(dst, s) // copy the bytes plus a NUL terminator
		dst[n] = 0
		if err := checkError(C.DoSomething(buf)); err != nil {
			return err
		}
	}
	return nil
}
```

Strategy 3: Build Configuration and Cross-Compilation
CGO comes with significant build complexity.
Build Time Overhead
```shell
# Pure Go - fast
go build
# ~0.5 seconds

# CGO - slow (requires a C compiler)
go build
# ~5-10 seconds
```

Cross-Compilation

```shell
# Pure Go - cross-compilation is just environment variables
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build

# CGO - cannot cross-compile easily
GOOS=linux GOARCH=amd64 go build # requires a Linux C toolchain!
```

Conditional CGO
```go
//go:build cgo
// +build cgo

package main

import "C"

// This file is only compiled when cgo is enabled
```

Alternative 1: Pure Go Implementations
Many C libraries have pure Go ports:
SQLite Example
```go
// Using CGO (mattn/go-sqlite3)
import _ "github.com/mattn/go-sqlite3"

db, _ := sql.Open("sqlite3", "file.db")

// Using pure Go (modernc.org/sqlite)
import _ "modernc.org/sqlite"

db, _ := sql.Open("sqlite", "file.db")
```

Performance comparison:
- CGO sqlite3: 10 µs per INSERT
- Pure Go sqlite: 50 µs per INSERT
- Overhead acceptable for most use cases
Advantages of pure Go:
- Cross-platform compatibility
- No CGO overhead
- Easier deployment
- Simpler security model
Alternative 2: WASM (WebAssembly)
For some use cases, compiling C code to WASM is viable:
```shell
# Compile C to WebAssembly, e.g. with Emscripten
emcc mylib.c -o mylib.wasm
```

The resulting module can then be executed from Go with a WebAssembly runtime such as wazero (which is pure Go, so no CGO is involved). This provides isolation and better error handling than direct CGO.
Alternative 3: Shared Memory and mmap
For large datasets, share memory via files instead of function calls:
```go
// Instead of processing in C via CGO:
// write the data to a file and let a C program process it separately.
data := []byte{ /* ... */ }
os.WriteFile("input.bin", data, 0644)

// Call the C program as an external process
cmd := exec.Command("./c_processor", "input.bin", "output.bin")
cmd.Run()

// Read the results
results, _ := os.ReadFile("output.bin")
```

This eliminates CGO overhead entirely, trading it for I/O and process-startup latency.
Alternative 4: gRPC to Sidecar
For complex integrations, run C code in a separate process:
```go
// Go service (port 5000); pb, ctx, and data come from the
// generated gRPC code and surrounding context
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Connect to the C sidecar service
	conn, _ := grpc.Dial("localhost:5001",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	defer conn.Close()
	client := pb.NewProcessorClient(conn)
	result, _ := client.Process(ctx, &pb.Request{Data: data})
	_ = result
}

// The C service on port 5001 listens for gRPC requests.
// Benefits: isolation, process separation, language-agnostic.
```

Tradeoffs:
- Added latency (tens of microseconds per loopback RPC)
- Process isolation and safety
- Easy to replace or scale C component
Using purego for CGO-Free C Calls
The purego library allows calling C functions without CGO:
```go
import "github.com/ebitengine/purego"

// Bind a C function from a shared library to a Go func value.
// (Library and symbol names are illustrative; Dlopen is available
// on Linux, macOS, and FreeBSD.)
var someFunc func(int32, int32) int32

func init() {
	lib, err := purego.Dlopen("libsome.so", purego.RTLD_NOW|purego.RTLD_GLOBAL)
	if err != nil {
		panic(err)
	}
	purego.RegisterLibFunc(&someFunc, lib, "someFunc")
}

func main() {
	result := someFunc(5, 3) // calls the C function without cgo
	_ = result
}
```

Advantages:
- No CGO build overhead (no C toolchain needed at build time)
- Cross-compilation friendly
Disadvantages:
- Manual function registration
- Less type safety than CGO
- Still pays a per-call overhead comparable to CGO
Real-World Benchmark: sqlite3 via CGO vs Pure Go
```go
package main

import (
	"database/sql"
	"testing"

	_ "github.com/mattn/go-sqlite3" // CGO driver
	// _ "modernc.org/sqlite"       // Pure Go driver
)

func BenchmarkInsertCGO(b *testing.B) {
	db, _ := sql.Open("sqlite3", ":memory:")
	defer db.Close()
	db.Exec("CREATE TABLE test (id INTEGER PRIMARY KEY, value TEXT)")
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		db.Exec("INSERT INTO test (value) VALUES (?)", "data")
	}
}

func BenchmarkInsertPureGo(b *testing.B) {
	db, _ := sql.Open("sqlite", "file::memory:?cache=shared")
	defer db.Close()
	db.Exec("CREATE TABLE test (id INTEGER PRIMARY KEY, value TEXT)")
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		db.Exec("INSERT INTO test (value) VALUES (?)", "data")
	}
}
```

Results:
```
BenchmarkInsertCGO-8       150000    6500 ns/op   ← includes ~1 µs of CGO overhead
BenchmarkInsertPureGo-8     30000   35000 ns/op   ← pure Go slower but acceptable
```

Guidelines for Using CGO
Tip: Profile your code with CGO disabled to understand the actual impact. Use CGO_ENABLED=0 to disable CGO entirely.
- Only use CGO for substantial computations - when the C work takes well over 100 nanoseconds
- Batch operations - make one CGO call instead of many
- Avoid CGO in hot loops - measure the impact with profiling
- Consider pure Go alternatives - they're often faster than expected
- Profile before optimizing - use go tool pprof to find actual bottlenecks
- Test cross-compilation needs - CGO complicates deployment
- Document the tradeoff - explain why CGO was chosen over pure Go
Summary
CGO enables leveraging high-performance C libraries, but at a significant cost (100-200 ns per call). This overhead is worthwhile only when:
- The C computation takes microseconds or more
- Operations can be batched to amortize call overhead
- No pure Go equivalent exists or is significantly slower
- Deployment and cross-compilation complexity is acceptable
For most Go applications, pure Go implementations are preferable. Reserve CGO for genuinely compute-intensive operations where the C library provides substantial (10x+) speedup over pure Go alternatives.
The key principle: Make the CGO calls worth the overhead by doing significant work in C, not thin wrappers around simple operations.