From a8d4f02274272c4c50aa90c6944c07c57f0055cc Mon Sep 17 00:00:00 2001
From: Andy
Date: Mon, 2 Mar 2026 12:37:55 +0300
Subject: [PATCH 1/2] docs: add public architecture doc, dev.to link, update performance guide
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Add docs/ARCHITECTURE.md — public architecture documentation (v0.4.1)
- Add dev.to article badge and blockquote link to README.md
- Update Related Projects with ecosystem links (webgpu, born-ml, gogpu)
- Fix broken link: docs/dev/TECHNICAL_ARCHITECTURE.md → docs/ARCHITECTURE.md
- Update docs/PERFORMANCE.md to v0.4.1: Go 1.26 CGO section, purego comparison
---
 README.md            |  40 ++++--
 docs/ARCHITECTURE.md | 307 +++++++++++++++++++++++++++++++++++++++++++
 docs/PERFORMANCE.md  |  63 +++++----
 3 files changed, 372 insertions(+), 38 deletions(-)
 create mode 100644 docs/ARCHITECTURE.md

diff --git a/README.md b/README.md
index 1c5ee43..39e7bef 100644
--- a/README.md
+++ b/README.md
@@ -7,9 +7,12 @@
 [![Go version](https://img.shields.io/github/go-mod/go-version/go-webgpu/goffi)](https://github.com/go-webgpu/goffi/blob/main/go.mod)
 [![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
 [![Go Reference](https://pkg.go.dev/badge/github.com/go-webgpu/goffi.svg)](https://pkg.go.dev/github.com/go-webgpu/goffi)
+[![Dev.to](https://img.shields.io/badge/dev.to-deep%20dive-0A0A0A?logo=devdotto)](https://dev.to/kolkov/goffi-zero-cgo-foreign-function-interface-for-go-how-we-call-c-libraries-without-a-c-compiler-ca5)
 
 **Pure Go Foreign Function Interface (FFI)** for calling C libraries without CGO. Primary use case: **WebGPU bindings** for GPU computing in pure Go.
 
+> **Read the deep dive:** [goffi: Zero-CGO FFI for Go — How We Call C Libraries Without a C Compiler](https://dev.to/kolkov/goffi-zero-cgo-foreign-function-interface-for-go-how-we-call-c-libraries-without-a-c-compiler-ca5)
+
 ```go
 // Call C functions directly from Go - no CGO required!
 handle, _ := ffi.LoadLibrary("wgpu_native.dll")
@@ -253,24 +256,32 @@ ffi.PrepareCallInterface(cif, convention, returnType, argTypes)
 
 ## 💎 Why goffi?
 
+### goffi vs purego vs CGO
+
 | Feature | **goffi** | purego | CGO |
 |---------|-----------|--------|-----|
 | **C compiler required** | No | No | Yes |
-| **Typed FFI (struct passing)** | ✅ Full struct support | ❌ Scalar only | ✅ |
-| **Typed errors** | ✅ 5 error types | ❌ Generic errors | N/A |
+| **API style** | libffi-like (prepare once, call many) | reflect-based (RegisterFunc) | Native |
+| **Per-call allocations** | Zero (CIF reusable) | sync.Pool per call | Zero |
+| **Struct pass/return** | ✅ Full (9-16B RAX+RDX, sret >16B) | ✅ Full | ✅ |
+| **Callback float returns** | ✅ XMM0 in asm | ❌ panic | ✅ |
+| **ARM64 HFA detection** | Recursive (nested structs) | Top-level only | Full |
+| **Typed errors** | ✅ 5 error types + errors.As() | ❌ Generic | N/A |
 | **Context support** | ✅ Timeouts/cancellation | ❌ | ❌ |
 | **C-thread callbacks** | ✅ crosscall2 | ✅ crosscall2 | ✅ |
-| **ARM64 performance** | 64 ns/op | ~60 ns/op | ~2 ns/op |
+| **String/bool/slice args** | ❌ Raw pointers only | ✅ Auto-marshaling | ✅ |
+| **Platform breadth** | 5 targets (quality focus) | 9+ architectures | All |
 | **AMD64 performance** | 88-114 ns/op | ~100 ns/op | ~2 ns/op |
-| **Call interface reuse** | ✅ PrepareCallInterface | ❌ Reflect per call | N/A |
-| **WebGPU-optimized** | ✅ Primary target | General purpose | General purpose |
 
-**Key advantages over purego:**
-- **Typed FFI** — pass/return structs by value, not just scalars
-- **Typed errors** — `errors.As()` for precise error handling (`LibraryError`, `TypeValidationError`, etc.)
-- **Context support** — `CallFunctionContext()` with timeouts and cancellation
-- **Call interface reuse** — prepare once, call many times (zero per-call reflection overhead)
-- **WebGPU focus** — designed specifically for GPU bindings with wgpu-native
+### Design philosophy
+
+**goffi** is a low-level **libffi-style** interface: describe types once via `TypeDescriptor`, pre-compute classification into a `CallInterface`, call many times with zero overhead. Designed for GPU/real-time workloads where every nanosecond counts.
+
+**purego** is a high-level **reflect-based** wrapper: write a Go function signature, get a callable via `RegisterFunc`. More ergonomic, broader platform support, but reflect dispatch on every call.
+
+**Choose goffi when**: you need struct passing, zero per-call allocations, callback float returns, typed errors, or WebGPU/GPU bindings.
+
+**Choose purego when**: you need string auto-marshaling, broad architecture support (386, ppc64le, riscv64...), or quick one-off C library bindings.
 
 ---
 
@@ -296,7 +307,7 @@ C Function (External Library)
 - Hand-written assembly for System V AMD64, Win64, and AAPCS64 ABIs
 - Runtime type validation (no codegen/reflection)
 
-See [docs/dev/TECHNICAL_ARCHITECTURE.md](docs/dev/TECHNICAL_ARCHITECTURE.md) for deep dive (internal docs).
+See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for detailed architecture documentation.
 
 ---
 
@@ -411,7 +422,10 @@ MIT License - see [LICENSE](LICENSE) for details.
 
 ## 🔗 Related Projects
 
-- **[go-webgpu](https://github.com/go-webgpu/go-webgpu)** - WebGPU bindings using goffi (coming soon!)
+- **[Dev.to Article](https://dev.to/kolkov/goffi-zero-cgo-foreign-function-interface-for-go-how-we-call-c-libraries-without-a-c-compiler-ca5)** - Deep dive: how goffi works, architecture, and ecosystem
+- **[go-webgpu/webgpu](https://github.com/go-webgpu/webgpu)** - Zero-CGO WebGPU bindings (wgpu-native)
+- **[born-ml/born](https://github.com/born-ml/born)** - ML framework for Go, GPU-accelerated
+- **[gogpu](https://github.com/gogpu)** - GPU computing ecosystem (dual Rust + Pure Go backends)
 - **[wgpu-native](https://github.com/gfx-rs/wgpu-native)** - Native WebGPU implementation
 
 ---
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
new file mode 100644
index 0000000..a5c32d7
--- /dev/null
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,307 @@
+# Architecture: goffi FFI Implementation
+
+> **How goffi calls C functions from pure Go — assembly trampolines, calling conventions, and type safety**
+
+---
+
+## Overview
+
+**goffi** is a zero-dependency Foreign Function Interface (FFI) for Go.
It calls C library functions without CGO by using:
+
+- **Hand-written assembly** for each platform ABI
+- **`runtime.cgocall`** for GC-safe Go→C stack switching
+- **`crosscall2`** for safe C→Go callback transitions (any thread)
+- **Runtime type validation** via `TypeDescriptor` — no codegen, no reflection
+
+---
+
+## Four-Layer Architecture
+
+Every goffi call traverses four layers:
+
+```
+┌─────────────────────────────────────────────┐
+│  Layer 1: Go Code                           │
+│  ffi.CallFunction(cif, fn, &result, args)   │
+│  Type validation, CIF pre-computation       │
+└──────────────────┬──────────────────────────┘
+                   │
+                   ▼
+┌─────────────────────────────────────────────┐
+│  Layer 2: runtime.cgocall                   │
+│  Switch to system stack (g0)                │
+│  Mark goroutine as blocked, allow GC        │
+└──────────────────┬──────────────────────────┘
+                   │
+                   ▼
+┌─────────────────────────────────────────────┐
+│  Layer 3: Assembly Wrapper                  │
+│  Load registers per ABI (GP + SSE/FP)       │
+│  Call target function, save return values   │
+└──────────────────┬──────────────────────────┘
+                   │
+                   ▼
+┌─────────────────────────────────────────────┐
+│  Layer 4: C Function (external library)     │
+│  Executes and returns via standard ABI      │
+└─────────────────────────────────────────────┘
+```
+
+### Layer 1: Call Interface (CIF) Pre-computation
+
+Unlike reflect-based approaches, goffi classifies arguments and computes stack layout **once** at preparation time:
+
+```go
+cif := &types.CallInterface{}
+ffi.PrepareCallInterface(cif, types.DefaultCall,
+    types.UInt64TypeDescriptor, // return type
+    []*types.TypeDescriptor{types.PointerTypeDescriptor}, // arg types
+)
+
+// cif now contains:
+// - Argument classification (GP register / SSE register / stack)
+// - Stack size and alignment
+// - Flags bitmask for assembly dispatch
+// Reuse cif for all subsequent calls — zero allocation per call.
+```
+
+### Layer 2: runtime.cgocall
+
+`runtime.cgocall` is Go's internal mechanism for calling C code safely:
+
+1. Switches to system stack (g0)
+2. Marks goroutine as "in syscall" — allows GC to proceed
+3. Calls our assembly wrapper
+4. Restores Go stack on return
+
+We access it via `//go:linkname`:
+
+```go
+//go:linkname runtime_cgocall runtime.cgocall
+func runtime_cgocall(fn uintptr, arg unsafe.Pointer) int32
+```
+
+### Layer 3: Platform Assembly
+
+Hand-written assembly for each ABI. The function receives a struct pointer containing all arguments, loads registers, calls the target, and saves return values.
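
To make the register-loading step more concrete, here is a minimal sketch of how scalar arguments could be assigned to register classes before such a call, using the System V AMD64 limits this document lists (6 GP registers, 8 SSE registers, remaining arguments on the stack). The names `argKind`, `slot`, and `classify` are hypothetical and are not goffi's API; struct arguments, HFA handling, and the other ABIs are deliberately left out.

```go
package main

import "fmt"

// argKind is a simplified, hypothetical stand-in for a type kind.
type argKind int

const (
	kindInteger argKind = iota // integers and pointers
	kindFloat                  // float and double
)

// slot records where one argument is placed for the call.
type slot struct {
	class string // "GP", "SSE", or "STACK"
	index int    // register number or stack slot within that class
}

const (
	maxGP  = 6 // RDI, RSI, RDX, RCX, R8, R9
	maxSSE = 8 // XMM0-XMM7
)

// classify assigns each scalar argument to a register class in order,
// spilling to the stack once a class runs out of registers.
func classify(args []argKind) []slot {
	slots := make([]slot, 0, len(args))
	gp, sse, stack := 0, 0, 0
	for _, kind := range args {
		switch {
		case kind == kindInteger && gp < maxGP:
			slots = append(slots, slot{"GP", gp})
			gp++
		case kind == kindFloat && sse < maxSSE:
			slots = append(slots, slot{"SSE", sse})
			sse++
		default:
			slots = append(slots, slot{"STACK", stack})
			stack++
		}
	}
	return slots
}

func main() {
	// Seven integer arguments and one double: the seventh integer spills.
	args := []argKind{
		kindInteger, kindInteger, kindInteger, kindInteger,
		kindInteger, kindInteger, kindInteger, kindFloat,
	}
	for i, s := range classify(args) {
		fmt.Printf("arg %d -> %s[%d]\n", i, s.class, s.index)
	}
}
```

Doing this assignment once at preparation time, rather than on every call, is what lets the assembly wrapper simply copy values from a pre-laid-out argument struct.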
+
+**System V AMD64** (`syscall_unix_amd64.s`):
+
+```asm
+TEXT syscallN(SB), NOSPLIT|NOFRAME, $0
+    // R11 = args struct pointer
+    // Load 6 GP registers: RDI, RSI, RDX, RCX, R8, R9
+    // Load 8 SSE registers: XMM0-XMM7
+    // Push stack-spill args if needed
+    CALL R10                // call target function
+    // Save RAX (int return), RDX (second return), XMM0 (float return)
+```
+
+**Win64** (`syscall_windows_amd64.s`):
+
+```asm
+// 4 GP registers: RCX, RDX, R8, R9
+// 4 SSE registers: XMM0-XMM3
+// 32-byte shadow space mandatory
+```
+
+**AAPCS64 ARM64** (`syscall_unix_arm64.s`):
+
+```asm
+// 8 GP registers: X0-X7
+// 8 FP registers: D0-D7
+// HFA (Homogeneous Floating-point Aggregate) support
+```
+
+---
+
+## Calling Conventions
+
+| Feature | System V AMD64 | Win64 | AAPCS64 |
+|---------|---------------|-------|---------|
+| **GP Registers** | RDI, RSI, RDX, RCX, R8, R9 | RCX, RDX, R8, R9 | X0-X7 |
+| **FP Registers** | XMM0-XMM7 | XMM0-XMM3 | D0-D7 |
+| **Shadow Space** | None | 32 bytes mandatory | None |
+| **Stack Alignment** | 16-byte | 16-byte | 16-byte |
+| **Int Return** | RAX | RAX | X0 |
+| **Float Return** | XMM0 | XMM0 | D0 |
+| **Struct ≤8B** | RAX | RAX | X0 |
+| **Struct 9-16B** | RAX + RDX | N/A (sret) | X0 + X1 |
+| **Struct >16B** | Hidden sret pointer | Hidden sret pointer | Hidden sret pointer |
+| **HFA** | N/A | N/A | D0-D3 (up to 4 floats) |
+
+---
+
+## Struct Return Handling
+
+ABI rules for returning structs depend on size:
+
+- **≤ 8 bytes**: returned in RAX (AMD64) or X0 (ARM64)
+- **9-16 bytes** (AMD64): split across RAX (low 8) + RDX (high 8)
+- **> 16 bytes**: caller passes a hidden pointer as the first argument (sret)
+
+Implementation in `internal/arch/amd64/implementation.go`:
+
+```go
+case types.StructType:
+    size := cif.ReturnType.Size
+    switch {
+    case size <= 8:
+        *(*uint64)(rvalue) = retVal
+    case size <= 16:
+        *(*uint64)(rvalue) = retVal // RAX → bytes 0-7
+        remaining := size - 8
+        src := (*[8]byte)(unsafe.Pointer(&retVal2))
+        dst
:= (*[8]byte)(unsafe.Add(rvalue, 8))
+        copy(dst[:remaining], src[:remaining]) // RDX → bytes 8-15
+    }
+```
+
+---
+
+## Callback System (C → Go)
+
+Callbacks allow C code to call back into Go — critical for async APIs like WebGPU.
+
+### The Problem
+
+C threads (e.g., Metal/Vulkan internal threads) have no goroutine (`G = nil`). Calling Go code directly would crash the runtime.
+
+### Solution: crosscall2
+
+```
+C thread (wgpu-native, Metal, Vulkan)
+    │ calls trampoline (1 of 2000 pre-compiled entries)
+    ▼
+Assembly dispatcher
+    │ saves registers, loads callback index into R12 (ARM64) or stack (AMD64)
+    ▼
+crosscall2 → runtime.load_g → runtime.cgocallback
+    │ sets up goroutine, switches to Go stack
+    ▼
+Go callback function (user code)
+```
+
+### Trampoline Table
+
+2000 pre-compiled trampoline entries per process:
+
+**AMD64** (`callback_amd64.s`) — 5 bytes per entry:
+
+```asm
+CALL ·callbackDispatcher // 5-byte CALL, index derived from return address
+```
+
+**ARM64** (`callback_arm64.s`) — 8 bytes per entry:
+
+```asm
+MOVD $N, R12             // load callback index
+B ·callbackDispatcher    // branch (no link — preserves LR)
+```
+
+### Usage
+
+```go
+cb := ffi.NewCallback(func(status uint32, adapter uintptr, msg uintptr, ud uintptr) {
+    // Safe even when called from a C thread
+})
+// Pass cb (uintptr) as a function pointer argument to C code
+```
+
+---
+
+## Type System
+
+### TypeDescriptor
+
+All types are described at runtime via `TypeDescriptor` — no reflection, no codegen:
+
+```go
+type TypeDescriptor struct {
+    Size      uint16            // Size in bytes
+    Alignment uint16            // Alignment requirement
+    Kind      TypeKind          // VoidType, SInt32Type, DoubleType, StructType, etc.
+    Members   []*TypeDescriptor // For structs (recursive)
+}
+```
+
+Predefined descriptors for all C primitive types: `VoidTypeDescriptor`, `SInt8TypeDescriptor` through `UInt64TypeDescriptor`, `FloatTypeDescriptor`, `DoubleTypeDescriptor`, `PointerTypeDescriptor`.
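
Because a struct descriptor carries its own `Size` and `Alignment`, those values have to agree with the layout a C compiler would produce for the same members. The sketch below shows one way to derive them under standard C layout rules (each member aligned to its own alignment, total size rounded up to the largest member alignment). It redeclares a local copy of `TypeDescriptor` so that it compiles on its own, and `layoutStruct`, `alignUp`, `doubleType`, and `sint32Type` are hypothetical names for illustration, not part of goffi.

```go
package main

import "fmt"

// Local, simplified mirror of the descriptor shown above (illustration only).
type TypeKind int

const (
	SInt32Type TypeKind = iota
	DoubleType
	StructType
)

type TypeDescriptor struct {
	Size      uint16
	Alignment uint16
	Kind      TypeKind
	Members   []*TypeDescriptor
}

// Hypothetical primitive descriptors with the usual 64-bit C sizes.
var (
	sint32Type = &TypeDescriptor{Size: 4, Alignment: 4, Kind: SInt32Type}
	doubleType = &TypeDescriptor{Size: 8, Alignment: 8, Kind: DoubleType}
)

// alignUp rounds n up to the next multiple of a (a must be a power of two).
func alignUp(n, a uint16) uint16 {
	return (n + a - 1) &^ (a - 1)
}

// layoutStruct computes Size and Alignment for a struct with the given
// members, following standard C layout rules.
func layoutStruct(members ...*TypeDescriptor) *TypeDescriptor {
	var offset, maxAlign uint16 = 0, 1
	for _, m := range members {
		offset = alignUp(offset, m.Alignment) // padding before the member
		offset += m.Size
		if m.Alignment > maxAlign {
			maxAlign = m.Alignment
		}
	}
	return &TypeDescriptor{
		Size:      alignUp(offset, maxAlign), // tail padding
		Alignment: maxAlign,
		Kind:      StructType,
		Members:   members,
	}
}

func main() {
	point := layoutStruct(doubleType, doubleType)
	mixed := layoutStruct(sint32Type, doubleType)
	fmt.Println(point.Size, point.Alignment)
	fmt.Println(mixed.Size, mixed.Alignment)
}
```

A struct of an `int32` followed by a `double` ends up 16 bytes wide rather than 12, which is the kind of padding that is easy to get wrong when sizes are written by hand.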
+
+### Struct Types
+
+Composite types require explicit member definitions:
+
+```go
+pointType := &types.TypeDescriptor{
+    Size:      16,
+    Alignment: 8,
+    Kind:      types.StructType,
+    Members: []*types.TypeDescriptor{
+        types.DoubleTypeDescriptor, // x
+        types.DoubleTypeDescriptor, // y
+    },
+}
+```
+
+### Validation
+
+`PrepareCallInterface` validates all types at preparation time:
+
+- Nil checks on all descriptors
+- Size > 0 for non-void types
+- Struct members recursively validated
+- Alignment power-of-two check
+- Argument count within platform limits
+
+Five typed error types for precise error handling: `InvalidCallInterfaceError`, `LibraryError`, `CallingConventionError`, `TypeValidationError`, `UnsupportedPlatformError`.
+
+---
+
+## Platform Support
+
+| Platform | Architecture | ABI | Status |
+|----------|-------------|-----|--------|
+| **Linux** | AMD64 | System V | Production |
+| **Windows** | AMD64 | Win64 | Production |
+| **macOS** | AMD64 | System V | Production |
+| **FreeBSD** | AMD64 | System V | Production (untested) |
+| **Linux** | ARM64 | AAPCS64 | Production |
+| **macOS** | ARM64 | AAPCS64 | Production (tested on M3 Pro) |
+
+---
+
+## Key Files
+
+| File | Purpose |
+|------|---------|
+| `ffi/ffi.go` | Public API: `PrepareCallInterface`, `CallFunction` |
+| `ffi/cif.go` | CIF preparation, type validation, stack calculation |
+| `ffi/call.go` | Delegation to platform-specific implementations |
+| `ffi/errors.go` | 5 typed error types |
+| `ffi/callback.go` | AMD64 Unix callback trampolines (2000 entries) |
+| `ffi/callback_arm64.go` | ARM64 callback trampolines (2000 entries) |
+| `ffi/callback_windows.go` | Windows callbacks via `syscall.NewCallback` |
+| `types/types.go` | TypeDescriptor, CallingConvention, constants |
+| `internal/arch/amd64/classification.go` | Argument/return type classification |
+| `internal/arch/amd64/implementation.go` | Return value handling (`handleReturn`) |
+| `internal/arch/amd64/call_unix.go` | Unix AMD64
execution |
+| `internal/arch/arm64/implementation.go` | ARM64 AAPCS64 implementation |
+| `internal/arch/arm64/classification.go` | HFA detection, ARM64 classification |
+| `internal/syscall/syscall_unix_amd64.s` | System V AMD64 assembly |
+| `internal/syscall/syscall_windows_amd64.s` | Win64 assembly |
+| `internal/syscall/syscall_unix_arm64.s` | ARM64 assembly |
+
+---
+
+## References
+
+1. [System V AMD64 ABI](https://gitlab.com/x86-psABIs/x86-64-ABI)
+2. [Win64 Calling Convention](https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention)
+3. [AAPCS64 (ARM64)](https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst)
+4. [Go runtime: cgocall.go](https://github.com/golang/go/blob/master/src/runtime/cgocall.go)
+5. [purego](https://github.com/ebitengine/purego) — inspiration for CGO-free approach
+6. [libffi](https://sourceware.org/libffi/) — reference for FFI architecture patterns
+
+---
+
+*Current version: v0.4.1 | Last updated: 2026-03-02*
diff --git a/docs/PERFORMANCE.md b/docs/PERFORMANCE.md
index 162170c..482a713 100644
--- a/docs/PERFORMANCE.md
+++ b/docs/PERFORMANCE.md
@@ -1,8 +1,8 @@
-# Performance Guide - goffi v0.1.0
+# Performance Guide - goffi v0.4.1
 
 > **Comprehensive performance analysis, benchmarks, and usage guidelines**
 > **Platform**: Windows AMD64, 12th Gen Intel Core i7-1255U
-> **Go Version**: 1.25.3
+> **Go Version**: 1.25+
 
 ---
 
@@ -14,8 +14,8 @@
 
 **Comparison**:
 - **goffi**: ~100 ns/op overhead
-- **CGO**: ~200-250 ns/op (estimated, similar mechanism)
-- **purego**: ~150-200 ns/op (estimated, similar approach)
+- **CGO**: ~140-170 ns/op (Go 1.26 reduced overhead ~30%)
+- **purego**: ~100-150 ns/op (similar approach)
 - **Direct Go**: ~0.2 ns/op (baseline)
 
 **Verdict**: goffi is **production-ready for WebGPU** and similar use cases where function calls are rare (< 10K/sec) and expensive (> 1µs each).
@@ -246,41 +246,54 @@ FFI overhead: 0.0001ms = 0.001% ✅
 
 | Aspect | goffi | CGO |
 |--------|-------|-----|
-| **Overhead** | ~100 ns | ~200-250 ns |
+| **Overhead** | ~100 ns | ~140-170 ns (Go 1.26) |
 | **Build** | Zero deps | Requires C compiler |
 | **Cross-compile** | ✅ Easy | ❌ Complex |
 | **Static binary** | ✅ Yes | ⚠️ Often requires libc |
-| **Performance** | **Better!** | Slower (more indirection) |
+
+> **Note**: Go 1.26 (Feb 2026) reduced CGO overhead ~30% by removing the dedicated syscall P state. goffi benefits from the same improvement — both use `runtime.cgocall` internally.
 
 ### goffi vs purego
 
 | Aspect | goffi | purego |
 |--------|-------|-------|
-| **Overhead** | ~100 ns | ~150-200 ns (estimated) |
-| **Type Safety** | ✅ TypeDescriptor validation | ⚠️ Manual |
-| **Error Handling** | ✅ 5 typed errors | ⚠️ Generic errors |
-| **Structs** | ✅ Auto layout calc | ❌ Manual |
-| **API Levels** | 3 (low/mid/high planned) | 1 (low) |
-| **Documentation** | ✅ Comprehensive | ⚠️ Basic |
+| **Overhead** | ~100 ns | ~100-150 ns |
+| **Per-call allocations** | Zero (CIF reused) | sync.Pool per call |
+| **Type Safety** | ✅ TypeDescriptor validation | Go reflect.Type |
+| **Error Handling** | ✅ 5 typed errors | Generic errors |
+| **Callback float returns** | ✅ XMM0 in asm | ❌ panic |
+| **ARM64 HFA** | Recursive struct walk | Top-level only |
+| **Context support** | ✅ Timeouts/cancellation | ❌ |
+| **Platforms** | 5 (quality focus) | 9+ (breadth focus) |
 
 ---
 
-## Performance Roadmap
+## Go 1.26 CGO Improvements
+
+Go 1.26 (released February 2026) [reduced cgo call overhead by ~30%](https://go.dev/doc/go1.26) by removing the dedicated syscall P state. Benchmarks on Apple M1 show `CgoCall` is 33% faster, `CgoCallWithCallback` is 21% faster.
+
+**What this means for goffi:**
+
+- **goffi benefits too** — our `runtime.cgocall` path gets the same ~30% speedup, because goffi uses the same Go runtime machinery internally
+- **CGO still requires a C compiler** at build time — goffi does not
+- **Cross-compilation** with CGO still requires cross-toolchains — `GOOS=linux GOARCH=arm64 go build` just works with goffi
+- **Static binaries** — CGO often pulls in libc, goffi produces fully static Go binaries
 
-### v0.2.0 - Profiling Tools
-- [ ] Built-in profiler (`ffi.EnableProfiling()`)
-- [ ] Call statistics (frequency, duration)
-- [ ] Hotspot detection
+The gap between CGO and pure-Go FFI is narrowing from both directions. We welcome it.
+
+---
+
+## Performance Roadmap
 
-### v0.5.0 - Advanced Optimizations
-- [ ] JIT stub generation (reduce indirect jumps)
-- [ ] Batch API (`ffi.CallBatch()` for multiple calls)
-- [ ] Assembly micro-optimizations (target: ~70ns)
+### v0.5.0 - Usability + Optimization
+- [ ] Builder pattern API (less boilerplate)
+- [ ] Variadic function support
+- [ ] Assembly micro-optimizations
 
-### v1.0.0 - Production Tuning
+### v1.0.0 - Production Benchmarks
+- [ ] Comprehensive benchmarks vs CGO/purego (published)
 - [ ] Platform-specific tuning (Linux, macOS, ARM64)
-- [ ] Comprehensive benchmarks vs CGO/purego
-- [ ] Real-world case studies (WebGPU, Vulkan, SQLite)
+- [ ] Real-world case studies (WebGPU, Vulkan)
 
 ---
 
@@ -367,4 +380,4 @@ benchstat before.txt after.txt
 *Benchmarks conducted on Windows AMD64, Intel i7-1255U @ 12 cores*
 *Your results may vary depending on CPU, OS, and workload*
 
-*Last updated: 2025-01-17 | goffi v0.1.0*
+*Last updated: 2026-03-02 | goffi v0.4.1*

From 934087b2aa32c6aeed019be1524e793b4d4fb0bc Mon Sep 17 00:00:00 2001
From: Andy
Date: Mon, 2 Mar 2026 12:39:03 +0300
Subject: [PATCH 2/2] docs: add ARCHITECTURE.md link to Documentation section in README

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index
39e7bef..bc173aa 100644
--- a/README.md
+++ b/README.md
@@ -186,6 +186,7 @@ See [CHANGELOG.md](CHANGELOG.md#known-limitations) for full details.
 
 - **[CHANGELOG.md](CHANGELOG.md)** - Version history, migration guides
 - **[ROADMAP.md](ROADMAP.md)** - Development roadmap to v1.0
+- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** - Technical architecture deep dive
 - **[docs/PERFORMANCE.md](docs/PERFORMANCE.md)** - Comprehensive performance analysis
 - **[CONTRIBUTING.md](CONTRIBUTING.md)** - Contribution guidelines
 - **[SECURITY.md](SECURITY.md)** - Security policy and best practices