locomp
A Python GPU Kernel Compiler for Apple Silicon, NVIDIA, AMD & RISC-V
Write a GPU kernel once as a plain Python function. locomp compiles it through an SSA intermediate representation into native Metal, CUDA, HIP or RISC-V vector code — one kernel runs everywhere, no rewrites.
One kernel, every GPU
locomp is an open-source GPU kernel compiler. You write a GPU kernel as a plain Python function, and locomp compiles it through an SSA intermediate representation into native code for your target hardware — Metal on Apple Silicon, CUDA C on NVIDIA, HIP C on AMD ROCm, or C + RISC-V Vector (RVV) intrinsics on RISC-V.
Think Triton, but with a wider set of targets: one @locomp.kernel runs on M1, A100, MI300X and RISC-V without rewriting a line. Triton primarily targets NVIDIA GPUs, with AMD ROCm support as well; locomp adds Apple Silicon and RISC-V to that list.
import locomp
import numpy as np
@locomp.kernel
def vector_add(X: locomp.Tensor, Y: locomp.Tensor, O: locomp.Tensor,
               N: locomp.constexpr):
    i = locomp.program_id(0)
    locomp.store(O + i, locomp.load(X + i) + locomp.load(Y + i))
How it works
Your Python function is compiled, not interpreted. The compiled pipeline is cached per constexpr configuration, so repeated calls have near-zero overhead.
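To make the optimizer stage of this pipeline more concrete, here is a minimal sketch of two of the passes it names, constant folding and dead-code elimination (DCE), on a toy SSA program. The `Instr` class, the opcode names and both functions are hypothetical illustrations and are not locomp's actual IR or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dest: str    # SSA name defined by this instruction
    op: str      # "const", "add", "mul", "load" or "store" (toy opcodes)
    args: tuple  # operand names, or a single literal for "const"


def fold_constants(prog):
    """Replace add/mul of two known constants with a single const."""
    known = {}  # SSA name -> literal value
    out = []
    for ins in prog:
        if ins.op == "const":
            known[ins.dest] = ins.args[0]
        elif ins.op in ("add", "mul") and all(a in known for a in ins.args):
            a, b = (known[x] for x in ins.args)
            value = a + b if ins.op == "add" else a * b
            known[ins.dest] = value
            ins = Instr(ins.dest, "const", (value,))
        out.append(ins)
    return out


def eliminate_dead_code(prog):
    """Drop instructions whose result is never used; stores are always kept."""
    live = set()
    kept = []
    for ins in reversed(prog):
        if ins.op == "store" or ins.dest in live:
            kept.append(ins)
            if ins.op != "const":
                live.update(ins.args)  # operands of a kept instruction are live
    kept.reverse()
    return kept


prog = [
    Instr("c2", "const", (2,)),
    Instr("c3", "const", (3,)),
    Instr("c6", "mul", ("c2", "c3")),   # foldable: both operands are constants
    Instr("x", "load", ("X",)),
    Instr("y", "mul", ("x", "c6")),
    Instr("dead", "add", ("x", "c2")),  # result is never used
    Instr("_", "store", ("O", "y")),
]
optimized = eliminate_dead_code(fold_constants(prog))
```

Folding turns `c6` into a constant, after which DCE removes `dead` along with the now-unused `c2` and `c3`, leaving four instructions. Because each SSA name is defined exactly once, both passes need only a single linear scan.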
@locomp.kernel (Python function)
│
Python AST → Locomp IR (SSA, 60+ opcodes)
│
Optimizer (CSE · DCE · constant folding · type
inference · strength reduction)
│
├── backend="metal" → Metal Shading Language → Apple GPU
├── backend="cuda" → CUDA C (nvcc -O3) → NVIDIA GPU
├── backend="rocm" → HIP C (hipcc -O3) → AMD GPU
└── backend="riscv" → C + RVV intrinsics → RISC-V CPU
Built for serving engines
locomp is designed to drop in as the kernel-compilation layer under a GPU inference server. Instead of shipping pre-compiled shaders or PTX, the server compiles each kernel on first request and keeps the result in a persistent on-disk cache.
- Specialization per config — constexpr params (batch size, seq len, head dim) become hardware literals, generating a separate optimized pipeline per shape with no dynamic dispatch overhead.
- Persistent pipeline cache — compiled kernels are written to ~/.cache/locomp/, so server restarts are instant after the first run.
- Async dispatch + batch mode — multiple kernel calls in one command buffer; GPU pipelines work while the CPU prepares the next batch.
- Backend-agnostic — the same kernel auto-selects Metal / CUDA / ROCm / RISC-V based on available hardware.
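The specialization and caching ideas in the list above can be sketched in plain Python. This is an illustration under stated assumptions and not locomp's implementation: `specialize_source`, `cache_key`, `get_pipeline` and the string template are hypothetical, and "compiling" is reduced to string formatting.

```python
import hashlib

_pipeline_cache = {}  # cache key -> "compiled" artifact (here: specialized source)
compile_count = 0     # counts cache misses, for illustration only


def specialize_source(template, constexprs):
    """Bake constexpr values into the source as literals (stand-in for codegen)."""
    return template.format(**constexprs)


def cache_key(name, backend, constexprs):
    """Same kernel + backend + constexpr values always yields the same key."""
    payload = repr((name, backend, sorted(constexprs.items())))
    return hashlib.sha256(payload.encode()).hexdigest()


def get_pipeline(name, template, backend, **constexprs):
    """Return the specialized pipeline, compiling only on a cache miss."""
    global compile_count
    key = cache_key(name, backend, constexprs)
    if key not in _pipeline_cache:
        compile_count += 1  # a real compiler would invoke the backend toolchain here
        _pipeline_cache[key] = specialize_source(template, constexprs)
    return _pipeline_cache[key]


# Hypothetical kernel body with two constexpr placeholders.
TEMPLATE = "for (int i = 0; i < {N}; ++i) out[i] = x[i] * {SCALE};"

a = get_pipeline("scale", TEMPLATE, "cuda", N=1024, SCALE=2)
b = get_pipeline("scale", TEMPLATE, "cuda", N=1024, SCALE=2)  # cache hit
c = get_pipeline("scale", TEMPLATE, "cuda", N=2048, SCALE=2)  # new specialization
```

Each distinct constexpr configuration triggers exactly one compilation. A stable digest such as this SHA-256 key could also name a file in an on-disk cache, though the actual layout of ~/.cache/locomp/ is not described here.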
NVIDIA A100 vs PyTorch
Large-model kernels, fused with no temporary allocations.
Apple M1 vs MLX 0.31
float32, median of 10 timed runs after 3 warmup runs. A speedup above 1 means locomp is faster.
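A measurement protocol of this shape (untimed warmup, then the median of the timed runs) might look like the sketch below. The function names are hypothetical, and speedup is assumed to be baseline time divided by locomp time, which is consistent with values above 1 favouring locomp.

```python
import statistics
import time


def median_runtime(fn, warmup=3, runs=10):
    """Call fn `warmup` times untimed, then return the median of `runs` timings."""
    for _ in range(warmup):
        fn()  # warm caches and trigger any first-call compilation
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds, candidate_seconds):
    """Values above 1 mean the candidate is faster than the baseline."""
    return baseline_seconds / candidate_seconds
```

When timing GPU work, `fn` has to block until the device has finished. Otherwise the timer captures only the cost of launching the kernel.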
AMD MI300X — memory bandwidth
HIP C compiled with hipcc --offload-arch=gfx942 -O3, measured on real hardware (192 GB HBM3, 5,300 GB/s theoretical peak).
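For reference, achieved memory bandwidth is usually computed as bytes moved divided by kernel time and then compared against the theoretical peak. The helpers below are a hypothetical sketch of that arithmetic. Only the 5,300 GB/s peak comes from the text, and the byte count assumes a float32 vector add that reads two arrays and writes one.

```python
MI300X_PEAK_GBS = 5300.0  # theoretical peak quoted above, in GB/s


def vector_add_bytes(n, dtype_bytes=4):
    """Bytes moved by O[i] = X[i] + Y[i]: two reads and one write per element."""
    return 3 * n * dtype_bytes


def achieved_bandwidth_gbs(bytes_moved, seconds):
    """Achieved bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9


def fraction_of_peak(gbs, peak_gbs=MI300X_PEAK_GBS):
    """Share of the theoretical peak that a measured bandwidth reaches."""
    return gbs / peak_gbs
```

For example, `fraction_of_peak(achieved_bandwidth_gbs(vector_add_bytes(n), t))` gives the share of peak reached for a measured runtime `t` in seconds.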
A full LLM on pure kernels
SmolLM2-135M runs entirely on @locomp.kernel — no PyTorch, no MLX, no Metal C++. The complete inference loop is just ten pure Python kernel functions: rms_norm, matvec, silu_mul, rope, gqa_attn, kv_cache_update, and a handful of element-wise ops.
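As a reference for what two of these kernels compute, here are plain-Python versions of the standard RMSNorm and SiLU-gated multiply definitions used in Llama-style models. This is a sketch of the math only: the signatures and the `eps` default are assumptions and are not taken from the locomp example file.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]"""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def silu_mul(gate, up):
    """y[i] = silu(gate[i]) * up[i], where silu(g) = g * sigmoid(g)"""
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]
```

A GPU kernel for either operation computes the same values in parallel, which makes reference functions like these convenient for checking kernel output.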
$ python examples/54_smollm2_inference.py
SmolLM2-135M — locomp GPU inference
Loading weights... 272 tensors, 538MB
Prompt: "The meaning of life is"
Output: "to be found in the meaning of the universe."
Decode: 7.9 tok/s
Where locomp fits
locomp is the only Python kernel compiler that targets Apple Silicon, NVIDIA CUDA, AMD ROCm and RISC-V from a single source — and the only one with CPU + GPU autograd, built-in auto-tuning and native INT4/INT8 quantization support.
Tested everywhere
- Apple M1 & M4 — 227 passed, 22 skipped.
- NVIDIA A100 (Modal) — 64/64 execution checks; A10G — 17/17 CUDA codegen + runtime.
- RISC-V RVV under QEMU — 9/9 execution tests, all matching NumPy.
- 63 kernel examples shipped, from vector add to flash attention and full LLM inference. Apache-2.0 licensed.