ZSE
Zero-dependency Server Engine for LLM Inference
A production LLM inference engine that owns the full stack — no PyTorch, no Triton, no transformers. Models load in seconds, not minutes, and serve in a fraction of the memory other engines need.
What is ZSE?
ZSE is a production LLM inference engine that owns the full stack. There is no PyTorch, no Triton, no bitsandbytes and no transformers — just pure Python, ctypes, and a kernel compiler that emits CUDA, ROCm (HIP) and Metal directly.
The result is an engine where models load in seconds, not minutes, and serve at a fraction of the memory other engines need. The entire system is roughly 40,600 lines of code with zero third-party dependencies, installable in about 5 MB.
pip install zse-engine # one package, zero transitive ML deps
zse serve qwen-7b.zse # 7-second cold start, 5.8 GB on a T4
The problem with the stack
Modern inference servers inherit a heavy dependency tree — PyTorch, Triton and a CUDA toolkit can total ~12 GB. That bloat translates directly into multi-minute cold starts and enormous VRAM footprints, making it expensive to run anything outside a top-tier data-center GPU.
ZSE asks a different question: what if the engine owned every layer — the quantized model format, the kernels, the scheduler and the server — so nothing is paid for that isn't used?
Owning the full stack
- A custom .zse model format: pre-quantized INT4/INT8/FP16, memory-mapped so weights load instantly instead of being deserialized on every boot.
- A pure-Python kernel compiler (zse-compiler) that emits CUDA C, HIP C and Metal Shading Language — 29 GPU kernels for the inference path.
- A continuous-batching scheduler (ZStreamer) with disaggregated prefill/decode, SLO-aware ordering, chunked prefill and speculative decoding.
- An OpenAI-compatible server with API keys, rate limiting, LoRA hot-swap and built-in RAG — all on pure asyncio, no web framework.
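To make the memory-mapping point concrete, here is a minimal sketch of why a mapped weight file opens almost instantly: the operating system pages data in on demand instead of the loader copying and deserializing it. The file layout below (a 4-byte magic, an element count, then raw float32 values) is a toy invented for illustration and is not the real .zse format; the function names are likewise hypothetical. It also assumes a little-endian host.

```python
import mmap
import struct

# Hypothetical toy layout, NOT the real .zse format:
# 4-byte magic, 4-byte little-endian element count, raw float32 payload.
MAGIC = b"TOY0"
HEADER_SIZE = 8


def write_toy_weights(path, values):
    """Write a list of floats in the toy layout described above."""
    with open(path, "wb") as f:
        f.write(MAGIC)
        f.write(struct.pack("<I", len(values)))
        f.write(struct.pack(f"<{len(values)}f", *values))


def open_toy_weights(path):
    """Map the file read-only and return (mapping, zero-copy float32 view).

    Nothing is deserialized here: the view points straight into the
    mapping, and pages are only read from disk when they are touched.
    """
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    if mm[:4] != MAGIC:
        mm.close()
        raise ValueError("not a toy weight file")
    (count,) = struct.unpack_from("<I", mm, 4)
    payload = memoryview(mm)[HEADER_SIZE:HEADER_SIZE + 4 * count]
    # cast("f") reinterprets the bytes as native float32 without copying.
    return mm, payload.cast("f")
```

The caller should release the view before closing the mapping (view.release(), then mm.close()), because a mapping cannot be closed while a buffer view of it is still exported.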
System architecture
HTTP / SSE · OpenAI API · Web dashboard · API key + RAG
│
ZStreamer — continuous batching, scheduling
│
Orchestrator │ KV Cache (PagedAttention) │ LoRA Mgr
29 GPU kernels│ adaptive blocks · token-evict│ hot-swap
│
.zse format │ VRAM allocator (unified) │ CUDA/HIP Graphs
│
ZSE Kernel Compiler — Python DSL → GPU code
│
CUDA C (nvrtc) · HIP C (hiprtc) · Metal (MSL)
No PyTorch · No Triton · No transformers
The kernel compiler queries each GPU's compute capability at runtime and emits the correct PTX / GCN / MSL automatically, so a new architecture usually works on day one.
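The idea of one Python description producing source for several GPU backends can be sketched in a few lines. This is not the zse-compiler DSL or its API; ElementwiseKernel, emit and nvrtc_arch_flag are hypothetical names, and the sketch covers only a trivial element-wise kernel. It relies on two facts: the same kernel text is valid CUDA C and HIP C for a case this simple, and NVRTC accepts a --gpu-architecture=compute_XY option built from the device's compute capability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementwiseKernel:
    """Hypothetical kernel description: out[i] = expr over a[i] and b[i]."""
    name: str
    expr: str


# CUDA C and HIP C share this syntax for a simple element-wise kernel.
_CUDA_LIKE = '''extern "C" __global__
void {name}(const float* a, const float* b, float* out, int n) {{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = {expr};
}}
'''

# Metal Shading Language version of the same kernel.
_MSL = '''#include <metal_stdlib>
using namespace metal;
kernel void {name}(device const float* a [[buffer(0)]],
                   device const float* b [[buffer(1)]],
                   device float* out [[buffer(2)]],
                   constant uint& n [[buffer(3)]],
                   uint i [[thread_position_in_grid]]) {{
    if (i < n) out[i] = {expr};
}}
'''


def emit(kernel, target):
    """Render kernel source text for 'cuda', 'hip' or 'metal'."""
    if target in ("cuda", "hip"):
        return _CUDA_LIKE.format(name=kernel.name, expr=kernel.expr)
    if target == "metal":
        return _MSL.format(name=kernel.name, expr=kernel.expr)
    raise ValueError(f"unknown target: {target}")


def nvrtc_arch_flag(major, minor):
    """Build the NVRTC architecture option from a compute capability."""
    return f"--gpu-architecture=compute_{major}{minor}"


vec_add = ElementwiseKernel(name="vec_add", expr="a[i] + b[i]")
cuda_src = emit(vec_add, "cuda")
metal_src = emit(vec_add, "metal")
t4_flag = nvrtc_arch_flag(7, 5)  # a T4 reports compute capability 7.5
```

The generated strings would then be handed to the vendor runtime compiler (nvrtc, hiprtc, or Metal's library compiler). That step is omitted here because it needs the corresponding driver to be installed.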
Cold start — every GPU, every size
ZSE INT4 vs vLLM AWQ INT4, verified on Modal (T4, L4, A10G, A100), DigitalOcean (MI300X) and Apple M1.
VRAM — fits where others can't
Single-sequence throughput
On data-center GPUs ZSE matches or beats vLLM for single-sequence decode, while staying far leaner on memory.
Built-in retrieval (RAG)
- Hybrid retrieval: BM25 + TF-IDF + dense embeddings from mean-pooled LLM hidden states — no extra embedding model needed.
- Reciprocal Rank Fusion plus an LLM cross-encoder reranker.
- ZPF compressed document format — 25% fewer LLM tokens at 100% retrieval accuracy.
- A full PDF parser that handles encrypted documents (RC4 / AES-128 / AES-256), multi-column reflow and /ObjStm object streams, with an OCR hook.
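Reciprocal Rank Fusion, mentioned in the list above, merges several ranked lists by summing 1 / (k + rank) for each document across the lists. The sketch below shows the standard formulation with the commonly used constant k = 60. It is not taken from ZSE's code; the function name, the document IDs and the three example rankings are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists that
    contain it, with ranks starting at 1. Ties break by document ID
    so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of three retrievers for one query.
bm25_hits = ["a", "b", "c"]
tfidf_hits = ["b", "a", "d"]
dense_hits = ["b", "c", "a"]

fused = reciprocal_rank_fusion([bm25_hits, tfidf_hits, dense_hits])
print(fused)  # ['b', 'a', 'c', 'd']
```

Because only ranks are used, the retrievers' raw scores never need to be on a comparable scale, which is what makes the method a convenient way to combine lexical and dense results before a reranking pass.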
What ZSE doesn't beat yet
We believe in numbers, not marketing. ZSE does not yet beat vLLM on concurrent INT4 throughput at N≥4: vLLM's hand-tuned AWQ Marlin kernels reach memory-bandwidth ceilings that ZSE is still closing in on for NVIDIA (the gap is already down to 2.12× on AMD after a wave-64 bgemv rewrite). Full Apple-Silicon transformer inference and socket-restricted tensor parallelism are wired and validated at the kernel level, pending broader hardware runs.
If steady-state batched throughput is your only metric and you have ~50× the VRAM budget, use vLLM. If you care about cold start, footprint, vendor lock-in, or running on anything other than an H100 — use ZSE.
Validated across vendors
NVIDIA T4 (Turing), L4 (Ada), A10G (Ampere), A100 (40/80 GB), H100 / H200 (Hopper); AMD Instinct MI300X (CDNA3); and Apple M1 — all validated. ZSE is Apache-2.0 licensed and built in Nagercoil, India.