Ultra memory-efficient LLM inference engine. Pre-quantized .zse format, custom CUDA kernels, 55× faster cold starts. One file, zero setup, OpenAI-compatible API.
The .zse format packages model weights, tokenizer, and configuration into a single memory-mapped binary. No scattered files, no runtime quantization, no internet required.
- Magic bytes + JSON metadata (arch, config, tensor map)
- Embedded msgpack tokenizer — no external files needed
- Packed INT4 weights — 2 values per byte, mmap-ready
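The packed-INT4 layout can be illustrated in a few lines of Python. This is a sketch of the general scheme (two 4-bit values per byte), not the documented .zse byte layout; low-nibble-first ordering and zero-padding are assumptions here.

```python
def pack_int4(values):
    """Pack 4-bit integers (0..15) into bytes, two per byte.

    The even-indexed value goes in the low nibble (assumed order,
    not the actual .zse spec). Odd-length input is zero-padded.
    """
    if len(values) % 2:
        values = values + [0]
    return bytes(
        (values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
        for i in range(0, len(values), 2)
    )


def unpack_int4(data, count):
    """Recover `count` 4-bit values from packed bytes."""
    out = []
    for byte in data:
        out.append(byte & 0xF)
        out.append(byte >> 4)
    return out[:count]
```

Because the packed tensor region is a flat byte array, it can be served straight from an `mmap` of the .zse file with no copy or conversion at load time, which is where the fast cold starts come from.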
Triton-based INT4 inference kernels optimized for maximum VRAM efficiency, using 12–14% less memory than bitsandbytes.
Choose the ZSE Custom kernels for maximum memory savings, or the bitsandbytes backend for peak throughput. Switch per model.
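Conceptually, what an INT4 kernel does per weight is standard affine dequantization. The sketch below shows that mapping in isolation; the group size, scale layout, and fusion strategy of ZSE's actual kernels are not specified here.

```python
def dequantize_group(qvals, scale, zero_point):
    """Affine INT4 dequantization: w ≈ scale * (q - zero_point).

    Real kernels fuse this step into the matmul so the
    full-precision weights never materialize in VRAM.
    """
    return [scale * (q - zero_point) for q in qvals]
```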
Native QLoRA support since v1.4.0: fine-tune INT4 base models with tiny LoRA adapters (~25 MB). Train on consumer hardware.
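The ~25 MB adapter size follows from LoRA's parameter count alone. The arithmetic below is back-of-envelope under assumed illustrative settings (hidden size 3584 and 28 layers for a 7B-class model, rank 16, four targeted projections, fp16 storage), not ZSE's actual defaults.

```python
def lora_adapter_bytes(hidden, layers, rank, targets, bytes_per_param=2):
    """Estimate LoRA adapter size.

    Each targeted projection gains two low-rank matrices
    (hidden x rank and rank x hidden), i.e. 2 * hidden * rank params.
    """
    params = layers * targets * 2 * hidden * rank
    return params * bytes_per_param
```

With the assumed dimensions this lands at roughly 25.7 MB, consistent with the ~25 MB figure above.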
Drop-in replacement for /v1/chat/completions. Swap your endpoint URL and you're running local inference.
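Because the server speaks the standard `/v1/chat/completions` shape, any OpenAI-style client works once the base URL points at localhost. A stdlib-only sketch; the model name and port are placeholders:

```python
import json
import urllib.request


def build_chat_request(prompt, model="qwen-7b"):
    """Standard chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt, model="qwen-7b", base_url="http://localhost:8000/v1"):
    """POST a chat request to a local ZSE server and return the reply text."""
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(build_chat_request(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The official `openai` Python client works the same way: pass `base_url="http://localhost:8000/v1"` and any placeholder API key.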
Priority-based scheduler for high-throughput server deployments. Handle concurrent requests efficiently.
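Priority scheduling of this kind is typically a heap keyed on (priority, arrival order). A minimal sketch of the idea, not ZSE's internal scheduler:

```python
import heapq
import itertools


class PriorityScheduler:
    """Dispatch requests lowest-priority-number first, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker preserves arrival order

    def submit(self, request, priority=10):
        heapq.heappush(self._heap, (priority, next(self._order), request))

    def next_request(self):
        """Pop the highest-priority pending request, or None if idle."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```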
Models work fully offline — no network calls during load or inference. Ship models on USB drives if you want.
Benchmarks run on an H200 GPU in ZSE Custom kernel mode, March 2026.
| Model | ZSE VRAM | bnb VRAM | Δ VRAM | Throughput | Cold Start |
|---|---|---|---|---|---|
| Qwen 7B | 5.67 GB | 6.57 GB | −14% | 37.2 tok/s | 5.7s |
| Qwen 14B | 10.08 GB | 11.39 GB | −12% | 20.8 tok/s | 10.5s |
| Qwen 32B | 19.47 GB | 22.27 GB | −13% | 10.9 tok/s | 20.4s |
| Qwen 72B | 41.54 GB | 47.05 GB | −12% | 6.3 tok/s | 51.8s |
How ZSE stacks up against popular inference solutions.
| Feature | HuggingFace / bnb | vLLM | ZSE |
|---|---|---|---|
| Cold Start (32B) | ~120s | ~60s | 24s |
| File Format | Many files | Many files | Single .zse |
| VRAM (72B INT4) | ~48 GB | ~46 GB | 41.5 GB |
| Setup | Complex | Medium | pip install |
| Offline Mode | Variable | Good | Excellent |
| Fine-tuning | Separate tools | No | Built-in QLoRA |
Pull models, convert from HuggingFace, serve APIs, chat interactively, and monitor hardware — all from a single CLI.
| Command | Description |
|---|---|
| `zse pull <model>` | Download pre-converted models from ZLLM Hub |
| `zse convert <hf_model> -o model.zse` | Convert any HuggingFace model to .zse |
| `zse serve <model> --port 8000` | Launch an OpenAI-compatible API server |
| `zse chat <model>` | Interactive terminal chat with any model |
| `zse hardware` | Real-time GPU / VRAM diagnostics |
| `zse list` | Browse all available models on the hub |
ZSE automatically detects available VRAM and picks the optimal strategy for your tier: 8 GB, 12–16 GB, 24 GB, or 80 GB+.
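As an illustration of that selection, a sketch that picks the largest model whose measured ZSE footprint (the figures from the benchmark table above) fits in free VRAM. The headroom margin is an assumed knob, not a ZSE setting.

```python
# Measured ZSE VRAM footprints from the benchmark table (GB).
MODEL_VRAM_GB = [
    ("Qwen 7B", 5.67),
    ("Qwen 14B", 10.08),
    ("Qwen 32B", 19.47),
    ("Qwen 72B", 41.54),
]


def pick_model(free_vram_gb, headroom_gb=1.0):
    """Return the largest model that fits with headroom, or None."""
    best = None
    for name, needed in MODEL_VRAM_GB:
        if needed + headroom_gb <= free_vram_gb:
            best = name
    return best
```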
Install ZSE, pull a model, and start serving in under a minute. No cloud. No API keys. Just your GPU and a terminal.