Z Server Engine · v1.4.1

Run 72B models
on a single GPU.

Ultra memory-efficient LLM inference engine. Pre-quantized .zse format, custom CUDA kernels, 55× faster cold starts. One file, zero setup, OpenAI-compatible API.

$ pip install zllm-zse
zse v1.4.1
~ pip install zllm-zse
Successfully installed zllm-zse-1.4.1
~ zse pull qwen2.5-7b
⠋ Downloading qwen2.5-7b.zse (5.67 GB)...
✓ Model ready — single .zse file, zero setup
~ zse serve qwen2.5-7b --port 8000
✓ Cold start: 5.7s | VRAM: 5.67 GB
✓ OpenAI-compatible API → localhost:8000
41.5 GB
72B model in VRAM
5.7s
Cold start (7B)
55×
Faster than HuggingFace loading
14%
Less VRAM than bitsandbytes
Format

One file. Zero friction.

The .zse format packages model weights, tokenizer, and configuration into a single memory-mapped binary. No scattered files, no runtime quantization, no internet required.

Single-file distribution — weights + tokenizer + config
Group-wise INT4 quantization (group size 128, absmax)
Memory-mapped loading — 55× faster cold starts
Zero network calls — fully offline inference
Magic bytes ZSE\x00 header with JSON metadata
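The quantization scheme above (group-wise absmax, two INT4 values per byte) can be sketched in plain Python. This is an illustration of the technique, not the exact on-disk encoding: the nibble order and rounding behavior are assumptions.

```python
def quantize_group(values, qmax=7):
    """Symmetric absmax quantization of one group (e.g. 128 floats) to
    signed INT4. Scaling by absmax/qmax keeps every rounded value in
    [-7, 7], so no extra clamping is needed."""
    scale = max(abs(v) for v in values) / qmax or 1.0  # all-zero group -> scale 1.0
    return [round(v / scale) for v in values], scale

def pack_int4(quants):
    """Pack two signed INT4 values per byte, low nibble first (assumed order)."""
    out = bytearray()
    for lo, hi in zip(quants[0::2], quants[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def unpack_int4(data):
    """Inverse of pack_int4: sign-extend each nibble back to a signed int."""
    quants = []
    for byte in data:
        for nib in (byte & 0xF, byte >> 4):
            quants.append(nib - 16 if nib >= 8 else nib)
    return quants
```

Dequantization is just `q * scale` per group, so each group costs one extra float (the scale) on top of half a byte per weight.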
model.zse file structure
HEADER (~2 KB)

Magic bytes + JSON metadata (arch, config, tensor map)

TOKENIZER (~4 MB)

Embedded msgpack tokenizer — no external files needed

TENSOR DATA (~5.6 GB)

Packed INT4 weights — 2 values per byte, mmap-ready

Total: ~5.67 GB (Qwen 7B)
Apache 2.0
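Reading the header is straightforward. A minimal sketch, assuming the magic bytes are followed by a 4-byte little-endian length and then the JSON metadata — only the `ZSE\x00` magic and the JSON metadata itself are documented above; the length field is an assumed layout detail:

```python
import json
import struct

MAGIC = b"ZSE\x00"

def read_header(path):
    """Validate the .zse magic bytes and return the JSON metadata dict
    (arch, config, tensor map). Assumed layout: 4-byte magic, 4-byte
    little-endian metadata length, then UTF-8 JSON."""
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            raise ValueError("not a .zse file")
        (meta_len,) = struct.unpack("<I", f.read(4))
        return json.loads(f.read(meta_len))
```

The tensor data that follows the header never needs to pass through Python: because offsets are recorded in the tensor map, the packed weights can be `mmap`-ed directly, which is where the fast cold starts come from.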
Features

Built for real-world deployment.

Custom CUDA Kernels

Triton-based INT4 inference kernels optimized for maximum VRAM efficiency. 12–14% less memory than bitsandbytes.

Dual Backend System

Choose ZSE Custom kernels for max memory savings, or bitsandbytes backend for peak throughput. Switch per model.

QLoRA Fine-tuning

Native support since v1.4.0. Fine-tune INT4 base models with tiny LoRA adapters (~25 MB). Train on consumer hardware.

OpenAI-Compatible API

Drop-in replacement for /v1/chat/completions. Swap your endpoint URL and you're running local inference.
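Because the server speaks the standard chat-completions schema, a client needs nothing beyond the standard library. A minimal sketch — the endpoint and model name follow the `zse serve` example elsewhere on this page; any OpenAI-compatible client works the same way:

```python
import json
import urllib.request

def build_request(prompt, model="qwen2.5-7b"):
    """Standard OpenAI-style chat.completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt, base_url="http://localhost:8000/v1", model="qwen2.5-7b"):
    """POST to the local server's /v1/chat/completions and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_request(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```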

Continuous Batching

Priority-based scheduler for high-throughput server deployments. Handle concurrent requests efficiently.
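As a toy illustration of priority-based scheduling — not ZSE's actual scheduler — requests can be drained from a heap ordered by priority, FIFO within a level:

```python
import heapq
import itertools

class PriorityScheduler:
    """Toy request scheduler: lower priority number is served first;
    a monotonic sequence counter breaks ties, so requests at the same
    priority level are served in arrival (FIFO) order."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def submit(self, request, priority=1):
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def next_batch(self, max_size):
        """Pop up to max_size requests to form the next continuous batch."""
        batch = []
        while self._heap and len(batch) < max_size:
            batch.append(heapq.heappop(self._heap)[2])
        return batch
```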

Zero-Dependency Deploy

Models work fully offline — no network calls during load or inference. Ship models on USB drives if you want.

Benchmarks

Numbers don't lie.

Tested on an H200 GPU in ZSE Custom kernel mode, March 2026.

VRAM Usage (ZSE Custom Kernel)

Qwen 7B: 5.67 GB
Qwen 14B: 10.08 GB
Qwen 32B: 19.47 GB
Qwen 72B: 41.54 GB

Inference Speed

Qwen 7B: 37.2 tok/s
Qwen 14B: 20.8 tok/s
Qwen 32B: 10.9 tok/s
Qwen 72B: 6.3 tok/s
Model    | ZSE VRAM | bnb VRAM | Savings | Speed      | Cold Start
Qwen 7B  | 5.67 GB  | 6.57 GB  | 14%     | 37.2 tok/s | 5.7s
Qwen 14B | 10.08 GB | 11.39 GB | 12%     | 20.8 tok/s | 10.5s
Qwen 32B | 19.47 GB | 22.27 GB | 13%     | 10.9 tok/s | 20.4s
Qwen 72B | 41.54 GB | 47.05 GB | 12%     | 6.3 tok/s  | 51.8s

ZSE vs. the rest.

How ZSE stacks up against popular inference solutions.

Feature          | HuggingFace / bnb | vLLM       | ZSE
Cold Start (32B) | ~120s             | ~60s       | 20.4s
File Format      | Many files        | Many files | Single .zse
VRAM (72B INT4)  | ~48 GB            | ~46 GB     | 41.5 GB
Setup            | Complex           | Medium     | pip install
Offline Mode     | Variable          | Good       | Excellent
Fine-tuning      | Separate tools    | No         | Built-in QLoRA
CLI

One tool. Every workflow.

Pull models, convert from HuggingFace, serve APIs, chat interactively, and monitor hardware — all from a single CLI.

Supported Architectures
Qwen 2.5 · Qwen Coder · Llama 3 · Llama 3.1 · Mistral · Mixtral (MoE) · DeepSeek · TinyLlama
zse pull <model>

Download pre-converted models from ZLLM Hub

zse convert <hf_model> -o model.zse

Convert any supported HuggingFace model to .zse

zse serve <model> --port 8000

Launch OpenAI-compatible API server

zse chat <model>

Interactive terminal chat with any model

zse hardware

Real-time GPU / VRAM diagnostics

zse list

Browse all available models on the hub

Compatibility

Your GPU. Your model.

ZSE automatically detects available VRAM and picks the optimal strategy.

RTX 3070 / 4070 (8 GB): 7B Models
RTX 3080 / 4080 (12–16 GB): 14B Models
RTX 3090 / 4090 (24 GB): 32B Models
A100 / H100 / H200 (80 GB+): 72B Models
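These tiers reduce to a simple selection rule. A sketch using the measured footprints from the benchmark table — the model IDs and the 1 GB headroom are illustrative, not ZSE's actual detection logic:

```python
# Measured VRAM footprints from the benchmark table (GB, ZSE Custom kernel),
# largest first so the first fit is the biggest runnable model.
FOOTPRINTS = [
    ("qwen2.5-72b", 41.54),
    ("qwen2.5-32b", 19.47),
    ("qwen2.5-14b", 10.08),
    ("qwen2.5-7b", 5.67),
]

def largest_fitting_model(free_vram_gb, headroom_gb=1.0):
    """Return the largest model whose weights plus a KV-cache/activation
    headroom fit in the given free VRAM, or None if nothing fits."""
    for name, need_gb in FOOTPRINTS:
        if need_gb + headroom_gb <= free_vram_gb:
            return name
    return None
```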

Ready to run models at the edge?

Install ZSE, pull a model, and start serving in under a minute. No cloud. No API keys. Just your GPU and a terminal.

$ pip install zllm-zse && zse pull qwen2.5-7b && zse serve qwen2.5-7b