Our research focuses on making high-performance AI compute accessible to everyone — from custom kernel-level optimizations for sub-second inference to decentralized GPU architectures that eliminate vendor lock-in. We're building the foundations so any developer, anywhere, can deploy production AI without needing a hyperscaler budget.
2
Active Research Areas
4+
Hardware Targets
ZSE Kernel
Phase 1 Complete
We built .zse — our own model format — and a custom inference engine powered by hand-written Triton kernels. Kernels for INT4 matmul, fused attention, and quantization give us full control over memory layout and compute scheduling. A single .zse file packs pre-quantized INT4 weights, tokenizer, and model config into one binary. 11× faster loading, 32B models on 24GB consumer GPUs.
We designed .zse as a purpose-built binary format for LLM deployment. It stores pre-quantized INT4 weights alongside tokenizer and config in a single file — no re-quantization, no network calls, no separate config downloads. One file, instant load.
Qwen 7B loads in 3.9s with .zse vs 45.4s with standard loaders that re-quantize on every load. For serverless and CI/CD pipelines, this eliminates the startup tax that makes on-demand inference impractical.
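The exact on-disk layout isn't spelled out here, so as a hypothetical sketch of the single-file idea (the `MAGIC` constant, field order, and the `write_zse`/`read_zse` helpers are illustrative, not the real .zse spec):

```python
import io
import json
import struct

# Hypothetical single-file layout (illustrative, not the actual .zse format):
# [8-byte magic][uint64 config length][config JSON]
# [uint64 tokenizer length][tokenizer blob][INT4 weight pages...]
MAGIC = b"ZSEv1\x00\x00\x00"

def write_zse(buf, config: dict, tokenizer: bytes, weights: bytes):
    cfg = json.dumps(config).encode()
    buf.write(MAGIC)
    buf.write(struct.pack("<Q", len(cfg)))
    buf.write(cfg)
    buf.write(struct.pack("<Q", len(tokenizer)))
    buf.write(tokenizer)
    buf.write(weights)  # pre-quantized INT4 pages, ready to copy to the GPU

def read_zse(buf):
    assert buf.read(8) == MAGIC, "not a .zse file"
    (n,) = struct.unpack("<Q", buf.read(8))
    config = json.loads(buf.read(n))
    (n,) = struct.unpack("<Q", buf.read(8))
    tokenizer = buf.read(n)
    weights = buf.read()  # no re-quantization step: load time is bounded by I/O
    return config, tokenizer, weights
```

Because everything the runtime needs lives in one blob, a cold start is a single sequential read rather than downloads plus a quantization pass.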
Our INT4 inference engine uses custom Triton kernels for matrix multiplication with FP16 accumulation. Qwen 32B fits in 20.9GB — runnable on a single RTX 3090/4090 (24GB). No multi-GPU setups, no 80GB A100s needed.
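As a plain-NumPy reference for the math the kernel implements (symmetric 4-bit quantization with a zero-point of 8 and per-output-channel scales are assumptions for illustration; the real Triton kernel fuses unpack, dequantize, and matmul into one pass):

```python
import numpy as np

def pack_int4(q):
    # q: uint8 array of 4-bit values (0..15); pack two values per byte
    return (q[..., 0::2] | (q[..., 1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    lo = packed & 0x0F
    hi = packed >> 4
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = lo
    out[..., 1::2] = hi
    return out

def int4_matmul(x_fp16, packed_w, scales_fp16):
    # Dequantize: assumed symmetric 4-bit scheme, zero-point 8,
    # one FP16 scale per output channel.
    w = (unpack_int4(packed_w).astype(np.float16) - 8) * scales_fp16[:, None]
    # The fused kernel never materializes w; this reference just shows the math.
    return x_fp16 @ w.T
```

Packing halves memory traffic versus INT8 and quarters it versus FP16, which is why a 32B model fits in ~21GB.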
Qwen 7B generates at 58.7 tokens/sec on NVIDIA H200 with our custom Triton kernels. Qwen 32B achieves 26.9 tok/s — production-viable throughput on a single GPU.
Unlike GGUF/llama.cpp, which requires a separate C++ runtime, ZSE stays in the Python/HuggingFace ecosystem. Use transformers, PEFT, LangChain — your existing stack works. On Qwen 72B, .zse loads in 6.5s vs GGUF's 10.2s — 1.6× faster.
Built-in API server with `zse serve` exposes an OpenAI-compatible endpoint. Drop-in replacement for any client using the OpenAI SDK, LangChain, or standard HTTP. One command from model file to production API.
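Because the endpoint speaks the standard OpenAI chat-completions schema, any compatible client works unchanged. A minimal sketch of the request body (the base URL and model name below are placeholder assumptions, not documented `zse serve` defaults):

```python
import json

# Hypothetical server address; whatever host/port `zse serve` binds to goes here.
BASE_URL = "http://localhost:8000/v1"

def chat_request(model, user_message, max_tokens=256):
    # Standard OpenAI chat-completions body, accepted by any compatible server.
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

payload = chat_request("qwen-7b", "Hello!")
body = json.dumps(payload)  # POST this to f"{BASE_URL}/chat/completions"
```

The same shape means the official OpenAI SDK or LangChain can be pointed at the server just by overriding its base URL.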
3.9s
7B Cold Start
vs 45.4s bitsandbytes
58.7
7B Throughput
tokens/sec
20.9GB
32B VRAM
fits RTX 3090/4090
11×
Loading Speed
faster vs standard
A novel approach to distributed inference that disaggregates VRAM across heterogeneous GPU nodes. Instead of requiring a single monolithic GPU with 80GB+ VRAM, we slice model layers across commodity GPUs connected over high-bandwidth mesh — enabling anyone with consumer hardware to serve production-scale models.
VRAM Slicing — Decentralized Inference Architecture
│
├── Model Partitioner
│ ├── Layer-wise tensor sharding with cost-model optimization
│ ├── Activation checkpointing for memory-constrained nodes
│ └── Dynamic load balancing based on per-node VRAM availability
│
├── Distributed VRAM Pool
│ ├── Unified virtual address space across disjoint GPU memories
│ ├── RDMA-based zero-copy tensor transfer (InfiniBand / RoCEv2)
│ └── Page-fault-driven lazy migration for cold tensor segments
│
├── Pipeline Scheduler
│ ├── Micro-batch pipeline parallelism across node boundaries
│ ├── Bubble minimization via 1F1B (one-forward-one-backward) scheduling
│ └── Speculative prefetch of next-layer activations during compute
│
└── Mesh Topology Manager
├── Automatic peer discovery and bandwidth probing
├── Topology-aware placement — minimize cross-node transfers
└── Fault tolerance with hot-standby node failover
Model tensors are sharded across a cluster of heterogeneous GPUs (RTX 4090, A100, consumer cards) using a unified virtual memory abstraction. Each node contributes its available VRAM to a distributed pool — the scheduler treats the entire cluster as a single logical GPU with aggregate memory capacity.
Key Challenges
The model partitioner analyzes the computational graph and generates an optimal sharding plan using a cost model that accounts for per-layer FLOPs, activation sizes, and inter-node bandwidth. Unlike naive pipeline parallelism, our approach supports hybrid tensor + pipeline parallelism with automatic strategy selection.
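A toy version of that sharding planner, assuming contiguous layer ranges and VRAM-proportional targets (the real cost model also weighs per-layer FLOPs, activation sizes, and link bandwidth, which this sketch omits):

```python
def partition_layers(layer_bytes, node_vram):
    """Contiguous layer sharding proportional to each node's VRAM.

    A simplified stand-in for the cost-model planner: each node is assigned
    layers until it reaches its proportional share of total weight bytes,
    and the plan is rejected if any shard exceeds that node's VRAM.
    """
    total = sum(layer_bytes)
    plan, start = [], 0
    for i, vram in enumerate(node_vram):
        if i == len(node_vram) - 1:
            end = len(layer_bytes)                    # last node takes the tail
        else:
            target = total * vram / sum(node_vram)    # proportional byte budget
            acc, end = 0, start
            while end < len(layer_bytes) and acc + layer_bytes[end] <= target:
                acc += layer_bytes[end]
                end += 1
        shard = sum(layer_bytes[start:end])
        if shard > vram:
            raise ValueError(f"node {i}: shard {shard} exceeds VRAM {vram}")
        plan.append((start, end))
        start = end
    return plan
```

Even this greedy version shows why heterogeneity is workable: a 24GB card and an 80GB card simply receive differently sized contiguous layer ranges.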
Inter-node tensor communication uses RDMA (Remote Direct Memory Access) for zero-copy GPU-to-GPU transfers, bypassing CPU and kernel overhead entirely. For nodes connected via commodity Ethernet, we implement a custom transport layer with pipelining, compression, and overlap of compute and communication.
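The compute/communication overlap on Ethernet links can be sketched as a chunked pipeline. Here `zlib` stands in for whatever codec the transport actually uses, and the chunk size is an arbitrary illustration:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # 1 MiB chunks (illustrative)

def send_tensor(blob, send):
    # Later chunks compress on a worker thread while earlier compressed
    # chunks are handed to send(), so compression overlaps the wire time.
    chunks = [blob[i:i + CHUNK] for i in range(0, len(blob), CHUNK)]
    with ThreadPoolExecutor(max_workers=1) as pool:
        futures = [pool.submit(zlib.compress, c) for c in chunks]
        for fut in futures:
            send(fut.result())

def recv_tensor(packets):
    return b"".join(zlib.decompress(p) for p in packets)
```

The same chunking is what lets the receiver start dequantizing early packets while later ones are still in flight.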
Consumer GPU clusters are inherently unreliable — nodes join and leave, hardware fails, network partitions happen. The mesh orchestrator handles all of this transparently with automatic peer discovery, health monitoring, and hot-standby failover without dropping in-flight requests.
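A minimal sketch of the heartbeat-driven failover decision, with the timeout value and the node-swap policy as assumptions rather than the orchestrator's actual logic:

```python
import time

HEARTBEAT_TIMEOUT = 0.5  # seconds; chosen to match a sub-second failover target

class MeshMonitor:
    """Toy health monitor: promote the hot standby when the primary goes stale."""

    def __init__(self, primary, standby):
        self.active = primary
        self.standby = standby
        self.last_beat = {primary: time.monotonic(), standby: time.monotonic()}

    def heartbeat(self, node, now=None):
        self.last_beat[node] = time.monotonic() if now is None else now

    def route(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self.last_beat[self.active] > HEARTBEAT_TIMEOUT:
            # Primary is stale: fail over. In the real system, in-flight
            # requests would be replayed against the standby's warm shard.
            self.active, self.standby = self.standby, self.active
        return self.active
```

The key property is that routing decisions are local and cheap, so failover latency is bounded by the heartbeat timeout rather than by any cluster-wide consensus.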
Today, serving a 70B parameter model requires a single node with 140GB+ VRAM — in practice, two A100 80GBs in tensor parallel, at $30K+ per GPU. VRAM slicing changes the economics entirely: a cluster of 8× RTX 4090s (24GB each, ~$1,600 each) provides 192GB of aggregate VRAM at a fraction of the cost. This unlocks production-scale AI for startups, researchers, and anyone who doesn't have a hyperscaler budget.
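The arithmetic behind that claim, using the figures quoted above:

```python
# Back-of-envelope check of the numbers in the text (prices are the
# approximate figures quoted above, not live market quotes).
GPUS, VRAM_GB, PRICE_4090 = 8, 24, 1_600

aggregate_vram = GPUS * VRAM_GB     # 192 GB, headroom over 140GB+ of weights
cluster_cost = GPUS * PRICE_4090    # $12,800 for the whole mesh
a100_pair_cost = 2 * 30_000         # two A100 80GBs at the quoted $30K+ each
savings = a100_pair_cost / cluster_cost  # roughly 4.7x on hardware alone
```

Hardware is only part of the gap; power, hosting, and the ability to grow the pool incrementally push the effective advantage toward the 5–10× range claimed above.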
5–10×
Cost Reduction
vs single-node H100/A100
192GB+
Aggregate VRAM
8× consumer GPUs
180B+
Max Model Size
distributed across mesh
<1s
Failover Time
hot-standby recovery
Every piece of research we do feeds into one mission: make AI infrastructure so efficient and accessible that it stops being a competitive moat. When a solo developer in Nagercoil can deploy a 70B model with the same performance as a Silicon Valley company on A100s — that's when AI is truly democratized. We're not there yet, but every kernel we write gets us closer.