Research

Democratizing AI Infrastructure

Our research focuses on making high-performance AI compute accessible to everyone — from custom kernel-level optimizations for sub-second inference to decentralized GPU architectures that eliminate vendor lock-in. We're building the foundations so any developer, anywhere, can deploy production AI without needing a hyperscaler budget.

2

Active Research Areas

4+

Hardware Targets

ZSE Kernel

Phase 1 Complete

Research Area 01

ZSE — Custom Triton Kernels for Memory-Efficient LLM Inference

We built .zse — our own model format — and a custom inference engine powered by Triton kernels. Hand-written Triton kernels for INT4 matmul, fused attention, and quantization give us full control over memory layout and compute scheduling. A single .zse file packs pre-quantized INT4 weights, tokenizer, and model config into one binary. The result: 11× faster loading, and 32B models on 24GB consumer GPUs.

Research Roadmap

Phase 1 — Complete

.zse Format & Memory Optimization

  • .zse format — our custom binary model format that packs pre-quantized INT4 weights, tokenizer, and model config into a single file. Quantize once with `zse quantize`, load instantly every time
  • 11× faster loading vs standard pipelines — standard loaders re-quantize FP16→INT4 on every load (45.4s for Qwen 7B). .zse skips this entirely — 3.9s cold start
  • INT4 inference engine — custom Triton kernels for INT4 matrix multiplication with FP16 accumulation, fused attention, and quantization-aware compute. Full control over memory layout and scheduling
  • 32B models on 24GB consumer GPUs — Qwen 32B runs at 20.9GB VRAM on RTX 3090/4090, making large models accessible on commodity hardware
  • OpenAI-compatible API server — drop-in replacement with `zse serve`, works with LangChain, OpenAI SDK, and existing toolchains
  • NVIDIA CUDA backend — benchmarked on H200. Qwen 7B: 58.7 tok/s, 5.9GB VRAM. Qwen 32B: 26.9 tok/s, 20.9GB VRAM
  • AMD ROCm backend — HIP-based compute path for Radeon and Instinct GPUs

Phase 2 — In Progress

Intel & TPU Backend

  • Intel oneAPI backend — SYCL-based compute kernels targeting Xeon Scalable and Gaudi 2 accelerators via oneDNN primitives
  • Google TPU backend — XLA IR lowering with TPU-native sharding strategies and HBM bandwidth-optimized data layouts

Phase 3 — Upcoming

Adaptive Runtime

  • Dynamic batch scheduling with priority queues — latency-sensitive requests preempt throughput workloads without starvation
  • Speculative decoding pipeline — draft model generates candidate tokens verified in parallel by the full model, 2–3× generation speedup
  • Profile-guided kernel auto-tuning — runtime hardware profiling selects optimal tiling, block sizes, and memory access patterns per device
  • Quantization-aware mixed-precision engine — per-layer INT4/INT8/FP16/BF16 selection based on sensitivity analysis and hardware capability
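
The speculative decoding step planned for Phase 3 can be sketched in miniature. This is an illustrative greedy variant, not ZSE's implementation: `target` and `draft` are stand-ins for the full and draft models, and a real engine verifies all k draft tokens in a single batched forward pass rather than one at a time.

```python
from typing import Callable, List

def speculative_step(
    target: Callable[[List[int]], int],   # expensive model: next token given context
    draft: Callable[[List[int]], int],    # cheap draft model: next token given context
    context: List[int],
    k: int = 4,
) -> List[int]:
    """One draft-then-verify round of greedy speculative decoding."""
    # Draft phase: the cheap model proposes k candidate tokens sequentially.
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # Verify phase: keep the longest matching prefix; the target's first
    # disagreement replaces the bad draft token and ends the round.
    accepted, ctx = [], list(context)
    for t in proposed:
        v = target(ctx)
        if v != t:
            accepted.append(v)            # target's correction
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target(ctx))          # bonus token when every draft is accepted
    return accepted
```

When the draft model agrees with the target most of the time, each round yields up to k+1 tokens for one (batched) target pass — the source of the quoted 2–3× speedup.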

.zse — Our Model Format

We designed .zse as a purpose-built binary format for LLM deployment. It stores pre-quantized INT4 weights alongside tokenizer and config in a single file — no re-quantization, no network calls, no separate config downloads. One file, instant load.
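
The internals of .zse aren't spelled out here, so the following is a hypothetical sketch of the single-file idea only: a magic number, fixed-width section lengths, then config, tokenizer, and weight bytes back to back. The `MAGIC` value and field layout are invented for illustration and are not the real format.

```python
import json
import struct

MAGIC = b"ZSE1"  # hypothetical magic bytes, not the actual .zse header

def write_zse(path, config: dict, tokenizer: bytes, weights: bytes) -> None:
    """Pack config, tokenizer, and pre-quantized weights into one file."""
    cfg = json.dumps(config).encode()
    with open(path, "wb") as f:
        f.write(MAGIC)
        # Fixed-width section lengths let a loader seek straight to the
        # weights without parsing anything else first.
        f.write(struct.pack("<QQQ", len(cfg), len(tokenizer), len(weights)))
        f.write(cfg)
        f.write(tokenizer)
        f.write(weights)

def read_zse(path):
    """Read all three sections back in a single pass — no network, no re-quantization."""
    with open(path, "rb") as f:
        assert f.read(4) == MAGIC, "not a .zse file"
        n_cfg, n_tok, n_w = struct.unpack("<QQQ", f.read(24))
        config = json.loads(f.read(n_cfg))
        tokenizer = f.read(n_tok)
        weights = f.read(n_w)
    return config, tokenizer, weights
```

The point of the layout is the cold-start path: one `open`, three reads, no transformation of the weight bytes.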

11× Faster Cold Starts

Qwen 7B loads in 3.9s with .zse vs 45.4s with standard loaders that re-quantize on every load. For serverless and CI/CD pipelines, this eliminates the startup tax that makes on-demand inference impractical.

Memory-Efficient INT4 Inference

Our INT4 inference engine uses custom Triton kernels for matrix multiplication with FP16 accumulation. Qwen 32B fits in 20.9GB — runnable on a single RTX 3090/4090 (24GB). No multi-GPU setups, no 80GB A100s needed.
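
The storage half of this trick — two signed 4-bit values per byte, expanded to FP16 only at compute time — can be shown in plain NumPy. This is not the Triton kernel itself, just the packing scheme such a kernel operates on; on the GPU the unpack would be fused directly into the matmul.

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack signed INT4 values in [-8, 7] two per byte (flat, even-length array)."""
    u = (q.astype(np.int8) & 0x0F).astype(np.uint8)   # two's-complement nibbles
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    q = np.empty(packed.size * 2, dtype=np.int8)
    q[0::2], q[1::2] = lo, hi
    return np.where(q > 7, q - 16, q)                 # sign-extend each nibble

def dequant(packed: np.ndarray, scale: np.float16) -> np.ndarray:
    # Weights stay INT4 at rest; FP16 values exist only transiently at compute time.
    return unpack_int4(packed).astype(np.float16) * scale
```

Halving every weight byte is what lets a 32B model fit in 20.9GB; FP16 accumulation then keeps the matmul numerically stable despite the 4-bit operands.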

58.7 tok/s Throughput

Qwen 7B generates at 58.7 tokens/sec on NVIDIA H200 with our custom Triton kernels. Qwen 32B achieves 26.9 tok/s — production-viable throughput on a single GPU.

Python-Native Stack

Unlike GGUF/llama.cpp, which require a separate C++ runtime, ZSE stays in the Python/HuggingFace ecosystem. Use transformers, PEFT, LangChain — your existing stack works. On Qwen 72B, .zse loads in 6.5s vs 10.2s for GGUF — 1.6× faster.

OpenAI-Compatible Serving

Built-in API server with `zse serve` exposes an OpenAI-compatible endpoint. Drop-in replacement for any client using the OpenAI SDK, LangChain, or standard HTTP. One command from model file to production API.
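
As a sketch of what a client sees, the request below targets the standard /chat/completions route. The host, port, and model name are placeholders for illustration, not documented `zse serve` defaults — any OpenAI-style client would build the same request.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000/v1"   # placeholder; use whatever `zse serve` reports

def chat_request(model: str, user_msg: str) -> request.Request:
    """Build a standard OpenAI-style /chat/completions request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    }).encode()
    return request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("qwen-7b", "Hello!")          # hypothetical model name
# resp = request.urlopen(req)                    # uncomment with a running server
```

Because the wire format is the standard one, swapping the `base_url` in the OpenAI SDK or LangChain is the entire migration.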

ZSE v1.2.0 — Verified on NVIDIA H200

3.9s

7B Cold Start

vs 45.4s bitsandbytes

58.7

7B Throughput

tokens/sec

20.9GB

32B VRAM

fits RTX 3090/4090

11×

Loading Speed

faster vs standard

Research Area 02

VRAM Slicing for Decentralized GPU Computing

A novel approach to distributed inference that disaggregates VRAM across heterogeneous GPU nodes. Instead of requiring a single monolithic GPU with 80GB+ VRAM, we slice model layers across commodity GPUs connected over high-bandwidth mesh — enabling anyone with consumer hardware to serve production-scale models.

vram-slicer — architecture overview

```
## VRAM Slicing — Decentralized Inference Architecture
├── Model Partitioner
│   ├── Layer-wise tensor sharding with cost-model optimization
│   ├── Activation checkpointing for memory-constrained nodes
│   └── Dynamic load balancing based on per-node VRAM availability
├── Distributed VRAM Pool
│   ├── Unified virtual address space across disjoint GPU memories
│   ├── RDMA-based zero-copy tensor transfer (InfiniBand / RoCEv2)
│   └── Page-fault-driven lazy migration for cold tensor segments
├── Pipeline Scheduler
│   ├── Micro-batch pipeline parallelism across node boundaries
│   ├── Bubble minimization via 1F1B (one-forward-one-backward) scheduling
│   └── Speculative prefetch of next-layer activations during compute
└── Mesh Topology Manager
    ├── Automatic peer discovery and bandwidth probing
    ├── Topology-aware placement — minimize cross-node transfers
    └── Fault tolerance with hot-standby node failover
```

Disaggregated VRAM Pooling

Model tensors are sharded across a cluster of heterogeneous GPUs (RTX 4090, A100, consumer cards) using a unified virtual memory abstraction. Each node contributes its available VRAM to a distributed pool — the scheduler treats the entire cluster as a single logical GPU with aggregate memory capacity.

Key Challenges

  • Unified virtual address space via custom page table mapping across PCIe/NVLink/InfiniBand boundaries
  • Coherence protocol for mutable state (optimizer moments, KV-cache) across non-uniform memory tiers
  • Bandwidth-aware tensor placement — hot layers on local VRAM, cold weights on remote nodes with prefetching

Layer-Wise Tensor Sharding

The model partitioner analyzes the computational graph and generates an optimal sharding plan using a cost model that accounts for per-layer FLOPs, activation sizes, and inter-node bandwidth. Unlike naive pipeline parallelism, our approach supports hybrid tensor + pipeline parallelism with automatic strategy selection.

Key Challenges

  • Cost-model-driven partitioning — minimizes critical path latency by balancing compute and communication
  • Activation checkpointing at shard boundaries — recompute vs. transfer trade-off per memory budget
  • Support for heterogeneous compute — slower nodes handle smaller shards proportional to their throughput
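
In its simplest form, the cost-model idea reduces to splitting the layer sequence into contiguous shards sized by node throughput. The sketch below is a toy version under that assumption — a real planner also weighs activation sizes and inter-node bandwidth, as described above.

```python
def partition_layers(layer_costs, node_speeds):
    """Split per-layer compute costs into contiguous shards, one per node,
    sized proportionally to each node's relative throughput.

    Returns a list of (start, end) layer ranges, end-exclusive.
    """
    total_cost = sum(layer_costs)
    total_speed = sum(node_speeds)
    # Each node's target share of total compute, proportional to its speed.
    targets = [total_cost * s / total_speed for s in node_speeds]

    shards, start, acc = [], 0, 0.0
    for i, cost in enumerate(layer_costs):
        acc += cost
        # Close the current shard once its node has its fair share of work,
        # leaving at least one shard for every remaining node.
        if acc >= targets[len(shards)] and len(shards) < len(node_speeds) - 1:
            shards.append((start, i + 1))
            start, acc = i + 1, 0.0
    shards.append((start, len(layer_costs)))
    return shards
```

Giving a 3×-faster node three quarters of the layers is exactly the "slower nodes handle smaller shards" behavior noted above.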

Zero-Copy Tensor Transfer

Inter-node tensor communication uses RDMA (Remote Direct Memory Access) for zero-copy GPU-to-GPU transfers, bypassing CPU and kernel overhead entirely. For nodes connected via commodity Ethernet, we implement a custom transport layer with pipelining, compression, and overlap of compute and communication.

Key Challenges

  • GPUDirect RDMA for InfiniBand/RoCEv2 clusters — sub-microsecond tensor transfers between nodes
  • Adaptive compression — 4-bit quantization of activation tensors during transfer, full precision for weights
  • Compute-communication overlap — next-layer prefetch during current-layer execution hides transfer latency
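
The adaptive-compression point can be illustrated with a symmetric 4-bit round trip in NumPy. This shows the trade-off only, not the transport layer: the quantization is lossy, which is why activations (which tolerate small errors) are compressed while weights travel at full precision.

```python
import numpy as np

def compress_activations(x: np.ndarray):
    """Symmetric 4-bit quantization of a flat, even-length activation tensor.
    Returns (packed bytes, scale) — a 4× size reduction vs FP16 on the wire."""
    scale = float(np.abs(x).max()) / 7.0 or 1.0       # map max |x| to INT4 range
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    u = (q & 0x0F).astype(np.uint8)
    packed = u[0::2] | (u[1::2] << 4)                 # two 4-bit values per byte
    return packed.tobytes(), scale

def decompress_activations(data: bytes, scale: float) -> np.ndarray:
    b = np.frombuffer(data, dtype=np.uint8)
    q = np.empty(b.size * 2, dtype=np.int8)
    q[0::2] = (b & 0x0F).astype(np.int8)
    q[1::2] = ((b >> 4) & 0x0F).astype(np.int8)
    q = np.where(q > 7, q - 16, q)                    # sign-extend each nibble
    return q.astype(np.float32) * scale
```

The round-trip error is bounded by half a quantization step (scale/2), which is the error budget the receiving layer must absorb.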

Fault-Tolerant Mesh Orchestration

Consumer GPU clusters are inherently unreliable — nodes join and leave, hardware fails, network partitions happen. The mesh orchestrator handles all of this transparently with automatic peer discovery, health monitoring, and hot-standby failover without dropping in-flight requests.

Key Challenges

  • Gossip-based peer discovery with SWIM failure detector — sub-second fault detection across the mesh
  • Checkpoint-free recovery — redundant tensor replicas on standby nodes enable instant failover
  • Dynamic re-sharding — when nodes join or leave, the partitioner rebalances without full model reload
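
As a toy illustration of the failure-detection idea: the sketch below is a bare heartbeat-timeout detector with SWIM's alive/suspect/dead states. Real SWIM adds indirect probes (ping-req through peers) and gossip dissemination, both omitted here.

```python
import time

class FailureDetector:
    """Minimal heartbeat-timeout detector — a stand-in for SWIM's probe protocol."""

    def __init__(self, suspect_after: float = 0.5, dead_after: float = 1.0):
        self.suspect_after = suspect_after   # seconds without heartbeat -> suspect
        self.dead_after = dead_after         # seconds without heartbeat -> dead
        self.last_seen = {}

    def heartbeat(self, node: str, now: float = None) -> None:
        """Record a heartbeat; `now` is injectable for deterministic testing."""
        self.last_seen[node] = time.monotonic() if now is None else now

    def status(self, node: str, now: float = None) -> str:
        now = time.monotonic() if now is None else now
        age = now - self.last_seen.get(node, float("-inf"))
        if age < self.suspect_after:
            return "alive"
        return "suspect" if age < self.dead_after else "dead"
```

The intermediate "suspect" state is what lets the orchestrator pre-warm a hot standby before declaring the node dead, keeping failover under the quoted sub-second bound.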

Why This Matters

Today, serving a 70B-parameter model at FP16 requires 140GB+ of VRAM — at least two A100 80GBs in tensor parallel, each costing tens of thousands of dollars. VRAM slicing changes the economics entirely: a cluster of 8× RTX 4090s (24GB each, ~$1,600 each) provides 192GB of aggregate VRAM at a fraction of the cost. This unlocks production-scale AI for startups, researchers, and anyone without a hyperscaler budget.

5–10×

Cost Reduction

vs single-node H100/A100

192GB+

Aggregate VRAM

8× consumer GPUs

180B+

Max Model Size

distributed across mesh

<1s

Failover Time

hot-standby recovery

The End Goal

Every piece of research we do feeds into one mission: make AI infrastructure so efficient and accessible that it stops being a competitive moat. When a solo developer in Nagercoil can deploy a 70B model with the same performance as a Silicon Valley company on A100s — that's when AI is truly democratized. We're not there yet, but every kernel we write gets us closer.