January 20, 2026

Running 284B AI Models on Your Own Hardware: The DS4 Project and Local Inference

How antirez's DwarfStar engine, DeepSeek V4 Flash's MoE architecture, and aggressive quantization are making frontier-level AI accessible outside the cloud — and what it means for developers who care about privacy, cost, and control.

Running a 284-billion-parameter language model on a laptop would have seemed absurd just two years ago. Today, it’s a reality — and it’s changing how we think about AI infrastructure.

The breakthrough isn’t a single invention. It’s the convergence of three things: Mixture-of-Experts architectures that activate only a fraction of parameters per token, quantization techniques that compress models to a fraction of their original size, and inference engines like DwarfStar (DS4) that are purpose-built to squeeze every ounce of performance out of consumer hardware.

This is about taking AI out of the cloud and putting it where it belongs: on your own machines.

The Model: DeepSeek V4 Flash

DeepSeek V4 Flash is a Mixture-of-Experts language model with 284 billion total parameters. But here’s what makes it special: only about 13 billion are activated per token during inference.

This distinction — between total parameters and active parameters — is the key to understanding why MoE models can punch far above their weight. Every parameter needs to be loaded into memory, but only the activated experts need to be computed. That’s why V4 Flash can deliver frontier-level reasoning and instruction-following capability while being dramatically more efficient than dense models of comparable size.
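
To make the total-versus-active distinction concrete, here is a minimal sketch using the two figures quoted above. The "2 FLOPs per active weight" rule is a common approximation, not something taken from the model's documentation, and the function names are made up for illustration.

```python
# Figures from the text: 284B total parameters, about 13B active per token.
TOTAL_PARAMS = 284e9
ACTIVE_PARAMS = 13e9

def active_fraction(total: float, active: float) -> float:
    """Share of the weights that take part in computing one token."""
    return active / total

def flops_per_token(active: float) -> float:
    """Rule-of-thumb forward-pass cost: roughly 2 FLOPs per active weight (an approximation)."""
    return 2.0 * active

print(f"active fraction: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.1%}")
print(f"approx. GFLOPs per token (MoE): {flops_per_token(ACTIVE_PARAMS) / 1e9:.0f}")
print(f"approx. GFLOPs per token (dense model of equal size): {flops_per_token(TOTAL_PARAMS) / 1e9:.0f}")
```

Under these assumptions, fewer than 5% of the weights do work on any given token, even though all of them have to be stored.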

The model comes in two variants: Flash (the efficiency-optimized version) and PRO (the higher-capability version that requires even more memory). Both use Multi-head Latent Attention (MLA) to compress KV cache significantly — up to 93% reduction compared to standard attention mechanisms. This makes long-context inference practical. V4 Flash supports up to 1 million tokens of context.
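
A rough sketch of why KV cache compression matters at this context length. The layer count, head count, and head size below are hypothetical placeholders chosen only to show the arithmetic; they are not V4 Flash's real dimensions. The only number taken from the text is the "up to 93%" reduction.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Uncompressed KV cache size: one key and one value vector per layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Hypothetical model shape, for illustration only (not V4 Flash's real dimensions).
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128

baseline = kv_cache_bytes(1_000_000, LAYERS, KV_HEADS, HEAD_DIM)
compressed = baseline * (1 - 0.93)  # the "up to 93%" reduction quoted in the text

print(f"baseline at 1M tokens:   {baseline / 1e9:.1f} GB")
print(f"after a 93% reduction:   {compressed / 1e9:.1f} GB")
```

With this made-up shape, an uncompressed cache for a million tokens would be larger than most machines' memory, while the compressed one is a size that RAM or a fast SSD can plausibly hold.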

It’s released under the MIT license. That means you can run it, modify it, and deploy it commercially; the only condition is that the copyright and license notice are preserved. The weights are open, so anyone can inspect, quantize, and redistribute them.

The Engine: DwarfStar (DS4)

Enter DwarfStar — or DS4 — a native inference engine created by Salvatore Sanfilippo, better known as antirez, the creator of Redis. Released in May 2026, it currently has 17.5k GitHub stars.

DS4 is intentionally narrow. It’s not a general-purpose GGUF runner like llama.cpp. It’s not a wrapper around another runtime. It’s a completely self-contained, pure-C inference engine optimized specifically for DeepSeek V4 Flash (and PRO). The reasoning is straightforward: new models are released continuously, and attention immediately shifts to whichever model is next in line to be implemented. DS4 takes a different approach: one model at a time, done properly end to end.

The project is developed with strong assistance from GPT 5.5, and antirez is open about it: “If you are not happy with AI-developed code, this software is not for you.”

What makes DS4 different

DS4 handles everything end to end: model loading, prompt rendering, tool calling, KV state management (both in RAM and on-disk), a server API, and even an integrated coding agent. It’s not just about running the model — it’s about making the model feel finished.

The engine supports three backends:

Metal (macOS) — the primary target, optimized for Apple Silicon
CUDA (Linux) — with special care for NVIDIA’s DGX Spark
ROCm (Linux) — for AMD Strix Halo systems

There’s also a CPU path, but antirez notes a macOS bug that causes kernel crashes with CPU inference: “each time you have to restart the computer, which is not funny.”

The KV cache is a disk citizen now

One of DS4’s most interesting innovations is its treatment of the KV cache. The project’s philosophy is that compressed KV caches combined with fast SSDs should change our assumption that KV cache belongs exclusively in RAM. On modern MacBooks with fast SSDs, the KV cache becomes a first-class disk citizen. This fundamentally changes the mental model: RAM goes from being a hard cutoff (can I run this model or not?) to a continuous spectrum of speed levels.

Quantization: The 7:1 Compression

DeepSeek V4 Flash in FP16 weighs in at approximately 568 GB. That’s far beyond any consumer machine. DS4 solves this with selective (mixed-precision) quantization.
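
The arithmetic behind that figure, and behind the 7:1 ratio in this section's title, is short enough to write down. This is a sketch that counts weights only and uses decimal gigabytes; the 81 GB value is the Q2-IMatrix file size listed in this section's table.

```python
def weights_gb(params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9

PARAMS = 284e9  # total parameters, from the text

fp16_gb = weights_gb(PARAMS, 16)
print(f"FP16: {fp16_gb:.0f} GB")                            # 568 GB
print(f"ratio vs the ~81 GB file: {fp16_gb / 81:.1f}:1")      # about 7:1
print(f"average bits per weight at 81 GB: {81e9 * 8 / PARAMS:.2f}")
```

The last line shows that an 81 GB file works out to a little over 2 bits per weight on average, which is only possible if most of the model is stored at very low precision.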

Standard quantization applies the same bit reduction uniformly across all layers. DS4 takes a different approach: not all layers contribute equally to model quality. High-sensitivity layers (early attention, output-proximal layers, specific MLP components) are kept at higher precision, while many intermediate layers tolerate much heavier compression.

The result for V4 Flash:

Format	Size	Best for
Q4-IMatrix	~256 GB	256GB+ RAM machines
Q2-Q4-IMatrix	~96 GB	96-128 GB RAM machines
Q2-IMatrix	~81 GB	96-128 GB RAM machines
PRO Q2	~256 GB	512 GB RAM machines

The 2-bit quantization isn’t a joke — it behaves well under coding agents and tool calling. The quantization is asymmetrical: only the routed MoE experts are quantized down to IQ2_XXS / Q2_K, while shared experts, projections, and routing layers are left at higher precision. This is where the quality preservation comes from.
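
A back-of-the-envelope sketch of how a mixed-precision file ends up near 2 bits per weight overall. The 97% share of weights in routed experts is a hypothetical number picked for illustration, and the bit widths are approximate values for IQ2_XXS and Q8_0; the real per-tensor choices in the DS4 files may differ.

```python
def effective_bits(routed_fraction: float, routed_bits: float, other_bits: float) -> float:
    """Average bits per weight when routed experts and all other tensors use different precisions."""
    return routed_fraction * routed_bits + (1 - routed_fraction) * other_bits

def file_size_gb(params: float, bits: float) -> float:
    """Weights-only size in decimal gigabytes."""
    return params * bits / 8 / 1e9

# Assumed values, for illustration only.
ROUTED_FRACTION = 0.97   # hypothetical share of weights that live in routed experts
ROUTED_BITS = 2.06       # roughly IQ2_XXS
OTHER_BITS = 8.5         # roughly Q8_0

bits = effective_bits(ROUTED_FRACTION, ROUTED_BITS, OTHER_BITS)
print(f"effective bits per weight: {bits:.2f}")
print(f"estimated size for 284B weights: {file_size_gb(284e9, bits):.0f} GB")
```

The point of the exercise: because routed experts dominate the parameter count, keeping everything else at high precision costs very little in file size.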

Hardware Requirements: What You Actually Need

The barrier to entry for local inference has shifted dramatically. Here’s what the landscape looks like in 2026:

MacBook Pro M3/M4 Max (128 GB unified memory)

This is the sweet spot. The Q2-IMatrix quantization at ~81 GB leaves about 47 GB for the OS, KV cache, and inference overhead. On a MacBook Pro M3 Max with 128 GB, DS4 achieves roughly 26-27 tokens/second generation speed with a short prompt and 21-25 tokens/second with long-context prompts (11k+ tokens).

The M5 Max pushes this to 34 tokens/second on short prompts.
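
One way to put these numbers in context is a simple bandwidth ceiling: if every active weight has to be read from memory once per token, memory bandwidth caps generation speed. The sketch below assumes a nominal 400 GB/s for a top-end M3 Max and an average of 2.28 bits per weight (81 GB spread over 284B weights). Both are simplifying assumptions, not measurements.

```python
def max_tokens_per_second(bandwidth_gb_s: float, active_params: float, bits_per_weight: float) -> float:
    """Upper bound on generation speed if every active weight is read from memory once per token."""
    gb_per_token = active_params * bits_per_weight / 8 / 1e9
    return bandwidth_gb_s / gb_per_token

ACTIVE_PARAMS = 13e9     # active parameters per token, from the text
AVG_BITS = 2.28          # assumption: 81 GB file averaged over 284B weights
BANDWIDTH_GB_S = 400     # assumption: nominal memory bandwidth of a top-end M3 Max

bound = max_tokens_per_second(BANDWIDTH_GB_S, ACTIVE_PARAMS, AVG_BITS)
print(f"bandwidth-bound ceiling: {bound:.0f} tokens/s")
```

Measured speeds sit well below this ceiling, which is expected: the tensors used on every token are stored at higher precision than the average, and attention, KV cache reads, and kernel overhead all add cost. The ceiling is still useful for comparing machines, since it scales directly with memory bandwidth.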

Mac Studio with Ultra-class chips (192-512 GB unified memory)

For those with access to Ultra-class chips, the Q4 quantization becomes viable. A Mac Studio M3 Ultra with 512 GB runs Q4 at 35-37 tokens/second and even PRO Q2 at roughly 9.5 tokens/second — slow, but functional for inspection and occasional work.

NVIDIA DGX Spark (128 GB unified memory)

NVIDIA’s compact personal AI supercomputer is built around the GB10 Grace Blackwell Superchip. It achieves around 13-14 tokens/second with V4 Flash Q2, roughly half the M3 Max figure above, which suggests that token generation here is limited by memory bandwidth rather than raw compute. The Blackwell GPU’s fast tensor operations matter most for compute-bound work such as long prefills, and native CUDA support means wider framework compatibility.

AMD Strix Halo

The new AMD Strix Halo platform (used in systems like the Framework Desktop) offers unified memory designs similar to Apple Silicon. DS4 supports ROCm for these systems, though benchmarks are still emerging.

SSD Streaming: Running Models Larger Than RAM

Here’s where it gets interesting. DS4 has an SSD streaming mode that lets you run models larger than your available RAM. The non-routed model weights stay resident in memory, while routed MoE experts are kept in an in-memory cache and loaded from the GGUF file on cache misses.

Modern Mac SSDs are fast enough to make cache misses tolerable. Long prefills can still be fast; generation is more sensitive to cache misses because every new token routes through experts again.
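
The expert cache described above can be pictured as a least-recently-used cache with a byte budget. The sketch below is a generic LRU in Python, not DS4's actual implementation (which is written in C and may use a different eviction policy); `ExpertCache` and its `load` callback are hypothetical names, and `load` stands in for a read from the GGUF file.

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache of expert weights with a byte budget. `load` stands in for a disk read."""

    def __init__(self, capacity_bytes, load):
        self.capacity_bytes = capacity_bytes
        self.load = load                  # callable: expert_id -> bytes (hypothetical loader)
        self.entries = OrderedDict()      # expert_id -> weights, least recently used first
        self.used_bytes = 0
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.entries:
            self.hits += 1
            self.entries.move_to_end(expert_id)   # mark as most recently used
            return self.entries[expert_id]
        self.misses += 1
        weights = self.load(expert_id)            # cache miss: go to disk
        # Evict least recently used experts until the new one fits.
        while self.entries and self.used_bytes + len(weights) > self.capacity_bytes:
            _, evicted = self.entries.popitem(last=False)
            self.used_bytes -= len(evicted)
        self.entries[expert_id] = weights
        self.used_bytes += len(weights)
        return weights

# Toy run: 4-byte "experts", with room for two of them.
cache = ExpertCache(capacity_bytes=8, load=lambda i: bytes([i]) * 4)
for expert_id in [0, 1, 0, 2, 1]:
    cache.get(expert_id)
print(f"hits={cache.hits} misses={cache.misses}")
```

The toy run also shows why generation is the sensitive case: each token asks for a fresh set of experts, so a routing pattern that keeps cycling through more experts than the cache holds turns most lookups into disk reads.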

On a 64 GB MacBook, you can run the Q2 Flash GGUF with SSD streaming and a 32 GB expert cache:

./download_model.sh q2-imatrix
./ds4 -m ./ds4flash.gguf \
  --ssd-streaming \
  --ssd-streaming-cache-experts 32GB \
  --ctx 32768 \
  --nothink

This turns the question from “can I run this model?” into “how fast will it run?”

Distributed Inference: Combining Multiple Machines

For the truly ambitious, DS4 supports distributed inference across multiple machines. You can run the full PRO Q4 quantization across two 512 GB Mac Studios by splitting transformer layers: one machine handles layers 0-30, the other handles layers 31 through output.

The prefill path is pipelined — on two M5 Max machines connected by Thunderbolt 5, a 63k-token prompt saw a 1.85x speedup in prefill. Generation is strictly autoregressive, so distributed generation is actually slower than single-machine (due to cross-machine activation hops per token), but the capacity to run larger models is the win.
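
Both halves of that tradeoff fall out of a simple pipeline model. The sketch below assumes perfectly balanced stages and ignores transfer time between machines; the 61-layer count and the 12 prompt chunks are hypothetical values chosen for illustration, and the function names are not from DS4.

```python
def split_layers(num_layers: int, num_machines: int) -> list[range]:
    """Assign contiguous blocks of layers to machines, as evenly as possible."""
    base, extra = divmod(num_layers, num_machines)
    blocks, start = [], 0
    for m in range(num_machines):
        size = base + (1 if m < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

def ideal_prefill_speedup(num_chunks: int, num_stages: int) -> float:
    """Speedup of a balanced pipeline over one machine running every stage, ignoring transfer cost."""
    sequential_steps = num_chunks * num_stages
    pipelined_steps = num_chunks + num_stages - 1
    return sequential_steps / pipelined_steps

print(split_layers(61, 2))                      # 61 layers is a hypothetical count
print(f"{ideal_prefill_speedup(12, 2):.2f}x")   # prompt split into 12 chunks (hypothetical)
print(f"{ideal_prefill_speedup(1, 2):.2f}x")    # one token at a time: nothing to overlap
```

With many prompt chunks in flight, two machines approach a 2x prefill speedup. With a single token per step, as in generation, the ideal speedup is 1x before any network cost is counted, so real distributed generation ends up slower than a single machine.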

A More Accessible Alternative: Qwen3.6-35B-A3B

Not everyone has 96 GB of RAM. If you’re working with more modest hardware, the Qwen3.6-35B-A3B model is worth considering.

Released in April 2026 under Apache 2.0, this is a 35B total parameter MoE model that activates only 3B per token. It features a hybrid architecture combining Gated DeltaNet, MoE, and Gated Attention — and it’s genuinely impressive for its size.

SWE-bench Verified: 75.0% — beating Qwen3.5-27B and Gemma4-31B.

Hardware requirements for Qwen3.6-35B-A3B

Precision	VRAM/RAM Needed	Hardware
Q4_K_M	~18-20 GB	24GB GPU (RTX 3090/4090)
IQ4_XS	~14-16 GB	16GB GPU + KV cache optimization
Q8_K_XL	~36 GB	48GB GPU or CPU+GPU hybrid
CPU-only	~70 GB	64-128GB RAM desktop

This model runs comfortably on a 24GB RTX 3090/4090 with Q4 quantization. On CPU-only machines with large RAM (64-128 GB DDR5), it’s viable though slower. Apple Silicon machines with 64-128 GB of unified memory, such as a Mac Studio or a Max-class MacBook Pro, also handle it well.

The model is available on Ollama (ollama run qwen3.6:35b-a3b), in GGUF format for llama.cpp, and via vLLM for serving. Community quantized versions offer various tradeoffs — the IQ4_XS quantization by oamazonasgabriel is specifically designed for 24GB VRAM.
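
Once the model is pulled, Ollama also exposes a local HTTP API, which is usually how you would call it from your own code. The sketch below uses only the Python standard library and assumes Ollama is running on its default port (11434); the model tag is the one given above, and the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Prepare a non-streaming generate request for a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text. Requires a running Ollama server."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

A call would look like `generate("qwen3.6:35b-a3b", "Explain mixture-of-experts in one sentence.")`. Nothing in this exchange leaves the machine, which is the privacy argument in practice.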

The Tooling Ecosystem

DS4 is one player in a rich local inference landscape. Here’s how it fits:

llama.cpp — The foundation. 119k GitHub stars. Plain C/C++, no dependencies, supports quantization from 1.5-bit to 8-bit, runs on everything from Apple Silicon to RISC-V. DS4 itself exists because of the path opened by llama.cpp and GGML.

Ollama — The most accessible entry point. One command, model management, API server. Supports GGUF via llama.cpp backend. Great for getting started quickly.

vLLM — High-throughput serving for production. PagedAttention, continuous batching, tensor parallelism. Best when you need to serve many concurrent requests.

Unsloth — Fast fine-tuning and inference with a web UI. Supports GGUF export, 2x faster training, 70% less VRAM.

The choice depends on your use case: llama.cpp for maximum compatibility, Ollama for simplicity, vLLM for production serving, DS4 for DeepSeek-specific optimization.

Why Run AI Locally?

The benefits are compelling:

Cost savings. API costs for frontier models add up quickly. With heavy, sustained use, especially long-context reasoning sessions, cumulative API spend can exceed the price of the hardware needed to run the model locally.

Data privacy. Sensitive data stays on your device. No network calls to external APIs. Critical for industries handling confidential information.

Reduced latency. No round-trip to a cloud server. For interactive applications, this matters.

Full control. No rate limits. No model changes dictated by a provider. You own the deployment.

Offline capability. Works without internet. Useful for travel, secure environments, or just reliability.
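
The cost argument in particular is easy to check against your own usage. Here is a minimal break-even sketch; every number in it is a hypothetical placeholder rather than a quote from any vendor, so plug in your own hardware price, token volume, and API rate.

```python
def breakeven_days(hardware_cost: float, tokens_per_day: float, price_per_million: float,
                   power_cost_per_day: float = 0.0) -> float:
    """Days of use after which local hardware costs less than paying per token."""
    api_cost_per_day = tokens_per_day / 1e6 * price_per_million
    daily_saving = api_cost_per_day - power_cost_per_day
    if daily_saving <= 0:
        return float("inf")   # at this usage level the API stays cheaper
    return hardware_cost / daily_saving

# All figures below are hypothetical placeholders.
days = breakeven_days(hardware_cost=5000, tokens_per_day=20e6, price_per_million=1.0,
                      power_cost_per_day=0.5)
print(f"break-even after about {days:.0f} days")
```

The result depends almost entirely on volume: agentic coding workloads that burn tens of millions of tokens a day pay off quickly, while occasional chat use may never break even.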

The Tradeoffs

It’s not all wins. There are real tradeoffs:

Hardware requirements. Without SSD streaming, the DS4 project starts at 96 GB of RAM for the Flash model. That’s a MacBook Pro at the top memory tier, or a Mac Studio. Not exactly budget hardware.

Performance tradeoffs. Quantization introduces quality loss. The selective approach preserves most capability, but you’ll notice degradation in very long-context retrieval, precise numerical reasoning, and some creative writing nuance.

Model lock-in. DS4 only runs the GGUF files published for this project. It’s not a general GGUF loader. If a better open-weight model is released, DS4 may switch or drop support. As antirez puts it: “the project is strictly opportunistic.”

SSD wear. Streaming weights from disk is read-only and causes little wear, but keeping KV state on disk means repeated writes, and writes are what consume an SSD’s endurance. The tradeoff is worth it for most users, but it’s worth knowing.

Getting Started

If you want to try local inference right now, here’s a practical path:

Start small. Run Qwen3-8B or DeepSeek-R1 8B on any modern GPU or even CPU. Get comfortable with the tooling.
Move to mid-range. An RTX 3090/4090 (24GB) or M-series Mac with 32-48GB can run 32B-class models comfortably with Q4 quantization.
Go big or go home. 96-128 GB of unified memory opens the door to DeepSeek V4 Flash via DS4. This is where the frontier capability meets local execution.
Experiment with quantization. Try different quantization levels and measure the quality tradeoffs for your specific use case. What matters most for your workflow determines the right balance.

The DS4 project itself is straightforward to set up:

git clone https://github.com/antirez/ds4
cd ds4
make  # macOS Metal
# or
make cuda-generic  # Linux CUDA
./download_model.sh q2-imatrix
./ds4

The project is beta quality — antirez is transparent about this. But it’s usable, actively developed, and represents one of the most ambitious approaches to local inference I’ve seen.

What This Means for the Industry

The ability to run frontier-class models locally changes the economics of AI. When the model is open-weight, the inference engine is purpose-built, and the hardware is consumer-grade, the cloud becomes a choice rather than a requirement.

The MoE revolution is accelerating this shift. Models like DeepSeek V4 Flash (284B total, 13B active), Qwen3.6-35B-A3B (35B total, 3B active), and gpt-oss (20B variant on consumer GPUs) show that total parameter count no longer dictates per-token compute. Every weight still has to be stored somewhere, but only the active experts are computed for each token.

Better quantization methods (Unsloth Dynamic 2.0, IQ4_XS, APEX) achieve near-lossless quality at 4-bit. Apple Silicon’s unified memory architecture enables running 70B+ models on a single machine. CPU-only inference via llama.cpp improvements makes large models runnable on RAM-only systems.

We’re entering an era where “local first” is a viable strategy, not just a privacy ideal. The hardware is here. The models are here. The inference engines are here. The question is no longer whether you can run AI locally — it’s whether you should.