Experimental — dotLLM is a learning project, not production-ready yet. Expect incomplete features, rough edges, and breaking changes.

dotLLM

High-performance LLM inference engine written natively in C#/.NET. Not a wrapper — a ground-up implementation.

What is dotLLM?

An open-source LLM inference engine built from the ground up in C# targeting .NET 10. It runs transformer-based models with SIMD-optimized CPU and CUDA GPU backends — no Python, no wrappers, no bindings to external runtimes.

It's also a learning project: a place to explore how modern inference engines work end to end, from GGUF loading and quantized matmul kernels to paged KV-cache and speculative decoding.

And it's an ongoing experiment in AI-assisted coding — Claude Code, Gemini, Codex — to see how far modern coding agents can push a complex systems project when driven by a human who knows the domain. Not vibe-coding.

01 · overview

Features

Quantization

FP16, Q8_0, Q4_0/1, Q5_0/1, Q4_K, Q5_K, Q6_K (incl. Q4_K_M) — with fused dequantize-and-accumulate matmul kernels operating directly on quantized blocks. Every SIMD path has a verified scalar fallback.
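The fused pattern is easiest to see in scalar form. A self-contained sketch follows (block layout simplified from real Q8_0, names illustrative, not the engine's actual kernel): quantized bytes are multiplied directly against the activations, and the scale is applied once per 32-element block instead of dequantizing every element first.

```csharp
using System;

// Illustrative scalar sketch of fused dequantize-and-accumulate (Q8_0-style):
// each block stores 32 signed bytes plus one float scale; the dot product
// multiplies quantized values directly and applies the scale once per block.
const int BlockSize = 32;

float DotQ8(float[] scales, sbyte[] quants, float[] x)
{
    float acc = 0f;
    for (int b = 0; b < scales.Length; b++)
    {
        float blockAcc = 0f;
        for (int i = 0; i < BlockSize; i++)
            blockAcc += quants[b * BlockSize + i] * x[b * BlockSize + i];
        acc += scales[b] * blockAcc;   // dequantize once per block, not per element
    }
    return acc;
}

// Quantize a toy weight row and check the fused product against full precision.
var w = new float[64];
var x = new float[64];
var rng = new Random(42);
for (int i = 0; i < 64; i++) { w[i] = (float)rng.NextDouble() - 0.5f; x[i] = (float)rng.NextDouble(); }

var scales = new float[2];
var quants = new sbyte[64];
for (int b = 0; b < 2; b++)
{
    float maxAbs = 0f;
    for (int i = 0; i < BlockSize; i++)
        maxAbs = Math.Max(maxAbs, Math.Abs(w[b * BlockSize + i]));
    scales[b] = maxAbs / 127f;                       // symmetric int8 scale
    for (int i = 0; i < BlockSize; i++)
        quants[b * BlockSize + i] = (sbyte)Math.Round(w[b * BlockSize + i] / scales[b]);
}

float fused = DotQ8(scales, quants, x);
float exact = 0f;
for (int i = 0; i < 64; i++) exact += w[i] * x[i];
Console.WriteLine($"fused={fused:F4} exact={exact:F4}");
```

The SIMD kernels vectorize the inner loop, but the accumulation structure is the same, which is what makes a scalar fallback a meaningful correctness reference.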

Tool Calling

Parallel function calling with model-specific parsers for Llama, Hermes, Mistral, and a Generic fallback. JSON-schema-constrained arguments via structured output — no malformed tool calls, no retries.

Structured Output

FSM/PDA-based logit masking applied at every decoding step. Guarantees syntactically valid JSON, JSON Schema, regex, or GBNF grammar output — no retries, no post-processing — via IDecodingConstraint.
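The masking idea in miniature (toy grammar and vocabulary, not dotLLM's actual IDecodingConstraint API): a state machine sets the logits of illegal tokens to negative infinity before each decoding step, so only grammatical continuations can ever be chosen, even when the model prefers something else.

```csharp
using System;
using System.Linq;

// Toy sketch of constraint-based logit masking. A tiny state machine only
// permits tokens that keep the output inside the grammar "{" <digit>* "}".
string[] vocab = { "{", "}", "7", "x" };

// allowed[state] = token ids that are legal from that state
int[][] allowed =
{
    new[] { 0 },        // state 0: expect "{"
    new[] { 1, 2 },     // state 1: digit or closing "}"
};
int[,] next = { { 1, -1, -1, -1 }, { -1, 2, 1, -1 } };  // -1 = illegal, 2 = accepting

int state = 0;
var output = "";
float[][] fakeLogits =
{
    new[] { 0.1f, 0.9f, 0.3f, 2.0f },  // model prefers "x" here -- masked away
    new[] { 0.2f, 0.1f, 1.5f, 0.9f },
    new[] { 0.0f, 2.5f, 1.0f, 0.4f },
};

foreach (var logits in fakeLogits)
{
    var masked = (float[])logits.Clone();
    for (int t = 0; t < masked.Length; t++)
        if (!allowed[state].Contains(t))
            masked[t] = float.NegativeInfinity;  // illegal tokens can never be sampled

    int tok = Array.IndexOf(masked, masked.Max());
    output += vocab[tok];
    state = next[state, tok];
    if (state == 2) break;                       // accepting state reached
}

Console.WriteLine(output);  // "{7}" -- grammatical despite the model's preferences
```

A PDA-backed constraint works the same way, with a stack added to the state so nested structures (JSON objects, grammars) can be tracked.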

Paged KV-Cache

Block-allocated cache with reference counting and copy-on-write — eliminates fragmentation for long contexts. Q8_0/Q4_0 KV quantization with a mixed-precision window extends effective context 2–4×.
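The reference-counting and copy-on-write mechanics can be sketched in a few lines (a toy model with integer block ids, not the engine's real allocator): forking a sequence only bumps refcounts, and a physical copy happens the first time a shared block is written.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of block-level refcounting with copy-on-write.
var refCount = new Dictionary<int, int>();
int nextBlock = 0;

int Alloc() { refCount[nextBlock] = 1; return nextBlock++; }

List<int> Fork(List<int> seq)
{
    foreach (var b in seq) refCount[b]++;   // share every block, no copy yet
    return new List<int>(seq);
}

void WriteBlock(List<int> seq, int idx)
{
    int b = seq[idx];
    if (refCount[b] > 1)                    // shared: copy-on-write
    {
        refCount[b]--;
        seq[idx] = Alloc();                 // only this sequence sees the copy
    }
}

var parent = new List<int> { Alloc(), Alloc() };  // blocks 0, 1
var child  = Fork(parent);                        // refcounts 0:2, 1:2
WriteBlock(child, 1);                             // block 1 copied to block 2

Console.WriteLine(string.Join(",", parent));      // 0,1
Console.WriteLine(string.Join(",", child));       // 0,2
Console.WriteLine(refCount[0]);                   // 2 (prefix still shared)
```

Because allocation is per fixed-size block rather than per contiguous context, long sequences cannot fragment the cache the way variable-length slabs do.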

CPU/GPU Hybrid

Run models larger than your VRAM by splitting layers between GPU and CPU. Pass --gpu-layers N and dotLLM runs the first N transformer layers on CUDA and the rest on the CPU backend, with automatic tensor transfer at the layer boundary.

Composable Samplers

Build sampling pipelines from independent ISamplerStep components — temperature, top-k, top-p, min-p, repetition/frequency/presence penalties. Steps are reorderable and individually testable.

Prompt Caching

Hash-based prefix cache for multi-turn conversations — the matching prefix skips prefill and only new tokens are processed. LRU eviction, configurable session count, and dramatically lower time-to-first-token on chat.
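The core idea, sketched (illustrative, not dotLLM's actual cache code): compare the new prompt's token ids against the cached ones and run prefill only on the tail.

```csharp
using System;
using System.Collections.Generic;

// Toy prefix-cache lookup: the matched prefix reuses cached KV entries,
// and only the remaining tokens go through prefill.
int CommonPrefix(IReadOnlyList<int> cached, IReadOnlyList<int> prompt)
{
    int n = Math.Min(cached.Count, prompt.Count);
    int i = 0;
    while (i < n && cached[i] == prompt[i]) i++;
    return i;
}

var cachedTokens = new[] { 1, 5, 9, 2, 7 };        // tokens from the previous turn
var newPrompt    = new[] { 1, 5, 9, 2, 7, 3, 8 };  // same conversation + new user turn

int reuse = CommonPrefix(cachedTokens, newPrompt);
Console.WriteLine($"reuse {reuse} cached tokens, prefill {newPrompt.Length - reuse}");
```

In a multi-turn chat the shared prefix is the entire prior conversation, which is why time-to-first-token drops so sharply after the first turn.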

Speculative Decoding

A smaller draft model proposes candidate tokens; the target model verifies them all in a single forward pass. Automatic KV-cache rollback on mismatch, constraint-aware throughout — identical output, fewer forward passes.

Continuous Batching (planned)

Iteration-level scheduling with dynamic request arrival — new prompts will join the batch without waiting. Priority queues and preemption policies to maximize GPU utilization across concurrent requests. Landing in Phase 9.

First-class SAE Support (planned)

Load pre-trained sparse autoencoders, attach to any layer, and decompose activations into interpretable features — plus feature steering and ablation, all inside the same inference pipeline. Built on the Phase 7 hook system.

Multi-GPU (planned)

Tensor parallelism via NCCL with explicit DevicePlacement on every tensor operation. The abstractions are baked into the core from day one, ready for the implementation to land.

02 · performance

Built for Performance

100% .NET

Pure C# pipeline

Tokenizer, sampler, scheduler, kernels — all C#. No Python, no foreign runtime, no llama.cpp wrapper. The entire inference path runs on .NET 10 with full debugger and profiler support. Just dotnet run.

AVX · CUDA

SIMD and GPU kernels

AVX2/AVX-512 vectorized CPU kernels via System.Runtime.Intrinsics and TensorPrimitives. CUDA backend through PTX kernels loaded via the Driver API — no native shared library to build.
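A simplified portable-SIMD version of the idea, using System.Numerics.Vector&lt;float&gt; from the BCL rather than the raw AVX intrinsics the engine uses; note the scalar tail, which mirrors the verified scalar fallback every SIMD path keeps.

```csharp
using System;
using System.Numerics;

// Portable-SIMD dot product: process full vector lanes, then finish the
// remainder with the scalar fallback path.
float Dot(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    var accV = Vector<float>.Zero;
    int i = 0;
    int lanes = Vector<float>.Count;
    for (; i <= a.Length - lanes; i += lanes)
        accV += new Vector<float>(a.Slice(i, lanes)) * new Vector<float>(b.Slice(i, lanes));

    float acc = Vector.Dot(accV, Vector<float>.One);  // horizontal sum of the lanes
    for (; i < a.Length; i++)                         // scalar tail / fallback
        acc += a[i] * b[i];
    return acc;
}

var x = new float[37];
var y = new float[37];
for (int i = 0; i < 37; i++) { x[i] = i; y[i] = 2f; }
Console.WriteLine(Dot(x, y));  // 2 * (0 + 1 + ... + 36) = 1332
```

The deliberately odd length (37) exercises both the vectorized body and the scalar tail, the same split the engine's kernels have to get right.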

0 alloc

Zero GC pressure

All tensor data lives in NativeMemory.AlignedAlloc with 64-byte alignment for AVX-512. Zero managed allocations on the hot path; Server GC runs in SustainedLowLatency mode during generation.
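The allocation pattern can be sketched as follows (requires `<AllowUnsafeBlocks>`; sizes here are arbitrary): tensor storage comes from 64-byte-aligned native memory and is exposed to the rest of the code as a Span&lt;float&gt;, so the hot path creates no GC-tracked objects.

```csharp
using System;
using System.Runtime.InteropServices;

// Aligned native allocation wrapped in a Span<float>. Nothing here is
// visible to the GC; the buffer must be freed explicitly.
nuint remainder;
float last;
unsafe
{
    int count = 1024;
    float* buf = (float*)NativeMemory.AlignedAlloc((nuint)(count * sizeof(float)), 64);
    remainder = (nuint)buf % 64;            // AlignedAlloc guarantees this is 0
    var span = new Span<float>(buf, count);
    span.Fill(1.5f);
    last = span[count - 1];
    NativeMemory.AlignedFree(buf);          // no GC: free explicitly
}
Console.WriteLine((remainder, last));
```

The 64-byte alignment matters because AVX-512 loads are fastest (and on some paths only valid) when operands sit on 64-byte boundaries.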

mmap

Memory-mapped GGUF

Models load via MemoryMappedFile with OS demand-paging — a 7 GB weights file becomes available in milliseconds. The OS page cache shares physical pages across multiple processes for free.
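A minimal illustration of the mechanism (toy file and magic bytes, not real GGUF parsing): the mapping is created almost instantly, and bytes are paged in by the OS only when they are actually touched.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

// Map a file and read a slice without copying the whole file into managed
// memory. The file name and contents here are illustrative.
string path = Path.Combine(Path.GetTempPath(), "weights.bin");
File.WriteAllBytes(path, new byte[] { 0x47, 0x47, 0x55, 0x46 });  // "GGUF" magic

using var mmf  = MemoryMappedFile.CreateFromFile(path, FileMode.Open);
using var view = mmf.CreateViewAccessor(0, 4, MemoryMappedFileAccess.Read);

var magic = new byte[4];
view.ReadArray(0, magic, 0, 4);           // only these pages get faulted in
Console.WriteLine(Encoding.ASCII.GetString(magic));  // GGUF
```

With a real multi-gigabyte weights file, the same pattern is what makes "loaded in milliseconds" possible: no bytes move until a tensor is first read.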

~50 ms

Native AOT (experimental)

Trimming-safe single-file publish via .NET Native AOT: source-generated JSON, annotated [DynamicallyAccessedMembers], rd.xml for preserved types. Startup drops from ~500 ms JIT to around 50 ms.

no cold start

Warm-up pipeline

Configurable WarmupOptions trigger a full JIT pre-compilation pass and pre-load CUDA kernels before traffic arrives. Readiness probes gate on completion — no first-request latency spike.

03 · usage

How to use

Two ways to ship it: run the standalone server, or embed the engine inside your own .NET app.

Option 1

Standalone ASP.NET server

Start the OpenAI-compatible API and a built-in browser chat in one command. Drop-in for any tool that already speaks OpenAI — no Docker Compose, no Open WebUI sidecar, no extra services.

  • OpenAI endpoints: /v1/chat/completions, /v1/completions, /v1/models, /v1/tokenize
  • Streaming via Server-Sent Events
  • Tool/function calling with JSON-schema-constrained arguments
  • Jinja2-subset chat template interpreter
  • Logprobs with OpenAI-compatible top_logprobs
  • Built-in browser chat UI at / (interactive stub tool calls) — pass --no-ui for API-only hosting
  • Hybrid CPU/GPU via --device gpu --gpu-layers N
$ dotllm serve llama-3.2-3b.Q4_K_M.gguf \
    --device gpu \
    --gpu-layers 32 \
    --port 8080

→ Loaded model (mmap, 1.9 GB)
→ CUDA backend ready (device 0)
→ Prompt cache enabled (4 sessions)
→ API listening on http://localhost:8080
→ Opening chat UI in browser...
dotllm serve → open browser, chat with streaming responses
Option 2

From your .NET app

Reference the NuGet packages, load a GGUF model, and stream tokens — or mount the OpenAI-compatible endpoints inside your own ASP.NET host.

Stream tokens from a console app

Open a GGUF file, build a TransformerModel and tokenizer, and let TextGenerator stream tokens as they are sampled.

using DotLLM.Engine;
using DotLLM.Models.Architectures;
using DotLLM.Models.Gguf;
using DotLLM.Tokenizers.Bpe;

using var gguf = GgufFile.Open("llama-3.2-1b.Q4_K_M.gguf");
var config = GgufModelConfigExtractor.Extract(gguf.Metadata);
using var model = TransformerModel.LoadFromGguf(gguf, config);
var tokenizer = GgufBpeTokenizerFactory.Load(gguf.Metadata);

var generator = new TextGenerator(model, tokenizer);

await foreach (var text in generator.GenerateStreamingAsync(
    "Explain dotLLM.",
    new InferenceOptions { MaxTokens = 128 }))
{
    Console.Write(text);
}

Host the OpenAI API in your ASP.NET app

Build a ServerOptions, let ServerStartup load the model, then call MapDotLLMEndpoints() on the web app.

using DotLLM.Server;

var options = new ServerOptions
{
    Model = "llama-3.2-3b.Q4_K_M.gguf",
    Device = "gpu",
    GpuLayers = 32,
    Port = 8080,
    PromptCacheEnabled = true,
    UsePaged = true,
};

using var state = ServerStartup.LoadModel(options.Model, options);
var app = ServerStartup.BuildApp(state, args: [], serveUi: true);

await app.RunAsync($"http://localhost:{options.Port}");
04 · deeper dive

More on features

Deeper dives into the sampling pipeline, speculative decoding, logprobs visualization, and the upcoming interpretability stack.

Composable Sampling Pipeline

Build your sampler from independent, reorderable steps. Each ISamplerStep transforms logits — chain them in any order, add custom steps, or constrain output to a schema.

  • Temperature, top-k, top-p, min-p, repetition / frequency / presence penalties
  • Structured output via IDecodingConstraint
  • Custom stop conditions with IStopCondition
var sampler = new SamplerPipeline(
    new TemperatureStep(0.8f),
    new TopKStep(40),
    new TopPStep(0.95f),
    new MinPStep(0.05f)
);

var options = new InferenceOptions
{
    Sampler = sampler,
    StopConditions = [new EosStop()],
    MaxTokens = 512
};
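What a single step like TopPStep does internally, as a self-contained sketch (independent of dotLLM's actual ISamplerStep signature): keep the smallest set of tokens whose cumulative probability reaches p, and mask everything else out of the logits.

```csharp
using System;
using System.Linq;

// Nucleus (top-p) filtering over raw logits: softmax, sort by probability,
// keep tokens until the cumulative mass reaches p, mask the rest.
float[] Softmax(float[] logits)
{
    float max = logits.Max();
    var exp = logits.Select(l => MathF.Exp(l - max)).ToArray();
    float sum = exp.Sum();
    return exp.Select(e => e / sum).ToArray();
}

float[] ApplyTopP(float[] logits, float p)
{
    var probs = Softmax(logits);
    var order = Enumerable.Range(0, logits.Length)
                          .OrderByDescending(i => probs[i]).ToArray();
    var result = Enumerable.Repeat(float.NegativeInfinity, logits.Length).ToArray();
    float cum = 0f;
    foreach (var i in order)
    {
        result[i] = logits[i];   // token survives the nucleus
        cum += probs[i];
        if (cum >= p) break;     // nucleus complete
    }
    return result;
}

var filtered = ApplyTopP(new[] { 3.0f, 2.0f, 0.1f, -1.0f }, p: 0.9f);
Console.WriteLine(string.Join(" ",
    filtered.Select(v => float.IsNegativeInfinity(v) ? "-" : v.ToString("F1"))));
```

Because each step is a pure logits-to-logits transform like this one, steps compose in any order and can be unit-tested in isolation.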

Speculative Decoding

A small, fast draft model proposes candidate tokens; the target model verifies them all in a single forward pass, accepting correct predictions and rolling back mismatches via KV-cache rollback. Same output, fewer forward passes.

  • Draft-verify-accept with modified rejection sampling
  • Constraint state rollback via IDecodingConstraint.Clone()
  • Configurable speculation depth
  • Best gains on long-form generation
dotllm serve llama-3.1-70b.Q4_K_M.gguf \
  --speculative-model llama-3.2-1b.Q8_0.gguf \
  --speculative-k 5 \
  --device gpu \
  --port 8080
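The draft-verify-accept loop, reduced to a greedy toy (the real scheme verifies against full distributions with modified rejection sampling; both model functions below are stand-ins): the target checks all k proposed tokens at once, accepts the matching prefix, and substitutes its own token at the first mismatch.

```csharp
using System;
using System.Collections.Generic;

// Greedy sketch of speculative decoding with two toy "models".
Func<List<int>, int> targetNext = ctx => (ctx[^1] * 2) % 7;  // stand-in target model
Func<List<int>, int> draftNext  = ctx => ctx[^1] switch      // stand-in draft model,
{                                                            // often agrees with target
    1 => 2, 2 => 4, _ => 0
};

var context = new List<int> { 1 };
int k = 4;  // speculation depth

// 1. draft proposes k tokens autoregressively
var draft = new List<int>();
var scratch = new List<int>(context);
for (int i = 0; i < k; i++) { draft.Add(draftNext(scratch)); scratch.Add(draft[^1]); }

// 2. target verifies the whole proposal "in one pass"
var verify = new List<int>(context);
int accepted = 0;
foreach (var tok in draft)
{
    int expect = targetNext(verify);
    if (tok != expect) { verify.Add(expect); break; }  // mismatch: keep target token,
    verify.Add(tok);                                   // discard the rest of the draft
    accepted++;
}

Console.WriteLine($"accepted {accepted}/{k} draft tokens");
```

In the engine, the "discard the rest" step is where KV-cache rollback and IDecodingConstraint.Clone() come in: both the cache and the constraint state must rewind to the last accepted token.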

Logprobs & confidence

OpenAI-compatible logprobs and top_logprobs on every chat completion. The built-in Chat UI uses them to render each generated token in a confidence color — you can see at a glance where the model was certain, where it was guessing, and where sampling noticeably changed the output.

dotLLM Chat UI — logprobs visualization with color-coded token confidence
Chat UI with logprobs — colors show per-token confidence
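The mapping from logprob to color is simply exp(logprob) bucketed into confidence bands; the thresholds below are illustrative, not the UI's actual values.

```csharp
using System;

// exp(logprob) is the token's probability under the model; bucket it into
// a confidence band for display (thresholds are made up for illustration).
string Confidence(double logprob)
{
    double p = Math.Exp(logprob);
    return p > 0.8 ? "high" : p > 0.4 ? "medium" : "low";
}

Console.WriteLine(Confidence(-0.05));  // p ~ 0.95 -> high
Console.WriteLine(Confidence(-0.7));   // p ~ 0.50 -> medium
Console.WriteLine(Confidence(-2.3));   // p ~ 0.10 -> low
```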

Interpretability Hooks (planned)

Zero-cost hook points throughout the model. Capture activations at any layer or run logit lens projections, with no overhead when disabled — hooks sit behind a simple null-check guard rather than the .NET event pattern, so an unused hook costs a single branch. Landing in Phase 7.

  • Per-layer activation capture via IInferenceHook
  • Logit lens and tuned lens projection
  • Configurable HookPoint locations
var hook = new ActivationCaptureHook(
    HookPoint.AfterAttention,
    layers: [0, 12, 24]
);

await foreach (var token in model
    .GenerateAsync(prompt, hooks: [hook]))
{
    Console.Write(token);
}

// Inspect captured activations
var attn = hook.Activations[12];

Sparse Autoencoders (planned)

First-class SAE support for mechanistic interpretability. Load a trained sparse autoencoder, attach it to any layer, and decompose activations into interpretable features — all within the same inference pipeline. Built on top of the hook system (Phase 7).

  • Load SAE weights from standard formats
  • Attach to any layer via the hook system
  • Decompose activations into sparse features
  • Feature steering and ablation studies
using var sae = await SparseAutoencoder
    .LoadAsync("sae-llama-layer12.safetensors");

var hook = new SaeHook(sae, layer: 12);

await foreach (var token in model
    .GenerateAsync(prompt, hooks: [hook]))
{
    Console.Write(token);
}

// Top activated features
foreach (var f in hook.TopFeatures(k: 10))
    Console.WriteLine($"  #{f.Index}: {f.Activation:F3}");
05 · model support

Supported Models & Architectures

A single parameterized TransformerBlock handles multiple model families via ModelConfig. This list is kept in sync with the engine roadmap (✓ shipped, ◷ planned).

Model families

  • Llama (1 / 2 / 3 / 3.2)
  • Mistral (sliding-window)
  • Phi
  • Qwen (2 / 2.5 / 3)
  • DeepSeek V2 / V3 (needs MLA)
  • SmolLM3 (NoPE + YARN)
  • Gemma 4
  • Mixture-of-Experts

Architecture features

  • GQA / MHA / MQA
  • Tiled (flash-style) attention
  • Sliding-window attention
  • RoPE position encoding
  • BPE · SentencePiece · HF tokenizer.json
  • Jinja2-subset chat templates
  • MLA (multi-head latent)
  • ALiBi position encoding

Quantization

  • FP16, Q8_0
  • Q4_0 / Q4_1, Q5_0 / Q5_1
  • Q4_K, Q5_K, Q6_K (incl. Q4_K_M)
  • KV-cache: Q8_0 / Q4_0 + mixed-precision window
  • Quantized paged KV-cache
  • Runtime quantization (FP16 → Q4_K_M)

Serving & tooling

  • OpenAI-compatible API + built-in Chat UI
  • Streaming (SSE)
  • Tool calling (Llama / Hermes / Mistral / Generic)
  • Constrained decoding (JSON / Schema / Regex / CFG)
  • Paged KV-cache + simple prompt cache
  • Speculative decoding
  • Logprobs (OpenAI-compatible)
  • Warm-up · Native AOT (experimental)
  • Continuous batching & cross-request prefix sharing
  • Rate limiting · OpenTelemetry metrics & tracing
  • LoRA adapters
  • Inference hooks · logit lens · SAE
  • Multi-GPU (tensor / pipeline parallelism)
06 · internals

Heavily documented

A layered architecture with pluggable backends, backed by deep per-subsystem docs, agent-reviewed pull requests, and a broad test suite.

Layered architecture

Clean separation of concerns with pluggable backends and zero-allocation inference paths. Each project ships as a separate NuGet — pull in only what you need.

DotLLM.Server      OpenAI-compatible API, rate limiting, SSE streaming
        ▼
DotLLM.Engine      KV-cache, continuous batching, samplers, speculative decoding
        ▼
DotLLM.Tokenizers  BPE, SentencePiece, chat templates
DotLLM.Models      GGUF loader, architectures, LoRA
        ▼
DotLLM.Core        ITensor, IBackend, IModel, IAttentionMechanism, diagnostics, config
        ▼
DotLLM.Cpu         SIMD kernels, TensorPrimitives
DotLLM.Cuda        P/Invoke → native C/CUDA lib

In-depth documentation

Every subsystem has a dedicated markdown doc in engine/docs/ — design notes, file-level cross-references, and the why behind each decision. Twenty-four docs covering everything from GGUF parsing to CUDA kernel conventions. Read before you touch the code.

engine/docs/
├── ARCHITECTURE.md
├── ATTENTION.md
├── AOT.md
├── BENCHMARKS.md
├── CONSTRAINED_DECODING.md
├── CUDA.md
├── DIAGNOSTICS.md
├── GGUF_FORMAT.md
├── GPU.md
├── KV_CACHE.md
├── LORA.md
├── MODEL_CONFIG.md
├── MULTI_GPU.md
├── POSITION_ENCODING.md
├── QUANTIZATION.md
├── ROADMAP.md
├── SAMPLING.md
├── SCHEDULING.md
├── SERVER.md
├── SPECULATIVE.md
├── TELEMETRY.md
├── TOKENIZERS.md
├── TOOL_CALLING.md
└── WARMUP.md

Agent-reviewed pull requests

Every PR ships with a comprehensive agent-to-agent discussion: design review, trade-offs, risks, and a trace-back to the roadmap step it closes. The full decision log is in the open — anyone can replay how a feature came to be.

## Summary
Step 43: speculative decoding with
draft-verify-accept + KV-cache rollback.

## Agent review
> concern: the rejection sampling uses the
> target model's posterior, but the draft
> distribution isn't re-normalized after
> the accepted prefix — rejected tokens
> leak probability mass.
>
> ↳ fix: divide by (1 - p_accepted) before
>    re-sampling. Confirmed in
>    test_models_speculative.py.

> followup: constraint state also needs to
> roll back — added IDecodingConstraint
> .Clone() for draft branch forking.

Closes #43
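The renormalization the review refers to, in miniature: after rejecting a draft token, the standard residual distribution is max(0, p_target - p_draft), rescaled so the remaining probability mass sums to one again (the distributions below are made up for illustration).

```csharp
using System;
using System.Linq;

// Residual distribution for speculative rejection sampling: subtract the
// draft's probability from the target's, clamp at zero, renormalize.
double[] target = { 0.5, 0.3, 0.2 };   // toy target distribution
double[] draft  = { 0.7, 0.2, 0.1 };   // toy draft distribution

var residual = target.Zip(draft, (t, d) => Math.Max(0, t - d)).ToArray();
double mass  = residual.Sum();                       // leaked mass to restore
var renorm   = residual.Select(r => r / mass).ToArray();

Console.WriteLine(string.Join(" ", renorm.Select(r => r.ToString("F2"))));
// token 0 (where the draft over-proposed) gets zero; tokens 1 and 2 split the mass
```

Without the division, rejected tokens would "leak probability mass" exactly as the review describes, skewing the output distribution away from the target model's.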

Tests & Python tooling

Over 100 C# test files cover kernels, samplers, tokenizers, constraints, models, and KV-cache — both unit and integration. A Python suite in engine/scripts/ runs end-to-end integration tests on real GGUF models and cross-runtime benchmarks against llama.cpp.

engine/tests/
├── DotLLM.Tests.Unit/           86 files
│   ├── Cpu · Cuda · Engine
│   ├── Models · Server
│   └── Tensors · Tokenizers
└── DotLLM.Tests.Integration/    22 files
    ├── Cpu · Engine · Fixtures
    └── Models · Tokenizers

engine/scripts/                  16 files
├── test_models*.py               GGUF E2E
│   (json · schema · regex · grammar
│    · tools · speculative · warmup · aot)
├── test_server.py                HTTP API
└── bench_*.py                    vs llama.cpp