Pure C# pipeline
Tokenizer, sampler, scheduler, kernels — all C#. No Python, no foreign runtime, no llama.cpp wrapper. The entire inference path runs on .NET 10 with full debugger and profiler support. Just dotnet run.
Experimental: dotLLM is a learning project and not yet production-ready. Expect incomplete features, rough edges, and breaking changes.
High-performance LLM inference engine written natively in C#/.NET. Not a wrapper — a ground-up implementation.
An open-source LLM inference engine built from the ground up in C# targeting .NET 10. It runs transformer-based models with SIMD-optimized CPU and CUDA GPU backends — no Python, no wrappers, no bindings to external runtimes.
It's also a learning project: a place to explore how modern inference engines work end to end, from GGUF loading and quantized matmul kernels to paged KV-cache and speculative decoding.
And it's an ongoing experiment in AI-assisted coding — Claude Code, Gemini, Codex — to see how far modern coding agents can push a complex systems project when driven by a human who knows the domain. Not vibe-coding.
FP16, Q8_0, Q4_0/1, Q5_0/1, Q4_K, Q5_K, Q6_K (incl. Q4_K_M) — with fused dequantize-and-accumulate matmul kernels operating directly on quantized blocks. Every SIMD path has a verified scalar fallback.
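As a sketch of what "fused dequantize-and-accumulate" means for Q8_0 (blocks of 32 int8 quants sharing one scale), the weights never round-trip through a dense float buffer; each block is dequantized once, inside the dot-product loop. Names and layout here are simplified for illustration, not dotLLM's kernel code:

```csharp
using System;

// Simplified Q8_0 layout: scales[b] is the per-block scale, quants holds
// 32 int8 values per block. (Real GGUF blocks pack an FP16 scale followed
// by the 32 quants in one struct.)
static class Q8_0
{
    public const int BlockSize = 32;

    // Fused dot product of a quantized row against a float activation vector:
    // accumulate in the quantized domain, apply the scale once per block.
    public static float Dot(float[] scales, sbyte[] quants, float[] x)
    {
        float acc = 0f;
        for (int b = 0; b < scales.Length; b++)
        {
            float blockAcc = 0f;
            int baseIdx = b * BlockSize;
            for (int i = 0; i < BlockSize; i++)
                blockAcc += quants[baseIdx + i] * x[baseIdx + i];
            acc += scales[b] * blockAcc; // dequantize once per block
        }
        return acc;
    }
}
```

Production kernels vectorize the inner loop with System.Runtime.Intrinsics; a scalar version like this one is the kind of verified fallback the paragraph above refers to.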
Parallel function calling with model-specific parsers for Llama, Hermes, Mistral, and a Generic fallback. JSON-schema-constrained arguments via structured output — no malformed tool calls, no retries.
FSM/PDA-based logit masking applied at every decoding step. Guarantees syntactically valid JSON, JSON Schema, regex, or GBNF grammar output — no retries, no post-processing — via IDecodingConstraint.
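To make the masking step concrete, here is a minimal stand-in (the IDecodingConstraintSketch interface and AllowSetConstraint class are invented for illustration; dotLLM's real IDecodingConstraint tracks FSM/PDA state across steps): any token the constraint rejects has its logit forced to negative infinity before the sampler runs, so invalid output is impossible by construction.

```csharp
using System;
using System.Collections.Generic;

// Invented stand-in for illustration; not dotLLM's IDecodingConstraint.
interface IDecodingConstraintSketch
{
    bool IsAllowed(int tokenId);   // a real constraint consults FSM/PDA state
}

// Trivial constraint: only tokens in an explicit allow-set may come next.
sealed class AllowSetConstraint : IDecodingConstraintSketch
{
    private readonly HashSet<int> _allowed;
    public AllowSetConstraint(params int[] allowed) => _allowed = new HashSet<int>(allowed);
    public bool IsAllowed(int tokenId) => _allowed.Contains(tokenId);
}

static class ConstrainedDecoding
{
    // Applied to the logits at every decoding step, before sampling.
    public static void MaskLogits(Span<float> logits, IDecodingConstraintSketch c)
    {
        for (int t = 0; t < logits.Length; t++)
            if (!c.IsAllowed(t))
                logits[t] = float.NegativeInfinity; // can never be sampled
    }
}
```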
Block-allocated cache with reference counting and copy-on-write — eliminates fragmentation for long contexts. Q8_0/Q4_0 KV quantization with a mixed-precision window extends effective context 2–4×.
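The mechanics of block sharing can be sketched in a few lines (illustrative names, not dotLLM's API): forked sequences retain the same physical blocks, and writing to a shared block first clones it.

```csharp
using System;
using System.Collections.Generic;

// Sketch of block-allocated KV paging with reference counting and
// copy-on-write. Block contents are omitted; only ownership is modeled.
class BlockPool
{
    readonly int[] _refCount;
    readonly Stack<int> _free = new();

    public BlockPool(int numBlocks)
    {
        _refCount = new int[numBlocks];
        for (int i = numBlocks - 1; i >= 0; i--) _free.Push(i);
    }

    public int Allocate() { int b = _free.Pop(); _refCount[b] = 1; return b; }
    public void Retain(int b) => _refCount[b]++;                   // fork: share the block
    public void Release(int b) { if (--_refCount[b] == 0) _free.Push(b); }

    // Copy-on-write: a uniquely owned block is written in place; a shared
    // block is released and replaced by a fresh copy for the writer.
    public int EnsureWritable(int b)
    {
        if (_refCount[b] == 1) return b;
        Release(b);
        return Allocate(); // caller copies the KV data into the new block
    }
}
```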
Run models larger than your VRAM by splitting layers between GPU and CPU. Pass --gpu-layers N and dotLLM runs the first N transformer layers on CUDA and the rest on the CPU backend, with automatic tensor transfer at the layer boundary.
Build sampling pipelines from independent ISamplerStep components — temperature, top-k, top-p, min-p, repetition/frequency/presence penalties. Steps are reorderable and individually testable.
Hash-based prefix cache for multi-turn conversations — the matching prefix skips prefill and only new tokens are processed. LRU eviction, configurable session count, and dramatically lower time-to-first-token on chat.
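A minimal version of the idea (illustrative only, ignoring LRU eviction and session limits): hash every token prefix at store time, then at lookup walk back from the full prompt to find the longest hash hit; only tokens past that point need prefill.

```csharp
using System;
using System.Collections.Generic;

static class PrefixCache
{
    // prefix hash -> prefix length (a real cache maps to KV-cache state)
    static readonly Dictionary<long, int> _cached = new();

    static long Hash(ReadOnlySpan<int> tokens)
    {
        long h = 17;
        foreach (var t in tokens) h = h * 31 + t; // simple rolling hash
        return h;
    }

    public static void Store(int[] tokens)
    {
        for (int len = 1; len <= tokens.Length; len++)
            _cached[Hash(tokens.AsSpan(0, len))] = len;
    }

    // Returns how many leading tokens of the new prompt can skip prefill.
    public static int LongestCachedPrefix(int[] tokens)
    {
        for (int len = tokens.Length; len >= 1; len--)
            if (_cached.ContainsKey(Hash(tokens.AsSpan(0, len))))
                return len;
        return 0;
    }
}
```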
A smaller draft model proposes candidate tokens, the target model verifies in a single forward pass. Automatic KV-cache rollback on mismatch and constraint-aware — identical output, fewer forward passes.
Iteration-level scheduling with dynamic request arrival — new prompts join the batch without waiting for in-flight requests to finish. Priority queues and preemption policies maximize GPU utilization across concurrent requests. Landing in Phase 9.
Load pre-trained sparse autoencoders, attach to any layer, and decompose activations into interpretable features — plus feature steering and ablation, all inside the same inference pipeline. Built on the Phase 7 hook system.
Tensor parallelism via NCCL with explicit DevicePlacement on every tensor operation. The abstractions are baked into the core from day one, ready for the implementation to land.
AVX2/AVX-512 vectorized CPU kernels via System.Runtime.Intrinsics and TensorPrimitives. CUDA backend through PTX kernels loaded via the Driver API — no native shared library to build.
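For example, System.Numerics.Tensors exposes vectorized primitives that pick the widest instruction set available at runtime (this snippet assumes the System.Numerics.Tensors package is referenced):

```csharp
using System;
using System.Numerics.Tensors;

float[] a = { 1f, 2f, 3f, 4f };
float[] b = { 5f, 6f, 7f, 8f };

// Vectorized on AVX2/AVX-512 hardware, scalar elsewhere — same call either way.
float dot = TensorPrimitives.Dot(a, b);       // 70

float[] probs = new float[4];
TensorPrimitives.SoftMax(a, probs);            // numerically stable softmax
Console.WriteLine(dot);
```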
All tensor data lives in NativeMemory.AlignedAlloc with 64-byte alignment for AVX-512. Zero managed allocations on the hot path; Server GC runs in SustainedLowLatency mode during generation.
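The allocation pattern looks like this (compile with AllowUnsafeBlocks; the latency mode is the one named in the paragraph above):

```csharp
using System;
using System.Runtime;
using System.Runtime.InteropServices;

// Defer blocking GCs while generation is in flight.
GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;

unsafe
{
    const int n = 1024;
    // 64-byte alignment satisfies AVX-512 aligned-load requirements.
    float* buf = (float*)NativeMemory.AlignedAlloc((nuint)(n * sizeof(float)), 64);
    try
    {
        new Span<float>(buf, n).Clear();            // zero the buffer
        Console.WriteLine((nuint)buf % 64 == 0);    // True
    }
    finally
    {
        NativeMemory.AlignedFree(buf);              // native memory: free it explicitly
    }
}
```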
Models load via MemoryMappedFile with OS demand-paging — a 7 GB weights file becomes available in milliseconds. The OS page cache shares physical pages across multiple processes for free.
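The same trick at user level, sketched with a tiny stand-in file so the snippet is self-contained (point the path at a real weights file in practice): map the file, then read fields at arbitrary offsets while the OS pages data in lazily.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Stand-in file: GGUF headers start with the ASCII magic "GGUF" and a
// uint32 version field.
string path = Path.Combine(Path.GetTempPath(), "demo.gguf");
File.WriteAllBytes(path, new byte[] { 0x47, 0x47, 0x55, 0x46, 3, 0, 0, 0 });

using var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open);
using var accessor = mmf.CreateViewAccessor();

// No bulk read happens here — pages fault in on demand, which is why a
// multi-GB weights file "opens" in milliseconds.
uint magic = accessor.ReadUInt32(0);     // 0x46554747 == ASCII "GGUF"
uint version = accessor.ReadUInt32(4);
Console.WriteLine($"{magic == 0x46554747} v{version}");
```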
Trimming-safe single-file publish via .NET Native AOT: source-generated JSON, annotated [DynamicallyAccessedMembers], rd.xml for preserved types. Startup drops from ~500 ms JIT to around 50 ms.
Configurable WarmupOptions trigger a full JIT pre-compilation pass and pre-load CUDA kernels before traffic arrives. Readiness probes gate on completion — no first-request latency spike.
Two ways to ship it: run the standalone server, or embed the engine inside your own .NET app.
Start the OpenAI-compatible API and a built-in browser chat in one command. Drop-in for any tool that already speaks OpenAI — no Docker Compose, no Open WebUI sidecar, no extra services.
Endpoints: /v1/chat/completions, /v1/completions, /v1/models, and /v1/tokenize, with logprobs/top_logprobs support and interactive stub tool calls in the chat UI. Pass --no-ui for API-only hosting, or --device gpu --gpu-layers N to offload layers to CUDA.

$ dotllm serve llama-3.2-3b.Q4_K_M.gguf \
--device gpu \
--gpu-layers 32 \
--port 8080
→ Loaded model (mmap, 1.9 GB)
→ CUDA backend ready (device 0)
→ Prompt cache enabled (4 sessions)
→ API listening on http://localhost:8080
→ Opening chat UI in browser...
Reference the NuGet packages, load a GGUF model, and stream tokens — or mount the OpenAI-compatible endpoints inside your own ASP.NET host.
Open a GGUF file, build a TransformerModel and tokenizer, and let TextGenerator stream tokens as they are sampled.
using DotLLM.Engine;
using DotLLM.Models.Architectures;
using DotLLM.Models.Gguf;
using DotLLM.Tokenizers.Bpe;
using var gguf = GgufFile.Open("llama-3.2-1b.Q4_K_M.gguf");
var config = GgufModelConfigExtractor.Extract(gguf.Metadata);
using var model = TransformerModel.LoadFromGguf(gguf, config);
var tokenizer = GgufBpeTokenizerFactory.Load(gguf.Metadata);
var generator = new TextGenerator(model, tokenizer);
await foreach (var text in generator.GenerateStreamingAsync(
"Explain dotLLM.",
new InferenceOptions { MaxTokens = 128 }))
{
Console.Write(text);
}
Build a ServerOptions, let ServerStartup load the model, then call MapDotLLMEndpoints() on the web app.
using DotLLM.Server;
var options = new ServerOptions
{
Model = "llama-3.2-3b.Q4_K_M.gguf",
Device = "gpu",
GpuLayers = 32,
Port = 8080,
PromptCacheEnabled = true,
UsePaged = true,
};
using var state = ServerStartup.LoadModel(options.Model, options);
var app = ServerStartup.BuildApp(state, args: [], serveUi: true);
await app.RunAsync($"http://localhost:{options.Port}");
Deeper dives into the sampling pipeline, speculative decoding, logprobs visualization, and the upcoming interpretability stack.
Build your sampler from independent, reorderable steps. Each ISamplerStep transforms logits — chain them in any order, add custom steps, or constrain output to a schema.
See also IDecodingConstraint and IStopCondition.

var sampler = new SamplerPipeline(
new TemperatureStep(0.8f),
new TopKStep(40),
new TopPStep(0.95f),
new MinPStep(0.05f)
);
var options = new InferenceOptions
{
Sampler = sampler,
StopConditions = [new EosStop()],
MaxTokens = 512
};
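Custom steps slot into the same chain. The exact ISamplerStep shape is an assumption here, so this sketch declares its own minimal stand-in interface; a logit-bias step might look like:

```csharp
using System;
using System.Collections.Generic;

// Stand-in with an assumed shape; dotLLM's real ISamplerStep may differ
// (it may carry sampling state or candidate lists, for example).
interface ISamplerStepSketch
{
    void Apply(Span<float> logits);
}

// Custom step: add a fixed bias to chosen token ids before sampling,
// analogous to the OpenAI logit_bias parameter.
sealed class LogitBiasStep : ISamplerStepSketch
{
    private readonly Dictionary<int, float> _bias;
    public LogitBiasStep(Dictionary<int, float> bias) => _bias = bias;

    public void Apply(Span<float> logits)
    {
        foreach (var (tokenId, delta) in _bias)
            logits[tokenId] += delta;
    }
}
```

Because each step only transforms logits, a step like this is trivially unit-testable in isolation, which is the point of the pipeline design.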
A small, fast draft model proposes candidate tokens; the target model verifies them all in a single forward pass, accepting correct predictions and rolling back mismatches via KV-cache rollback. Same output, fewer forward passes.
Constraint state forks along with the draft via IDecodingConstraint.Clone().

dotllm serve llama-3.2-70b.Q4_K_M.gguf \
--speculative-model llama-3.2-1b.Q8_0.gguf \
--speculative-k 5 \
--device gpu \
--port 8080
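The accept/reject rule itself is small enough to sketch (self-contained illustration, not engine code): a proposed token is accepted with probability min(1, p/q) under the target distribution p and draft distribution q; on rejection, the residual max(0, p - q) is renormalized before resampling, which is what keeps the output distribution identical to plain decoding.

```csharp
using System;

static class SpeculativeAccept
{
    // p: target-model probabilities, q: draft-model probabilities,
    // proposed: the draft's token (so q[proposed] > 0 by construction).
    public static int Verify(float[] p, float[] q, int proposed, Random rng)
    {
        if (rng.NextDouble() < Math.Min(1.0, p[proposed] / q[proposed]))
            return proposed;                       // accept the draft token

        // Reject: resample from the renormalized residual max(0, p - q).
        var residual = new float[p.Length];
        float norm = 0f;
        for (int t = 0; t < p.Length; t++)
        {
            residual[t] = Math.Max(0f, p[t] - q[t]);
            norm += residual[t];
        }
        double u = rng.NextDouble() * norm;
        for (int t = 0; t < residual.Length; t++)
            if ((u -= residual[t]) <= 0) return t;
        return residual.Length - 1;
    }
}
```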
OpenAI-compatible logprobs and top_logprobs on every chat completion. The built-in Chat UI uses them to render each generated token in a confidence color — you can see at a glance where the model was certain, where it was guessing, and where sampling noticeably changed the output.
Zero-cost hook points throughout the model: capture activations at any layer or run logit lens projections, with no overhead when hooks are disabled; dispatch is a null-check guard rather than the .NET event pattern. Landing in Phase 7.
See IInferenceHook and the available HookPoint locations.

var hook = new ActivationCaptureHook(
HookPoint.AfterAttention,
layers: [0, 12, 24]
);
await foreach (var token in model
.GenerateAsync(prompt, hooks: [hook]))
{
Console.Write(token);
}
// Inspect captured activations
var attn = hook.Activations[12];
First-class SAE support for mechanistic interpretability. Load a trained sparse autoencoder, attach it to any layer, and decompose activations into interpretable features — all within the same inference pipeline. Built on top of the hook system (Phase 7).
using var sae = await SparseAutoencoder
.LoadAsync("sae-llama-layer12.safetensors");
var hook = new SaeHook(sae, layer: 12);
await foreach (var token in model
.GenerateAsync(prompt, hooks: [hook]))
{
Console.Write(token);
}
// Top activated features
foreach (var f in hook.TopFeatures(k: 10))
Console.WriteLine($" #{f.Index}: {f.Activation:F3}");
A single parameterized TransformerBlock handles multiple model families via ModelConfig. This list is kept in sync with the engine roadmap — ✓ shipped ◷ planned.
A layered architecture with pluggable backends, backed by deep per-subsystem docs, agent-reviewed pull requests, and a broad test suite.
Clean separation of concerns with pluggable backends and zero-allocation inference paths. Each project ships as a separate NuGet — pull in only what you need.
Every subsystem has a dedicated markdown doc in engine/docs/ — design notes, file-level cross-references, and the why behind each decision. Twenty-four docs covering everything from GGUF parsing to CUDA kernel conventions. Read before you touch the code.
engine/docs/
├── ARCHITECTURE.md
├── ATTENTION.md
├── AOT.md
├── BENCHMARKS.md
├── CONSTRAINED_DECODING.md
├── CUDA.md
├── DIAGNOSTICS.md
├── GGUF_FORMAT.md
├── GPU.md
├── KV_CACHE.md
├── LORA.md
├── MODEL_CONFIG.md
├── MULTI_GPU.md
├── POSITION_ENCODING.md
├── QUANTIZATION.md
├── ROADMAP.md
├── SAMPLING.md
├── SCHEDULING.md
├── SERVER.md
├── SPECULATIVE.md
├── TELEMETRY.md
├── TOKENIZERS.md
├── TOOL_CALLING.md
└── WARMUP.md
Every PR ships with a comprehensive agent-to-agent discussion: design review, trade-offs, risks, and a trace-back to the roadmap step it closes. The full decision log is in the open — anyone can replay how a feature came to be.
## Summary
Step 43: speculative decoding with
draft-verify-accept + KV-cache rollback.
## Agent review
> concern: the rejection sampling uses the
> target model's posterior, but the draft
> distribution isn't re-normalized after
> the accepted prefix — rejected tokens
> leak probability mass.
>
> ↳ fix: divide by (1 - p_accepted) before
> re-sampling. Confirmed in
> test_models_speculative.py.
> followup: constraint state also needs to
> roll back — added IDecodingConstraint
> .Clone() for draft branch forking.
Closes #43
Over 100 C# test files cover kernels, samplers, tokenizers, constraints, models, and KV-cache — both unit and integration. A Python suite in engine/scripts/ runs end-to-end integration tests on real GGUF models and cross-runtime benchmarks against llama.cpp.
engine/tests/
├── DotLLM.Tests.Unit/ 86 files
│ ├── Cpu · Cuda · Engine
│ ├── Models · Server
│ └── Tensors · Tokenizers
└── DotLLM.Tests.Integration/ 22 files
├── Cpu · Engine · Fixtures
└── Models · Tokenizers
engine/scripts/ 16 files
├── test_models*.py GGUF E2E
│ (json · schema · regex · grammar
│ · tools · speculative · warmup · aot)
├── test_server.py HTTP API
└── bench_*.py vs llama.cpp