The V3 Architecture
Six retrieval channels. Three mathematical layers. One principle: your data, your machine, mathematically grounded.
Six-Channel Hybrid Retrieval
No single retrieval method handles every query type. V3.3 runs six channels in parallel, including a query-completion channel that infers your full intent from a partial query, and fuses the results for maximum recall.
Semantic Channel
Fisher-Rao weighted embedding similarity. Models each memory as a probability distribution — not a flat vector. Graduated ramp from cosine to information-geometric distance over the first 10 accesses.
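The graduated ramp can be sketched as follows. This is a minimal illustration, not the shipped implementation: it uses the closed-form Fisher-Rao distance between univariate Gaussians, aggregated over dimensions for the diagonal case, and the function names, the exponential distance-to-similarity mapping, and the linear ramp are assumptions.

```python
import numpy as np

def fisher_rao_diag(mu1, sig1, mu2, sig2):
    """Fisher-Rao distance between diagonal Gaussians, aggregated
    from the per-dimension closed form for univariate normals."""
    num = (mu1 - mu2) ** 2 + 2.0 * (sig1 - sig2) ** 2
    den = (mu1 - mu2) ** 2 + 2.0 * (sig1 + sig2) ** 2
    delta = np.sqrt(num / np.maximum(den, 1e-12))
    per_dim = 2.0 * np.sqrt(2.0) * np.arctanh(np.clip(delta, 0.0, 1.0 - 1e-9))
    return float(np.sqrt(np.sum(per_dim ** 2)))

def blended_score(q_mu, q_sig, m_mu, m_sig, n_accesses, ramp=10):
    """Graduated ramp: pure cosine at 0 accesses, pure
    information-geometric similarity after `ramp` accesses."""
    cos = float(q_mu @ m_mu / (np.linalg.norm(q_mu) * np.linalg.norm(m_mu)))
    w = min(n_accesses / ramp, 1.0)
    fr_sim = np.exp(-fisher_rao_diag(q_mu, q_sig, m_mu, m_sig))  # distance -> similarity
    return (1.0 - w) * cos + w * fr_sim
```

The key property: a memory with tight (low-variance) dimensions sits farther from an uncertain one than cosine alone would indicate, so confidence shapes ranking once enough access statistics exist.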
BM25 Channel
Classical keyword matching with persisted tokens. Handles exact names, rare terms, and technical identifiers that semantic similarity misses.
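The scoring this channel relies on is standard. A self-contained sketch of BM25 over pre-tokenized documents (parameter defaults k1=1.5, b=0.75 are common conventions, not values taken from the product):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with classic BM25.
    `docs` is a list of token lists (e.g. the persisted tokens)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Because matching is on exact tokens, identifiers like `H1` or a rare surname rank highly even when their embeddings are uninformative.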
Entity Graph Channel
Spreading activation across the knowledge graph. 3-hop traversal with 0.7 decay. Finds relational connections between people, places, and concepts.
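The traversal with the stated parameters (3 hops, 0.7 decay) can be sketched directly; the adjacency-map representation and function name are illustrative assumptions.

```python
from collections import defaultdict

def spreading_activation(graph, seeds, hops=3, decay=0.7):
    """Propagate activation from seed entities across an adjacency map.
    graph: {node: [neighbor, ...]}; returns {node: activation level}."""
    activation = defaultdict(float)
    frontier = {s: 1.0 for s in seeds}
    activation.update(frontier)
    for _ in range(hops):
        nxt = defaultdict(float)
        for node, a in frontier.items():
            for nb in graph.get(node, []):
                nxt[nb] += a * decay          # each hop attenuates by the decay factor
        for node, a in nxt.items():
            activation[node] = max(activation[node], a)
        frontier = nxt
    return dict(activation)
```

A node three hops out receives at most 0.7³ ≈ 0.34 activation, so distant entities surface only when several paths reinforce them.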
Temporal Channel
Date-aware retrieval with a 3-date model: authored, valid-from, valid-until. Handles questions like 'What was decided last Tuesday?'
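The 3-date model amounts to a validity-window check. A minimal sketch (class and field names are assumptions, though the three dates mirror the model described above):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Memory:
    text: str
    authored: date                       # when the memory was written down
    valid_from: Optional[date] = None    # when the fact starts holding
    valid_until: Optional[date] = None   # when it stops holding (None = still true)

def valid_on(mem: Memory, day: date) -> bool:
    """A fact matches a dated query if `day` falls inside its validity window."""
    start = mem.valid_from or mem.authored
    return start <= day and (mem.valid_until is None or day <= mem.valid_until)
```

Separating authored from valid-from is what lets "What was decided last Tuesday?" match a note written on Wednesday about Tuesday's decision.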
Retrieval Pipeline
Query → Strategy Classification → 4 Parallel Channels
→ Weighted RRF Fusion (k=60)
→ Scene Expansion (pull all facts from matched scenes)
→ Bridge Discovery (multi-hop: Steiner tree + spreading activation)
→ Cross-Encoder Reranking (energy-weighted blending)
→ Top-K Results with per-channel scores
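The fusion step above is weighted reciprocal rank fusion. A compact sketch with the stated k=60 (the per-channel weight values are an assumption; the source only says the fusion is weighted):

```python
def weighted_rrf(rankings, weights, k=60):
    """rankings: {channel: [doc_id, ...] best-first};
    weights: {channel: float}. Fused score per doc:
    sum over channels of w_c / (k + rank_c(doc))."""
    scores = {}
    for ch, ranked in rankings.items():
        w = weights.get(ch, 1.0)
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs only rank positions, never raw scores, so channels with incomparable score scales (cosine similarity vs. BM25) fuse cleanly.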
Three Novel Contributions
Each technique addresses a fundamental limitation in current memory systems. To our knowledge, none have been applied to agent memory before.
Fisher-Rao Geometry
Retrieval
The natural metric on statistical manifolds. Each memory embedding is modeled as a diagonal Gaussian with learned mean and variance. High-confidence memories score differently from uncertain ones — retrieval improves with use.
Sheaf Cohomology
Consistency
Algebraic topology for contradiction detection. The knowledge graph is modeled as a cellular sheaf. Computing H¹(G,F) reveals global inconsistencies that pairwise checking cannot detect — even when every local pair looks consistent.
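A toy version of this idea, under simplifying assumptions: take difference constraints x_j − x_i = b on the edges of a graph (a very simple sheaf of real stalks). Any component of the edge data b outside the image of the coboundary map d⁰ is an H¹ obstruction — a global inconsistency that no single edge reveals, since each constraint is satisfiable on its own.

```python
import numpy as np

def h1_obstruction(n_nodes, edges):
    """edges: list of (i, j, b) encoding constraints x_j - x_i = b.
    Returns the least-squares residual of d0 @ x = b; a nonzero
    residual means b has a component outside im(d0), i.e. the
    constraints are globally inconsistent."""
    d0 = np.zeros((len(edges), n_nodes))
    b = np.zeros(len(edges))
    for row, (i, j, val) in enumerate(edges):
        d0[row, i], d0[row, j], b[row] = -1.0, 1.0, val
    x, *_ = np.linalg.lstsq(d0, b, rcond=None)
    return float(np.linalg.norm(d0 @ x - b))

# Each pairwise constraint is fine alone, but the cycle sum 1+1+1 != 0,
# so the three "facts" cannot hold simultaneously.
res = h1_obstruction(3, [(0, 1, 1.0), (1, 2, 1.0), (2, 0, 1.0)])
```

The actual system uses richer stalks and restriction maps than scalar differences, but the mechanism is the same: inconsistency lives in the cokernel of the coboundary map, not in any pairwise check.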
Langevin Dynamics
Lifecycle
Memory lifecycle evolves via stochastic gradient flow on the Poincaré ball. The potential encodes access frequency, trust, and recency. Provable convergence to the stationary distribution — no hardcoded thresholds.
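A rough sketch of one Langevin step on the Poincaré ball, under loud assumptions: this rescales the Euclidean gradient and noise by the ball's conformal factor and retracts back inside the unit ball, which approximates rather than exactly reproduces Riemannian Langevin dynamics. The step size, retraction, and function names are all illustrative; consult the V3 paper for the exact update.

```python
import numpy as np

def langevin_step(x, grad_U, eta=0.01, rng=None, eps=1e-5):
    """One approximate Langevin step on the Poincare ball:
    x <- x - eta * grad_R U(x) + sqrt(2*eta) * noise, where the
    Riemannian gradient and noise use the conformal factor
    lam(x) = 2 / (1 - |x|^2). grad_U returns the Euclidean gradient."""
    rng = rng or np.random.default_rng()
    lam = 2.0 / (1.0 - np.dot(x, x))
    drift = -eta * grad_U(x) / lam**2                      # metric-rescaled gradient
    noise = np.sqrt(2.0 * eta) * rng.standard_normal(x.shape) / lam
    x_new = x + drift + noise
    norm = np.linalg.norm(x_new)
    if norm >= 1.0:                                        # retract strictly inside the ball
        x_new = x_new / norm * (1.0 - eps)
    return x_new
```

The point of the stochastic term is that memories never hit a hard cutoff: promotion and decay follow the potential's stationary distribution instead of a threshold.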
Full mathematical treatment with proofs and theorems in the V3 paper.
Read the Paper
11-Step Ingestion Pipeline
Every memory is processed through structured encoding before storage. This transforms raw text into a rich, queryable knowledge structure.
Metadata extraction — timestamps, source, importance
Entity resolution — canonical names with alias tracking
Fact extraction — atomic, typed facts (world / experience / opinion / temporal)
Knowledge graph construction — entities as nodes, relationships as edges
Temporal parsing — 3-date model (authored, valid-from, valid-until)
Emotional signal extraction — sentiment and emotional context
Scene clustering — group facts by temporal-semantic coherence
Observation building — structured entity profiles
Foresight generation — anticipatory indexing for future queries
Entropy gating — information-theoretic filtering (low-entropy = skip)
Compression and storage — write to 21-table SQLite schema
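Step 10's entropy gate can be sketched in a few lines. This is a minimal stand-in: the real system presumably gates on a richer information measure, and the threshold here is an arbitrary illustrative value.

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """Shannon entropy (bits per token) of a token sequence."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def passes_entropy_gate(text, threshold=1.5):
    """Information-theoretic filter: skip low-entropy (repetitive,
    near-content-free) text before it reaches storage."""
    tokens = text.lower().split()
    if not tokens:
        return False
    return shannon_entropy(tokens) >= threshold
```

Filler like "ok ok ok" carries near-zero entropy and is skipped; a sentence with distinct entities clears the gate easily.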
You Choose the Privacy-Accuracy Tradeoff
Local Guardian
Zero cloud calls. All processing on your machine. EU AI Act compliant by architecture — data never leaves your device.
LoCoMo (data stays local)
Smart Local
The Local Guardian mode plus a local LLM via Ollama. Answer synthesis stays on your machine. Still fully private — nothing sent to any cloud.
Full Power
Cloud LLM for maximum accuracy. Cross-encoder reranking and agentic retrieval with multi-round refinement.
LoCoMo (full power)
The Full Implementation is Open Source
Every algorithm, every test, every benchmark — available under AGPL v3.