Building memory that learns: a technical deep dive

This is Part 2 of a series on building AI memory that learns. Part 1 covered the landscape and why feedback is the missing primitive. This post covers how the engine works.

Architecture

Memory Layer has two layers that change at different rates: a core engine and the per-platform integrations.

[Figure: Memory layers]

This matters because platforms evolve fast. I’ve rewritten the Claude Code integration twice already as their plugin and task system changed. The core engine hasn’t changed.

Data Model

A memory looks like this:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Memory:
    id: str                    # UUID
    content: str               # The actual text
    category: MemoryCategory   # architecture, gotcha, pattern, etc.

    # Where it applies
    project: str | None        # None = global

    # Learning signals
    outcome_score: float       # -1.0 to 1.0, starts at 0.0
    use_count: int             # Times retrieved
    confidence: float          # How reliable the source is (0-1)

    # Timestamps
    created_at: datetime
    updated_at: datetime

    # For semantic search
    embedding: bytes           # 384-dim vector

The 16 categories exist because generic “memories” are hard to rank. If you ask about errors, a troubleshooting memory should rank higher than an architecture memory, even if the semantic similarity is the same.

Nine categories do most of the work:

architecture - system design decisions
convention - coding standards (“we use snake_case”)
decision - why we chose X over Y
pattern - reusable code patterns
gotcha - non-obvious behaviors, traps
workaround - temporary fixes
troubleshooting - how to fix specific errors
command - useful CLI commands
preference - personal preferences

The remaining (dependency, environment, coding_style, tool_preference, context, todo, general) handle edge cases.
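
The `MemoryCategory` type used in the dataclass is not shown in this post. A minimal sketch, assuming it is a plain string-valued enum over the 16 names listed above (the enum shape and member names are my assumption, not the project's actual definition):

```python
from enum import Enum


class MemoryCategory(str, Enum):
    """Hypothetical enum over the 16 category names listed in the post."""

    # The nine categories that do most of the work
    ARCHITECTURE = "architecture"
    CONVENTION = "convention"
    DECISION = "decision"
    PATTERN = "pattern"
    GOTCHA = "gotcha"
    WORKAROUND = "workaround"
    TROUBLESHOOTING = "troubleshooting"
    COMMAND = "command"
    PREFERENCE = "preference"
    # The remaining edge-case categories
    DEPENDENCY = "dependency"
    ENVIRONMENT = "environment"
    CODING_STYLE = "coding_style"
    TOOL_PREFERENCE = "tool_preference"
    CONTEXT = "context"
    TODO = "todo"
    GENERAL = "general"
```

Subclassing `str` means the value can be written straight into the TEXT `category` column and parsed back with `MemoryCategory("gotcha")`.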

Storage

I prototyped with pgvector, Qdrant and Neo4j before settling on SQLite.

pgvector: Real vector indexing, scales to millions. But requires PostgreSQL running, ~500MB RAM overhead. Overkill for a personal memory store.

Qdrant: Purpose-built for vectors, billion-scale. But another service to manage. Wrong abstraction level for a local, single-user tool.

Neo4j: Models relationships elegantly. I seriously considered it for dependency tracking. In practice, foreign keys covered most use cases. The operational cost wasn’t worth it.

SQLite + FTS5 + Python cosine: Zero config, single file, ships with Python. No vector indexing but at 100k memories brute-force cosine similarity still takes <100ms. The “limitations” don’t matter at personal scale.

Decision framework: what’s the simplest thing that handles 10x my current load? SQLite.

The schema:

CREATE TABLE memories (
    id TEXT PRIMARY KEY,
    content TEXT NOT NULL,
    category TEXT NOT NULL,
    project TEXT,

    outcome_score REAL DEFAULT 0.0,
    use_count INTEGER DEFAULT 0,
    confidence REAL DEFAULT 1.0,

    embedding BLOB,
    created_at TEXT,
    updated_at TEXT
);

-- BM25 keyword search
CREATE VIRTUAL TABLE memories_fts USING fts5(
    content, category,
    content='memories'
);

CREATE INDEX idx_outcome ON memories(outcome_score);
CREATE INDEX idx_project ON memories(project);

I store embeddings as BLOBs and do similarity search in Python. Not as elegant as pgvector but it works and keeps the dependency footprint small.
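
A rough sketch of what "BLOB plus similarity in Python" can look like. The float32 little-endian layout and the helper names (`pack_embedding`, `unpack_embedding`, `top_k`) are assumptions for illustration. It uses only the standard library so it runs anywhere; a real implementation would more likely use NumPy for speed.

```python
import math
import struct


def pack_embedding(vec):
    """Serialize a list of floats as little-endian float32 (4 bytes each)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_embedding(blob):
    """Inverse of pack_embedding: BLOB bytes back to a list of floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=5):
    """Brute-force scan. rows is an iterable of (memory_id, embedding_blob)."""
    scored = [
        (memory_id, cosine_similarity(query_vec, unpack_embedding(blob)))
        for memory_id, blob in rows
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

With this layout a 384-dimension vector takes 1,536 bytes per row, and `rows` can come straight from `SELECT id, embedding FROM memories`.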

Why FTS5?

Vector search is great for semantic similarity: “authentication” matches “login flow” even though the words differ. But sometimes you want exact matches. If someone searches “connection reset error,” that should match memories containing those exact words, even if the embedding thinks “network timeout” is semantically closer.

BM25 (via FTS5) handles that. I run keyword search and vector search side by side and combine the scores.
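
The post does not spell out how the two scores are combined, so here is one plausible sketch rather than the actual implementation. The 70/30 weighting, the max-normalization of BM25, and the function names are all assumptions. It relies on two documented FTS5 behaviors: `bm25()` returns lower (more negative) values for better matches, and quoting a token keeps punctuation from being parsed as query syntax.

```python
import sqlite3


def to_match_query(text):
    """Quote each token so characters like '?' are not parsed as FTS5 syntax."""
    return " ".join('"' + tok.replace('"', '""') + '"' for tok in text.split())


def keyword_scores(conn, query):
    """Return {rowid: score in (0, 1]} from FTS5 BM25; the best match gets 1.0."""
    rows = conn.execute(
        "SELECT rowid, bm25(memories_fts) FROM memories_fts "
        "WHERE memories_fts MATCH ?",
        (to_match_query(query),),
    ).fetchall()
    if not rows:
        return {}
    # bm25() is "lower is better", so negate it to get "higher is better".
    raw = {rowid: -score for rowid, score in rows}
    best = max(raw.values())
    if best <= 0:
        return {rowid: 1.0 for rowid in raw}
    return {rowid: value / best for rowid, value in raw.items()}


def hybrid_scores(vector, keyword, alpha=0.7):
    """Weighted blend: alpha for the vector score, (1 - alpha) for the keyword score.

    Both inputs map rowid -> score in [0, 1]; a missing entry counts as 0.
    alpha=0.7 is an assumed value, not one taken from the post.
    """
    ids = set(vector) | set(keyword)
    return {
        i: alpha * vector.get(i, 0.0) + (1 - alpha) * keyword.get(i, 0.0)
        for i in ids
    }
```

In this sketch an exact keyword hit can lift a memory above one with a slightly higher cosine score, which is the "connection reset error" case described above.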

Embeddings

I use sentence-transformers/all-MiniLM-L6-v2:

384 dimensions (small enough to store as a BLOB without pain)
Fast inference (~10ms per embedding on CPU)
Good quality for technical text

First run downloads the model (~100MB). Only once.

The 5-Signal Retrieval Formula

This is the core of Memory Layer:

[Figure: Retrieval flow]

from datetime import datetime
from math import exp, log

def score_memory(memory, query_embedding, detected_category=None):
    # 1. Semantic similarity (35%)
    semantic = cosine_similarity(query_embedding, memory.embedding)

    # 2. Outcome score (25%) - the differentiator
    # Normalize from [-1, 1] to [0, 1]
    outcome = (memory.outcome_score + 1) / 2

    # 3. Recency (15%)
    days_old = (datetime.now() - memory.updated_at).days
    recency = exp(-days_old / 30)  # 30-day decay constant (half-life is about 21 days)

    # 4. Frequency (15%)
    frequency = log(memory.use_count + 1) / log(100)
    frequency = min(frequency, 1.0)

    # 5. Confidence (10%)
    confidence = memory.confidence

    base = (0.35 * semantic +
            0.25 * outcome +
            0.15 * recency +
            0.15 * frequency +
            0.10 * confidence)

    # Category boost
    if detected_category == memory.category:
        # Categories without an explicit boost default to 1.0x
        base *= CATEGORY_BOOSTS.get(memory.category, 1.0)

    return base

Why These Weights? (And What I Tried First)

The weights came from iteration, not theory.

Version 1: Pure semantic (100%)
Results: 70% precision. Problem: equally relevant memories ranked randomly. A tip that burned me yesterday ranked the same as one that saved me.

Version 2: Semantic (50%) + Outcome (50%)
Results: 85% precision but weird edge cases. A memory with one “worked” dominated everything. New memories never surfaced because they had no outcome data yet.

Version 3: Added recency and frequency
Semantic (40%) + Outcome (30%) + Recency (15%) + Frequency (15%)
Results: 88% precision. Better, but new memories still struggled.

Version 4 (current): Added confidence, rebalanced
Semantic (35%) + Outcome (25%) + Recency (15%) + Frequency (15%) + Confidence (10%)

5-SIGNAL WEIGHT BREAKDOWN

Semantic    ████████████████████████████████████  35%  ← Core relevance
Outcome     ██████████████████████████            25%  ← Learned effectiveness
Recency     ████████████████                      15%  ← Fresh context
Frequency   ████████████████                      15%  ← Usage patterns
Confidence  ██████████                            10%  ← Source reliability
            ─────────────────────────────────────
            0%        10%       20%       30%

The confidence signal solved the cold-start problem. New memories I explicitly add get confidence=1.0. Auto-extracted memories get 0.7-0.9. This gives new explicit memories a slight boost until they accumulate outcome data.
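
To make the size of that boost concrete, here is the weighted sum applied to two brand-new memories. The weights are the ones given in this post. The input values (semantic similarity 0.8, recency 1.0, no uses yet) are made up for illustration, and `base_score` is a hypothetical helper name.

```python
# Weights from the current (version 4) formula.
WEIGHTS = {
    "semantic": 0.35,
    "outcome": 0.25,
    "recency": 0.15,
    "frequency": 0.15,
    "confidence": 0.10,
}


def base_score(semantic, outcome_score, recency, frequency, confidence):
    """Weighted sum before any category boost; outcome_score is in [-1, 1]."""
    outcome = (outcome_score + 1) / 2  # map [-1, 1] onto [0, 1]
    return (
        WEIGHTS["semantic"] * semantic
        + WEIGHTS["outcome"] * outcome
        + WEIGHTS["recency"] * recency
        + WEIGHTS["frequency"] * frequency
        + WEIGHTS["confidence"] * confidence
    )


# Two hypothetical new memories: no outcomes, never retrieved, just created.
explicit = base_score(semantic=0.8, outcome_score=0.0, recency=1.0,
                      frequency=0.0, confidence=1.0)
extracted = base_score(semantic=0.8, outcome_score=0.0, recency=1.0,
                       frequency=0.0, confidence=0.7)
```

With these inputs the explicit memory scores about 0.655 and the auto-extracted one about 0.625. The gap of 0.03 is enough to break ties, but small enough that a single "worked" rating (worth 0.025 after normalization and weighting) nearly closes it.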

Why 35% semantic, not 40%? At 40%, semantic similarity dominated too much. A near-perfect semantic match with bad outcome history still ranked high. 35% keeps relevance primary but lets outcome data actually matter.

Why only 25% outcome? I wanted outcome to be the tiebreaker, not the dictator. At 30%+, a memory with two “worked” ratings dominated over a semantically closer memory with no ratings. That felt wrong: relevance should still win when outcome data is sparse.

Category Boosts

When you ask about errors, troubleshooting memories should rank higher:

Query: "Why is the API returning 500 errors?"
Detected intent: troubleshooting

Before boost:
  [pattern] REST API structure           0.72
  [troubleshooting] Connection pool fix  0.68

After boost (troubleshooting × 1.5):
  [troubleshooting] Connection pool fix  1.02  ← promoted
  [pattern] REST API structure           0.72

The boosts:

troubleshooting: 1.5x
gotcha: 1.4x
decision: 1.4x
pattern, convention: 1.3x
architecture, command: 1.2x
workaround: 1.1x
everything else: 1.0x
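
The boost table above could be written as a plain dictionary. This is a sketch under that assumption, and `apply_category_boost` is a hypothetical helper name. It uses `.get` with a 1.0 default so that the "everything else" categories pass through unchanged.

```python
# Multipliers from the list above; unlisted categories fall back to 1.0.
CATEGORY_BOOSTS = {
    "troubleshooting": 1.5,
    "gotcha": 1.4,
    "decision": 1.4,
    "pattern": 1.3,
    "convention": 1.3,
    "architecture": 1.2,
    "command": 1.2,
    "workaround": 1.1,
}


def apply_category_boost(base, memory_category, detected_category):
    """Multiply base by the category boost only when the query intent matches."""
    if detected_category is not None and detected_category == memory_category:
        return base * CATEGORY_BOOSTS.get(memory_category, 1.0)
    return base
```

Note that a boosted score can exceed 1.0 (0.68 × 1.5 = 1.02 in the example above), so the result is only meaningful as a ranking key, not as a probability.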

Outcome Learning

When you mark a memory as “worked” or “failed”:

def record_outcome(memory_id, result):
    memory = get_memory(memory_id)

    deltas = {
        "worked": +0.2,
        "failed": -0.3,
        "partial": +0.05
    }

    memory.outcome_score += deltas[result]
    # Clamp to the valid range [-1.0, 1.0]
    memory.outcome_score = max(-1.0, min(1.0, memory.outcome_score))
    memory.use_count += 1

The asymmetry matters. If “worked” and “failed” had equal magnitude, a memory could oscillate forever. With −0.3 vs +0.2, consistently bad memories sink fast:

Initial:        0.0
failed:        -0.3
failed:        -0.6  ← effectively hidden from results

vs.

Initial:        0.0
worked:         0.2
worked:         0.4
failed:         0.1  ← one failure almost erases two successes

After a few weeks, memories with scores below −0.5 stop appearing in results. I don’t delete them (they’re archived in case the scoring was wrong), but they’re effectively gone.
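
The update rule and the archive cutoff are easy to replay in isolation. This is a standalone sketch using the deltas and the −0.5 threshold from this post; the helper names (`apply_outcome`, `replay`, `is_archived`) are made up for illustration.

```python
OUTCOME_DELTAS = {"worked": 0.2, "failed": -0.3, "partial": 0.05}
ARCHIVE_THRESHOLD = -0.5  # scores below this stop appearing in results


def apply_outcome(score, result):
    """Apply one rating and clamp the result to [-1.0, 1.0]."""
    return max(-1.0, min(1.0, score + OUTCOME_DELTAS[result]))


def replay(results, start=0.0):
    """Return the score after each rating, rounded to 2 places for display."""
    score = start
    trace = [score]
    for result in results:
        score = apply_outcome(score, result)
        trace.append(round(score, 2))
    return trace


def is_archived(score):
    """True once a memory has sunk below the archive threshold."""
    return score < ARCHIVE_THRESHOLD
```

Two straight failures from a fresh memory give `replay(["failed", "failed"])` a final score of −0.6, which is already past the archive threshold. Reaching the +1.0 cap from zero takes five straight "worked" ratings.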

Edge Cases

The “always partial” memory

Some memories are context-dependent. “Use Redis for caching” works for simple cases, fails for complex queries. Users kept marking it “partial.”

Problem: +0.05 per use meant it slowly climbed to a high score despite being unreliable. Solution: memories that haven’t been used in 30 days drift back toward 0, which keeps “partial” spam from permanently inflating scores.
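
The post gives the 30-day idle window but not how fast the drift is, so the sketch below assumes a simple exponential pull toward zero. The 2% per day rate (`DAILY_RETENTION`) and the function name are assumptions.

```python
from datetime import datetime, timedelta

IDLE_DAYS = 30          # from the post: drift starts after 30 unused days
DAILY_RETENTION = 0.98  # assumed: keep 98% of the score per extra idle day


def decayed_score(score, last_used, now):
    """Shrink score toward 0 for each day of idleness beyond IDLE_DAYS."""
    extra_idle_days = (now - last_used).days - IDLE_DAYS
    if extra_idle_days <= 0:
        return score
    return score * (DAILY_RETENTION ** extra_idle_days)


# Example: a memory at 0.6 that was last used 60 days ago.
now = datetime(2025, 1, 1)
example = decayed_score(0.6, now - timedelta(days=60), now)
```

Because the decay is multiplicative, it works in both directions: positive scores fall toward 0 and negative scores rise toward 0, and neither ever changes sign.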

The revenge downvote

User has a bad day, marks five memories “failed” in frustration.

Problem: Legitimate memories get nuked. Solution: Rate limiting on negative feedback. Max 3 “failed” ratings per hour from the same session. No limit on “worked”, since positive feedback doesn’t have the same abuse potential.
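
One way to enforce that limit is a sliding one-hour window per session. This is a sketch, not the project's code: the class name, the in-memory storage, and the choice of a sliding window (rather than fixed hourly buckets) are assumptions.

```python
import time
from collections import defaultdict, deque

MAX_FAILED_PER_HOUR = 3  # from the post
WINDOW_SECONDS = 3600


class FailedRatingLimiter:
    """Tracks recent 'failed' ratings per session in memory."""

    def __init__(self):
        self._events = defaultdict(deque)  # session_id -> timestamps

    def allow(self, session_id, result, now=None):
        """Return True if this rating should be accepted."""
        if result != "failed":
            return True  # "worked" and "partial" are never limited
        now = time.time() if now is None else now
        events = self._events[session_id]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= WINDOW_SECONDS:
            events.popleft()
        if len(events) >= MAX_FAILED_PER_HOUR:
            return False
        events.append(now)
        return True
```

Rejected ratings are not recorded, so a frustrated burst does not extend the lockout. The fourth "failed" is simply refused until the oldest accepted one is more than an hour old.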

The duplicate memory

User adds “use snake_case” and “always use snake_case for Python.” Both are basically the same.

Problem: Search returns both, wasting context window. Solution: Semantic deduplication at insert time. If a new memory has >0.92 cosine similarity to an existing one, prompt for merge instead of creating a duplicate.
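
The insert-time check can be sketched as a scan over existing embeddings. The 0.92 threshold is from this post; the function names and the choice to return only the single closest match are assumptions.

```python
import math

DUPLICATE_THRESHOLD = 0.92  # from the post


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicate(new_embedding, existing):
    """Return (memory_id, similarity) of the closest match above the threshold.

    existing is an iterable of (memory_id, embedding). Returns None when
    nothing is similar enough, in which case the new memory is inserted.
    """
    best = None
    for memory_id, embedding in existing:
        similarity = cosine_similarity(new_embedding, embedding)
        if similarity > DUPLICATE_THRESHOLD and (best is None or similarity > best[1]):
            best = (memory_id, similarity)
    return best
```

When `find_duplicate` returns a match, the caller would show the existing memory and ask whether to merge; how the merge itself combines text and scores is not covered here.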

The stale project memory

Memory says “our API uses OAuth 1.0” but the project migrated to OAuth 2.0 six months ago.

Problem: Outdated memories poison the context. Solution: Project memories inherit staleness from git activity. If the project directory hasn’t had commits in 90 days, those memories get a recency penalty.
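
A sketch of how the git check might work, assuming the last commit time is read with `git log -1 --format=%ct`. The 90-day cutoff is from this post. The size of the penalty (`STALE_PENALTY`), the function names, and the decision to apply no penalty when git data is unavailable are assumptions.

```python
import subprocess
import time

STALE_AFTER_DAYS = 90  # from the post
STALE_PENALTY = 0.5    # assumed multiplier applied to the recency signal


def last_commit_timestamp(project_dir):
    """Unix time of the latest commit, or None if it cannot be determined."""
    try:
        out = subprocess.run(
            ["git", "-C", project_dir, "log", "-1", "--format=%ct"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (subprocess.CalledProcessError, OSError):
        return None  # not a repo, no commits yet, or git is not installed
    return int(out) if out else None


def recency_with_staleness(recency, last_commit_ts, now=None):
    """Scale down the recency signal for projects with no recent commits."""
    if last_commit_ts is None:
        return recency  # no git data: leave the signal untouched
    now = time.time() if now is None else now
    idle_days = (now - last_commit_ts) / 86400
    if idle_days > STALE_AFTER_DAYS:
        return recency * STALE_PENALTY
    return recency
```

This only catches projects that have gone quiet. The OAuth example above, where the project is active but one fact changed, still depends on someone marking the memory "failed".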

The CLI

# Add a memory
mem add "Always use snake_case for Python functions" -c convention

# Search
mem search "naming conventions" --limit 5

# Record outcome
mem outcome mem_abc123 worked

# Get project context
mem context --project /path/to/project

# Stats
mem stats

# Start servers
mem serve --rest --port 8080   # Web UI + REST API
mem serve --mcp                 # MCP server for Cursor/OpenCode

The Web UI

mem serve --rest --port 8080 starts a web interface at localhost:8080.

Dashboard: Total memories, active vs archived, average outcome score. Color-coded category breakdown. Recent memories.

Memories view: Filterable by category, project or text. Sortable columns. Click a memory to see details and record outcomes.

Search: Two modes. Semantic search finds conceptually related memories (“auth” matches “JWT tokens”). Keyword search finds exact matches.

Tasks: Shows tasks from Beads (.beads/ directories) and Claude Code (~/.claude/todos/). Source badges distinguish them. Click to see memories linked to each task.

Theme toggle: Light/dark mode.

[Figure: Memory Layer Dashboard]

[Figure: Memories List with Categories and Scores]

[Figure: Memory Details with Outcome Recording]

[Figure: Search with Semantic and Keyword Modes]

What Your Database Looks Like

After a few weeks:

$ mem stats

Total:     247 memories
Active:    231
Archived:  16 (score < -0.5)

By category:
  troubleshooting   52 (21%)
  convention        45 (18%)
  pattern           38 (15%)
  decision          34 (14%)
  gotcha            28 (11%)
  architecture      22 (9%)
  command           15 (6%)
  workaround        10 (4%)
  preference         3 (1%)

Outcome scores:
  High (>0.5):     43
  Neutral:        188
  Low (<-0.5):     16 (archived)

Database size: 12.4 MB

Results

After 6-12 weeks:

Retrieval quality: Precision went from ~70% (vector similarity only) to ~90%. The memories that kept helping surfaced first. The ones that didn’t faded away.

[Figure: Retrieval precision improvement over 6-12 weeks]

Performance: P50 retrieval is 80ms. P95 is 150ms. SQLite handles it fine.

Storage: 12MB for 250 memories. Projected ~50MB for a year of use. Not a concern.

Token savings: ~50% reduction in context re-explanation at session start. Modest but real.

Time savings: Debugging familiar issues dropped from 10-15 minutes to 2-3 minutes. This is where the real value showed up.

Lessons

SQLite is enough. I kept thinking I’d need to upgrade to something more sophisticated. I haven’t.

Hybrid search matters. Pure vectors miss obvious keyword matches. BM25 + vectors beats either alone.

Categories create real value. The ability to boost troubleshooting memories when someone asks about errors makes a noticeable difference in relevance.

Asymmetric scoring feels right. Failures should cost more. Users intuitively agree when I explain it.

The Web UI builds trust. Seeing your memories visualized, especially seeing which ones have high scores, makes the system feel less like a black box.

Memory Layer is open source at github.com/runtimenoteslabs/memory-layer.