DeepSeek Engram: Conditional Memory via Scalable N-gram Lookup

Paper: Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
Repository: deepseek-ai/Engram
Released: January 2026

Overview

(Figure: DeepSeek Engram Architecture)

Problem Statement

Traditional LLMs waste computational resources by using neural networks to “re-derive” static factual knowledge on every inference. Parameters and FLOPs that could serve reasoning are instead spent recomputing facts that never change.

Core Innovation: Conditional Memory

Engram introduces conditional memory as a complementary axis to neural computation:

| Approach     | Purpose                  | Complexity | Storage    |
|--------------|--------------------------|------------|------------|
| Neural (MoE) | Dynamic reasoning        | O(n)       | HBM/GPU    |
| Engram       | Static knowledge lookup  | O(1)       | System RAM |

Architecture Components

1. Tokenizer Compression

Compresses equivalent token variants to a canonical form, so that surface variations of the same phrase hash to the same memory entry.
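
A minimal sketch of what such canonicalization could look like (the function name and the exact normalization rules are assumptions, not taken from the paper):

```python
import unicodedata

def canonicalize(token: str) -> str:
    """Collapse case and Unicode presentation variants into one canonical
    form, so equivalent surface tokens map to the same hash key."""
    return unicodedata.normalize("NFKC", token).lower()

# "Paris", "PARIS", and "paris" all canonicalize to the same key.
```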

2. N-gram Extractor

Extracts sliding windows of tokens (n-grams) from input sequence for hash-based lookup.
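
Sliding-window extraction can be sketched in a few lines (function name and the choice of orders are illustrative):

```python
def extract_ngrams(tokens, orders=(2, 3)):
    """Collect every sliding-window n-gram of the given orders."""
    return [tuple(tokens[i:i + n])
            for n in orders                      # one pass per n-gram order
            for i in range(len(tokens) - n + 1)]  # slide the window

# extract_ngrams(["the", "cat", "sat"], orders=(2,))
# -> [("the", "cat"), ("cat", "sat")]
```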

3. Multi-Head Hashing

Uses K distinct hash heads per n-gram order to reduce collisions:

```
memory_vector = concat(
    hash_head_1(ngram) → embedding_1,
    hash_head_2(ngram) → embedding_2,
    ...
    hash_head_K(ngram) → embedding_K
)
```
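
The scheme above can be sketched concretely as follows. The hash family here (BLAKE2b with a per-head salt) is an assumption for illustration; the paper's actual hash functions are not specified in this note:

```python
import hashlib
import numpy as np

def hash_head(ngram, head_id, table_size):
    """One deterministic hash head: salt the n-gram with the head id so
    the K heads index different rows, reducing collision damage."""
    key = (str(head_id) + "|" + " ".join(ngram)).encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % table_size

def lookup(ngram, tables):
    """Concatenate the K per-head embeddings retrieved for one n-gram."""
    return np.concatenate([t[hash_head(ngram, k, len(t))]
                           for k, t in enumerate(tables)])
```

Because each head uses an independent index, a collision in one head only corrupts its slice of the concatenated vector rather than the whole memory.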

4. Embedding Lookup Table

Each hash index addresses a row in a large embedding table held in system RAM, so retrieval is an O(1) lookup rather than a neural forward pass.

5. Context-Aware Gating

The gating mechanism uses context from the attention layers to decide how much of each retrieved memory to admit into the residual stream.

This prevents hallucination from irrelevant or conflicting memorized facts.
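
One plausible form of such a gate is a sigmoid computed from the hidden state and the retrieved memory; the module below is a hedged sketch (class and layer names are assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class MemoryGate(nn.Module):
    """Context-aware gating sketch: the current hidden state decides how
    much of a retrieved memory vector to admit."""

    def __init__(self, d_model: int, d_mem: int):
        super().__init__()
        self.proj = nn.Linear(d_mem, d_model)      # map memory into model space
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, hidden: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        mem = self.proj(memory)
        # Gate in (0, 1): values near 0 suppress irrelevant or conflicting memories.
        g = torch.sigmoid(self.gate(torch.cat([hidden, mem], dim=-1)))
        return hidden + g * mem
```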

Key Finding: Optimal Sparsity Split

DeepSeek discovered a U-shaped scaling law for allocating parameters:

| Allocation                    | Performance |
|-------------------------------|-------------|
| 100% Neural                   | Baseline    |
| 75–80% Neural + 20–25% Engram | Optimal     |
| 60% Neural + 40% Engram       | Suboptimal  |

“Reallocating roughly 20%–25% of the sparse parameter budget to Engram yields the best performance.”

Results (Engram-27B)

| Benchmark Type    | MoE Baseline | Engram-27B | Improvement |
|-------------------|--------------|------------|-------------|
| Complex Reasoning | 70%          | 74%        | +4 pts      |
| Knowledge Tests   | 57%          | 61%        | +4 pts      |
| Code Generation   | Baseline     | Improved   | Consistent  |
| Mathematics       | Baseline     | Improved   | Consistent  |

Tested on: Big-Bench Hard, ARC-Challenge, MMLU

Hardware Implications

| Resource    | Traditional LLM   | Engram LLM             |
|-------------|-------------------|------------------------|
| HBM Usage   | High (all params) | Reduced (compute only) |
| System RAM  | Minimal           | Used for Engram tables |
| GPU Compute | All operations    | Reasoning only         |

This decouples memory scaling from expensive HBM, enabling larger knowledge bases at lower cost.
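
The decoupling can be illustrated with a toy sketch (sizes are illustrative, not from the paper): the Engram table stays in cheap system RAM, and only the rows a batch actually looks up ever move to the accelerator.

```python
import numpy as np

# Large lookup table lives in host RAM (~25 MB here; real tables can be far larger).
table = np.random.randn(100_000, 64).astype(np.float32)

indices = np.array([3, 17, 42])   # indices produced by the hash heads
rows = table[indices]             # O(1) gathers on the CPU
# Only this small (3, 64) slice would be transferred to the GPU,
# so knowledge capacity scales with RAM, not HBM.
```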

When to Use

Implementation

```shell
# From DeepSeek's demo repository
pip install torch numpy transformers sympy
python engram_demo_v1.py
```
| Pattern          | Relationship                             |
|------------------|------------------------------------------|
| Agentic RAG      | Alternative knowledge-retrieval approach |
| MoE Architecture | Neural backbone that Engram augments     |

Sources