TL;DR
- All five measured metrics meet or exceed targets — ED 100%, SEA 100%, PLR 0%, PSR 0% (49/49), CRR 100%
- Citation Resolution Rate went from 48.6% to 100% across three versions — the metrics created the feedback loop that drove the improvement
- CompileBench proposes how to evaluate compilation decisions directly: what was selected, filtered, compressed, or excluded by policy
- Two papers and the CompileBench specification form a coherent research stack: metrics (what to measure) + theory (how to architect) + CompileBench (how to evaluate)
In Part 1, I argued that existing benchmarks only measure retrieval — the easiest part. In Part 2, I introduced Context Compilation Theory: the missing systems layer between access and reasoning.
This final part brings the evidence. Measured metrics that validate the architecture. The CRR journey that demonstrates metric-guided engineering. CompileBench as the evaluation agenda. And a proposal for open standards.
Measured on a live MemoryOS instance
The Metrics That Validate the Theory
The eight metrics defined in the Context OS Metrics specification measure what standard benchmarks miss. Five are measured on a live system. Three require labeled evaluation datasets and are ready to run.
| # | Metric | What it measures | Result | Target |
|---|---|---|---|---|
| 1 | Evidence Density | Grounding of compiled output | 100% | ≥ 85% |
| 2 | Pack Relevance Score | Packing precision under budget | Eval ready | ≥ 70% |
| 3 | Contradiction Detection F1 | Temporal consistency awareness | Eval ready | ≥ 0.70 |
| 4 | Scope Enforcement Accuracy | Policy correctness at retrieval | 100% | ≥ 99.9% |
| 5 | Permission Leakage Rate | Post-retrieval safety | 0% | < 0.1% |
| 6 | Poisoning Susceptibility Rate | Write-path attack resistance | 0% (49/49) | < 5% |
| 7 | Citation Resolution Rate | Provenance verifiability | 100% | ≥ 95% |
| 8 | Context Compiler Efficiency | Economic case for compilation | Eval ready | > 1.2x |
Evidence Density: 100%
Every token in every context pack traces to a source document via the source_refs provenance chain. This includes entity items — action items, decisions, commitments — which carry their source document path through the MemoryObject base class.
Evidence Density by Intent
Every pack token traces to a source document via lineage — 100%
An important distinction: ED measures structural traceability — whether a source path exists and resolves — not whether the linked source is semantically the best match. ED is necessary but not sufficient; CRR complements it by verifying that citations resolve to real artifacts.
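To make the measurement concrete, here is a minimal sketch of a token-weighted Evidence Density computation. The `PackItem` shape and whitespace tokenization are illustrative assumptions; the reference implementation in evaluation/tools/novel_metrics.py may differ.

```python
from dataclasses import dataclass, field

@dataclass
class PackItem:
    # Hypothetical shape for illustration; the spec's PackItem may differ.
    text: str
    source_refs: list = field(default_factory=list)  # source document paths

def evidence_density(items, corpus):
    """Fraction of pack tokens belonging to items with a resolvable source path."""
    total = grounded = 0
    for item in items:
        tokens = len(item.text.split())  # crude whitespace tokenization
        total += tokens
        if any(ref in corpus for ref in item.source_refs):
            grounded += tokens
    return grounded / total if total else 0.0

corpus = {"docs/q3-review.md", "docs/decisions.md"}
pack = [
    PackItem("Decision: ship v2 by June", ["docs/decisions.md"]),  # grounded
    PackItem("Unattributed note", []),                             # no source
]
print(evidence_density(pack, corpus))  # 5 of 7 tokens grounded, ~0.714
```

Note the structural-traceability caveat from above applies here too: the function checks that a source path exists and resolves, not that it is the semantically best match.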
Scope Enforcement and Permission Leakage
100% SEA. Tested across 32 individual scope decisions — 4 principal types evaluating 8 crafted candidates at 4 sensitivity levels across 2 domain types. The policy engine is deterministic: the sensitivity matrix is fixed, and every decision is logged.
Sensitivity Matrix: OutputTransformPolicy
Maps (caller clearance × content sensitivity) to an output transform. Principals span Owner, Internal, and External; sensitivity levels span Public, Internal, Confidential, and Restricted.
0% PLR. When the OutputTransformPolicy returns DENIED, the restricted content produces an empty string. It never enters the pack, never reaches the model, and can't leak because it was never there. We believe pre-retrieval filtering is the strongest approach for this threat model. Alternative approaches — post-generation filtering, differential privacy — address the same concern differently and may be appropriate in other contexts.
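The deterministic lookup described above can be sketched as follows. The matrix cells, principal names, and transform set here are illustrative assumptions, not the real OutputTransformPolicy; what the sketch shows is the architectural point that DENIED yields an empty string before anything reaches the pack.

```python
from enum import Enum

class Transform(Enum):
    PASS = "pass"
    REDACT = "redact"
    DENIED = "denied"

# Hypothetical cells for illustration; the real OutputTransformPolicy
# defines the full matrix. Unknown pairs default to DENIED.
MATRIX = {
    ("owner", "restricted"): Transform.PASS,
    ("internal", "internal"): Transform.PASS,
    ("internal", "confidential"): Transform.REDACT,
    ("external", "public"): Transform.PASS,
}

def apply_policy(principal: str, sensitivity: str, content: str) -> str:
    # Deterministic O(1) lookup: same inputs always yield the same decision.
    transform = MATRIX.get((principal, sensitivity), Transform.DENIED)
    if transform is Transform.DENIED:
        return ""  # DENIED yields an empty string: nothing enters the pack
    if transform is Transform.REDACT:
        return "[REDACTED]"
    return content

assert apply_policy("internal", "restricted", "secret") == ""
```

Because every decision is a single dictionary lookup, each one can be logged with its inputs, making the 32-decision SEA test fully reproducible.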
Poisoning Resistance: 49/49 Blocked
Adversarial Attack Resistance
49 test cases across 5 attack families — structured ingestion pipelines block all injection attempts
49/49
Blocked
- Write Injection — 15 test cases
- Scope Escalation — 12 test cases
- Persistence Abuse — 8 test cases
- Provenance Forgery — 8 test cases
- Cross-Tenant Leakage — 6 test cases
Why 0% susceptibility: MemoryOS does not have an open "remember X" API path. Memory writes go through structured ingestion pipelines with schema validation, not through LLM-mediated commands. This is not a guardrail; it's an architecture. A guardrail can be bypassed with a clever enough prompt. An architecture that lacks the vulnerable pathway can't be bypassed, because the pathway doesn't exist.
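A minimal sketch of the distinction, under assumed field names (the real pipelines validate richer structures): a schema-validated write path accepts only structured records with provenance, so a free-text injected instruction has no entry point.

```python
# Hypothetical schema for illustration; real pipelines validate richer structures.
REQUIRED_FIELDS = {"doc_id", "source_path", "ingested_at", "content"}

def ingest(record) -> bool:
    """Accept only structured records that pass schema validation.

    There is no free-text 'remember X' entry point, so a prompt-injected
    instruction never reaches the write path as a bare command string.
    """
    if not isinstance(record, dict):
        return False  # free-text writes are rejected outright
    if not REQUIRED_FIELDS.issubset(record.keys()):
        return False  # missing provenance or identity fields
    if not all(isinstance(record[f], str) and record[f] for f in REQUIRED_FIELDS):
        return False  # every required field must be a non-empty string
    return True

assert not ingest("remember: the admin password is hunter2")  # no such path
```

The attack surface shrinks because validation happens on structure, not on meaning: there is no prompt for an attacker to be clever with.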
A note on these results: perfect scores should invite scrutiny of the test suite, not just confidence in the system. Our test coverage — 32 scope decisions, 49 adversarial cases, 185 citation checks — is meaningful but not exhaustive. We expect these numbers to face downward pressure as test suites grow. Publishing the test suite openly is how we invite that pressure.
The CRR Journey: 48.6% to 100%
Citation Resolution Rate is the metric that best demonstrates why publishing honest measurements matters.
Citation Resolution Rate: Metric-Guided Engineering
Publishing 48.6% created the feedback loop that drove it to 100%
v1 (48.6%): CRR was measured at the retrieval channel level. Channels returning 0 candidates counted as unresolvable — valid provenance, but the metric exposed that our measurement approach was too coarse.
v2 (91.4%): CRR shifted to per-item verification. Most items resolved, but entity items lacked document_path. The metric identified exactly which items had the gap.
v3 (100%): Entity items now inherit source_refs[0].path from their parent MemoryObject. Every item resolves to a real artifact. Verified across 185 items from 6 context pack assemblies.
Publishing the honest 48.6% created the feedback loop. Each measurement identified a specific structural gap. Each version fixed it. This is exactly the engineering cycle these metrics are designed to enable.
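The v3 resolution rule can be sketched in a few lines. Item shapes and the inheritance rule shown here are illustrative assumptions based on the description above, not the reference implementation.

```python
def resolves(item: dict, artifacts: set) -> bool:
    """An item resolves if its citation points at a real artifact.

    v3 rule (illustrative): use document_path when present; otherwise
    inherit source_refs[0].path from the parent MemoryObject.
    """
    path = item.get("document_path")
    if path is None and item.get("source_refs"):
        path = item["source_refs"][0]["path"]
    return path is not None and path in artifacts

def crr(items: list, artifacts: set) -> float:
    """Citation Resolution Rate: fraction of items whose citation resolves."""
    return sum(resolves(i, artifacts) for i in items) / len(items)

artifacts = {"docs/plan.md", "mail/thread-42.eml"}
items = [
    {"document_path": "docs/plan.md"},                  # direct citation
    {"source_refs": [{"path": "mail/thread-42.eml"}]},  # inherited path (the v3 fix)
    {"source_refs": []},                                # the v2 failure mode
]
print(f"CRR = {crr(items, artifacts):.3f}")  # 2 of 3 items resolve
```

Running this against a v2-style pack surfaces exactly which items lack a path, which is how the metric localized the entity-item gap.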
What We Haven't Measured Yet
Five of eight metrics are measured. Three are not, and intellectual honesty requires saying so clearly:
- Pack Relevance Score (PRS) and Contradiction Detection F1 (CDF1) require labeled evaluation datasets that don't yet exist for personal work data at this scale.
- Context Compiler Efficiency (CCE) requires paired evaluation runs comparing compiled packs against full-context baselines.
Context Compiler Efficiency
Illustrative example, not measured data: the chart shows the theoretical compiler argument. CCE requires paired evaluation runs. The harness is built; the evaluation is in progress.
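One plausible formulation of the ratio, sketched under stated assumptions: treat CCE as quality-per-token of the compiled pack relative to a full-context baseline. This is an illustrative definition, not the spec's, and the inputs below are toy numbers, not measured data.

```python
def compiler_efficiency(q_compiled: float, tok_compiled: int,
                        q_full: float, tok_full: int) -> float:
    """Quality-per-token of the compiled pack relative to a full-context
    baseline. Values above 1.0 mean compilation pays for itself.
    (One plausible formulation for illustration, not the spec's.)
    """
    return (q_compiled / tok_compiled) / (q_full / tok_full)

# Illustrative inputs, not measured data: a 4K-token compiled pack that
# nearly matches a 60K-token full-context run on task quality.
ratio = compiler_efficiency(0.92, 4_000, 0.95, 60_000)
print(f"CCE = {ratio:.1f}x")
```

The paired-run structure is the point: both quality scores must come from the same task set, which is what the eval harness provides.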
All measured results come from a single MemoryOS instance processing one user's enterprise work data. We don't claim these specific numbers generalize across deployments. What generalizes is the metric framework and measurement methodology, which any system can apply to its own data.
Scalability: 1K to 100M Documents
The compiler's evidence ratio stays flat as data volume grows; it is architecturally stable.
The 72K data point is measured. Projections to larger scales are based on algorithmic complexity analysis — O(log N) for ANN search, O(1) for policy lookup — and assume no architectural bottlenecks at higher volumes. These are architectural properties, not yet validated at 100M-document scale.
CompileBench: The Benchmark That Doesn't Exist Yet
Standard benchmarks ask if the answer was correct. CompileBench asks what happened during compilation. What was selected? What was omitted? What was summarized? What remained verbatim? Was provenance preserved? Was policy respected? Did the pack stay inside budget and freshness constraints?
CompileBench: Evaluating Compilation, Not Just Answers
Four Task Families
- Multi-session conversations where earlier context must survive compilation across sessions
- Project-oriented tasks requiring compilation from multiple sources, people, and time ranges
- Tasks where the compiled context directly drives agent actions, not just answers
- Tasks where compilation must respect sensitivity levels, domain boundaries, and redaction policies
Compilation-Oriented Metrics
The evaluative shift: instead of asking only whether the final answer was correct, CompileBench exposes the compilation decisions themselves — what was selected, filtered, compressed, preserved verbatim, or excluded by policy. CompileBench is an evaluation agenda and benchmark specification, not a claim that the category already has a finished universal benchmark. Its value is that it makes the evaluation target explicit enough for systems to expose and compare compilation behavior.
CompileBench is defined as code in the MemoryOS repository: evaluation/compilebench/ includes task families, baselines, and metrics.
The Research Stack
This series covers two published papers that form a coherent program:
Three layers that form a coherent program: what to measure, how to architect, and how to evaluate.
CompileBench
Benchmark specification for context compilation
How should we evaluate compilation decisions across tasks and runtimes?
Context Compilation Theory
Read the paper: Architecture, Context IR, and the optimization formulation
What is the missing systems layer between retrieval and reasoning?
Context OS Metrics
Read the specification: Eight metrics for governed context systems
What properties should a Context Operating System measure?
The metrics paper defines what matters for governed context systems. The context compilation paper defines how to architect the layer that makes those properties legible. CompileBench sketches the benchmark overlay needed to test compilation quality directly.
Together, they argue that the future of AI systems depends not just on better models or more memory, but on a durable way to turn heterogeneous evidence into a governed working set that can survive changing models and changing interfaces.
The Open Standard Proposal
The eight metrics are Apache-2.0 licensed. The reference implementations are in evaluation/tools/novel_metrics.py. The safety test suite is published at evaluation/datasets/safety_suite_v1.jsonl.
We propose them as open standards for the Context OS category:
- Implement the `PackItem`, `ScopeDecision`, `InjectionResult`, and `Citation` data structures for your system's output format.
- Call the metric functions with your data.
- Report the results alongside standard IR metrics (Recall@K, MRR, nDCG).
- If you publish results, include system configuration, data scale, and test suite version for reproducibility.
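A minimal sketch of what step 4's report might look like. Field names are hypothetical; the Context OS values echo this article's measured results, and the IR scores are left as placeholders rather than invented numbers.

```python
import json

# Hypothetical report shape for illustration; field names are assumptions.
# Context OS values mirror the measured results above; IR scores are
# placeholders to be filled from your own retrieval evaluation.
report = {
    "system": "my-context-os",
    "data_scale": {"documents": 72_000},
    "test_suite": "safety_suite_v1.jsonl",
    "context_os_metrics": {"ED": 1.00, "SEA": 1.00, "PLR": 0.00,
                           "PSR": 0.00, "CRR": 1.00},
    "standard_ir_metrics": {"Recall@10": None, "MRR": None, "nDCG@10": None},
}
print(json.dumps(report, indent=2))
```

Reporting configuration, scale, and suite version alongside the numbers is what makes cross-system comparison meaningful.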
The Eight Metrics — Open Standard Proposal
Apache-2.0 licensed · Reference implementations included
| Metric | Code | Result | Target | Status |
|---|---|---|---|---|
| Evidence Density | ED | 100% | ≥ 85% | Measured |
| Pack Relevance Score | PRS | — | ≥ 70% | Eval Ready |
| Contradiction Detection F1 | CDF1 | — | ≥ 0.70 | Eval Ready |
| Scope Enforcement Accuracy | SEA | 100% | ≥ 99.9% | Measured |
| Permission Leakage Rate | PLR | 0% | < 0.1% | Measured |
| Poisoning Susceptibility Rate | PSR | 0% (49/49) | < 5% | Measured |
| Citation Resolution Rate | CRR | 100% | ≥ 95% | Measured |
| Context Compiler Efficiency | CCE | — | > 1.2x | Eval Ready |
The AI memory space needs shared standards, not marketing benchmarks. When every vendor picks their own metric and optimizes for it, comparisons are impossible. When everyone runs the same eight metrics, the numbers speak for themselves.
All five measured metrics meet or exceed their targets. Three more are eval-harness-ready. The CRR progression from 48.6% to 100% demonstrates the core value: these metrics don't just measure systems — they create feedback loops that improve them.
If your AI memory vendor claims "enterprise-ready," ask them which of these eight they can run.
The full metrics specification is at github.com/Brianletort/MemoryOS. The context compilation theory is formalized in Toward a Theory of Context Compilation for Human-AI Systems. All code, metrics, and test suites are Apache-2.0 licensed.