Research Reproducibility

The Replication Laboratory

AI-assisted research replication at scale — three specialized agent roles collaborate through infrastructure-enforced separation to systematically reproduce published findings.

The Replication Laboratory orchestrates three AI roles — an Advisor (strategic oversight), a Research Assistant (tactical execution), and an independent Reviewer (skeptical peer review) — across ephemeral container sessions, producing archival-quality documentation as an operational side effect. Every decision, code change, and intermediate result is captured in a git history that is the project’s memory. Four paper replications have been completed to date.

The replication crisis and the AI opportunity

The scientific community faces a well-documented replication crisis. Studies across psychology, medicine, economics, and machine learning have found that a substantial fraction of published findings cannot be reproduced by independent researchers. Most failures stem not from fraud but from undocumented implementation details, missing preprocessing steps, unreported hyperparameter choices, hardware-dependent numerical precision, and data access barriers.

Traditional academic incentives actively discourage replication work. Graduate students need novel publications, not confirmatory studies. Journals prefer exciting new results over careful replications. The result: published claims accumulate without systematic verification.

Replication is under-incentivized in academia. Spending six months replicating someone else's work is career poison. An AI-assisted lab that compresses timelines from months to days could structurally change the economics.

AI-assisted replication changes the equation in several ways:

Economic transformation

Compressing a six-month replication to a few days changes the cost-benefit calculation. What was career poison becomes tractable background work.

Complete documentation

Every decision, intermediate result, and code change is captured in git history and structured reports. Failed replications are as valuable as successful ones.

Cognitive decomposition

The strategic/tactical split mirrors how humans think about replication. “Should we replicate this claim?” is a different question from “How do we run this code on our cluster?”

A public good

A library of structured replication attempts with full provenance — including failures — is genuinely valuable to science.

Container-per-session

The key architectural decision: every interaction between roles occurs in a fresh Docker container with fresh context. This enforces genuine cognitive separation — the Advisor reviewing a phase report truly does not have access to the RA’s context from producing it. This isn’t enforced by prompt engineering (“pretend you don’t know”) but by infrastructure — the context literally does not exist in the new container.

The host (an HPC cluster or EC2 instance) runs two persistent pieces. The orchestrator, a Python process that runs continuously, watches git, polls SLURM, launches containers, and enforces checkpoints. Persistent storage is a Docker volume holding the project repository (paper/, plan/, code/, data/, results/, docs/, comments/, and .replication-state.toml). That volume is mounted into each ephemeral session container, which runs Claude Code with --dangerously-skip-permissions in exactly one role (Advisor or RA), loads its system prompt from prompts/advisor.md or prompts/ra.md, can SSH to the SLURM headnode, and operates under an enforced token/time budget. The container dies after the session; the next baton-pass spins up a new container with fresh context.

This solves multiple problems simultaneously:

Genuine context separation

The Advisor reviewing a phase report truly does not have the RA’s execution context. There is no context to leak because it doesn’t exist in the new container.

Sandboxed autonomy

Each container runs Claude Code with full permissions, safely, because the container is jailed with controlled volume mounts and network access. The agent can do whatever it needs within its sandbox.

Forensic cleanliness

The git diff between container start and container death is a complete record of what that session did. Nothing leaks between sessions except what is explicitly committed.

Natural phase boundaries

Container lifecycle is phase lifecycle. No need for complex in-session state management or hand-wavy “pretend you’re a new agent” prompting.

Git as shared memory

The project’s git repository is the only persistent state. It serves as long-term memory (each session bootstraps by reading the state file and recent reports), communication channel (the Advisor and RA “talk” by writing documents and committing them), audit trail (every commit is tagged with role, phase, and timestamp), and handoff mechanism (the state file records who has the baton).

a8f3d2c [advisor] Final replication report approved
e7c4b1a [advisor] Synthesis: Final report generated
9b2e5f3 [ra] Phase 4: Comparison analysis complete
7d1c8a4 [advisor] Phase 4 review: APPROVED
5e9f2b1 [ra] Phase 3: Baseline model training complete
3a7d4c2 [advisor] Phase 3 review: APPROVED - results within tolerance
1f8e6b3 [ra] Phase 2: Data preprocessing complete
4c2a9d1 [advisor] Phase 2 review: APPROVED
8e5b3f2 [ra] Phase 1: Environment setup complete
2d7f1c4 [advisor] Tactical workplan created
6b9e4a3 [human] Plan approved: proceeding with execution
9c1d8f5 [advisor] Replication plan created
7a4e2b1 [init] Project initialized: feng-2024-concept-binding
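
As a concrete sketch of that handoff mechanism (assuming hypothetical helper names rather than the lab's actual internals), a session might end roughly like this:

# Sketch: end-of-session baton-pass commit (hypothetical names, not the project's actual code).
import subprocess
import datetime

def end_session(repo_dir: str, role: str, phase: str, message: str, next_role: str) -> None:
    """Commit the session's work with a role-tagged message and hand the baton off."""
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat(timespec="seconds")
    subprocess.run(["git", "-C", repo_dir, "add", "-A"], check=True)
    subprocess.run(
        ["git", "-C", repo_dir, "commit",
         "-m", f"[{role}] {message}",
         "-m", f"phase: {phase}  time: {stamp}  next: {next_role}"],
        check=True,
    )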
Information bottleneck as feature

The serialization-to-disk requirement forces complete externalization of reasoning. When the RA knows its report will be read by a different agent without shared context, it must document everything. When the Advisor writes instructions for literal execution, it must be precise. Asymmetric information forces epistemic hygiene.

Three roles, infrastructure-enforced separation

The system prompts are the highest-leverage component — carefully designed instructions that encode all architectural discipline and behavioral patterns. Each role has a distinct persona, responsibilities, decision authority, and workflow. Critically, each runs in its own ephemeral container — the separation is not a prompt instruction but a hard infrastructure boundary.

Advisor

Strategic oversight. Plans replications, reviews phase reports, enforces scope discipline, makes calls on scientific interpretation, and synthesizes the final report. Never executes code — reads reports, diffs, and the state file. Its prompt states: “If it’s not in the phase report, you don’t know it happened.”

Research Assistant

Tactical execution. Writes code, runs experiments, submits SLURM jobs, collects results, and documents everything in phase reports. Decides implementation details but escalates scope questions to the Advisor. Up to 3 retry attempts on failures before escalating.

Reviewer

Independent peer review. Sees the full body of work for the first time after the Advisor/RA team finishes. Reads the paper, examines code and results, cross-references claims against evidence, and writes a structured peer review with a verdict: ACCEPT, MAJOR REVISIONS, or REJECT.

The last line of defense

The Reviewer (“Reviewer #2”) has no loyalty to the Advisor or RA. It did not plan, execute, or review phases during the replication. Its explicit mandate: “Find weaknesses. Identify hallucinated results. Catch fudged metrics. Expose confirmation bias. Flag scope narrowing. Do NOT be nice. Be fair but rigorous.” The Advisor and RA have built momentum and may have unconsciously lowered the bar. The Reviewer has not.

Scope enforcement

The Advisor’s central job is preventing scope creep. Common failure modes: the RA finds an interesting discrepancy and wants to investigate 17 variations; the RA notices additional claims beyond stated scope; the RA gets distracted by tooling improvements. The Advisor catches these in review and redirects focus, logging tangential findings in an appendix for future work rather than pursuing them.

The human as third voice

The human participates asynchronously by dropping comment files into comments/ at any time. At each phase boundary, the Advisor checks for new comments and incorporates relevant ones. The human uses the same file-mediated protocol as the agents — a third voice in the conversation, not a supervisor peering over the agent’s shoulder.
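
A minimal sketch of the comment check at a phase boundary, assuming a hypothetical convention in which already-processed comment filenames are tracked in project state:

# Sketch: detect unprocessed human comments at a phase boundary (hypothetical helper).
from pathlib import Path

def new_comments(repo_dir: str, processed: set[str]) -> list[Path]:
    """Return comment files the Advisor has not yet incorporated."""
    comments_dir = Path(repo_dir) / "comments"
    if not comments_dir.exists():
        return []
    return sorted(p for p in comments_dir.glob("*.md") if p.name not in processed)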

From initialization to completion

A replication project moves through a well-defined state machine. Each transition is a container session; each session reads the current state, does its work, commits results, and updates the state file for the next session.

Project initialization
Human — replab init --paper ./paper.pdf --scope "..."
Planning
Advisor — reads paper, creates replication plan with hypotheses and success criteria
Plan review
Human checkpoint — replab approve --phase plan
Tactical workplan
RA — breaks plan into detailed phases with entry/exit criteria
Execution loop
RA executes phase → Advisor reviews → APPROVED → next phase (autonomous)
Synthesis
Advisor — synthesizes all findings into final report
Independent peer review
Reviewer — skeptical assessment of the full body of work, structured verdict
Final review
Human checkpoint — replab approve --phase final
Complete
Archival replication report with full provenance and peer review
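
The same flow can be sketched as a small transition table; the phase and role names below are assumptions for illustration, not the orchestrator's actual data structures:

# Sketch: replication state machine as a transition table (assumed phase and role names).
# Maps current phase -> (role that acts next, next phase).
TRANSITIONS = {
    "init":              ("advisor",  "planning"),
    "planning":          ("human",    "plan_review"),       # checkpoint
    "plan_review":       ("ra",       "tactical_workplan"),
    "tactical_workplan": ("ra",       "execution"),
    "execution":         ("advisor",  "phase_review"),      # RA executes, Advisor reviews
    "phase_review":      ("ra",       "execution"),         # or "synthesis" once all phases pass
    "synthesis":         ("reviewer", "peer_review"),
    "peer_review":       ("human",    "final_review"),      # checkpoint
    "final_review":      (None,       "complete"),
}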

Graduated autonomy

The orchestrator enforces configurable human checkpoints — phase transitions where execution pauses for approval. The default for v0.1: plan approval and final report review require human oversight; all intermediate phase transitions (RA → Advisor review, Advisor → RA execution) are autonomous.

Checkpoint configuration spans a spectrum: fully manual (every phase requires human approval), the v0.1 default (plan and final review require a human), and fully autonomous (no checkpoints; research use only).
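
A minimal sketch of how such a checkpoint configuration might be consulted before each transition (the configuration shape is assumed, not the documented schema):

# Sketch: graduated-autonomy check (assumed configuration shape).
DEFAULT_CHECKPOINTS = {"plan_review": True, "final_review": True}  # v0.1 default

def requires_human(phase: str, checkpoints: dict[str, bool] = DEFAULT_CHECKPOINTS) -> bool:
    """Pause for human approval only at configured checkpoint phases."""
    return checkpoints.get(phase, False)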

SLURM integration

Research replication often requires long-running GPU jobs. The RA never waits for SLURM jobs — it uses a fire-and-forget pattern: submit the batch script via SSH to the headnode, record the job ID in the state file, commit, and exit. The orchestrator polls sacct every 60 seconds. When the job completes (or fails), it launches a new RA session with that information. If the job fails, the RA decides how to handle it: resubmit with different resources, debug, or escalate.

Stateless orchestrator

The orchestrator is approximately 300 lines of Python. All state lives in .replication-state.toml — the orchestrator can be killed and restarted without losing progress. Its SLURM watcher is about 50 lines. It doesn’t need to be clever — just reliable.
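
A sketch of what that watcher could look like, using sacct's standard parseable output; the job-handling details are illustrative assumptions rather than the lab's actual code:

# Sketch: minimal SLURM watcher. sacct -n -P -o State prints pipe-separated state rows.
import subprocess
import time

TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY"}

def job_state(job_id: str, headnode: str) -> str:
    """Query the job's state over SSH to the SLURM headnode."""
    out = subprocess.run(
        ["ssh", headnode, "sacct", "-j", job_id, "-n", "-P", "-o", "State"],
        capture_output=True, text=True, check=True,
    ).stdout
    states = [line.split("|")[0].strip() for line in out.splitlines() if line.strip()]
    return states[0] if states else "UNKNOWN"

def watch(job_id: str, headnode: str, poll_seconds: int = 60) -> str:
    """Poll until the job reaches a terminal state, then return that state."""
    while True:
        state = job_state(job_id, headnode)
        if any(state.startswith(t) for t in TERMINAL):
            return state  # the orchestrator then launches a new RA session with this result
        time.sleep(poll_seconds)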

Components

Container system

Ubuntu 24.04 LTS base with Claude Code, Pandoc + LaTeX toolchain (texlive, pdflatex), the Python scientific stack (numpy, pandas, scipy, matplotlib, scikit-learn, PyTorch, transformers), and standard build tools. The build script detects the available container runtime and supports Docker, Podman, and Singularity/Apptainer for HPC environments.

At runtime, the container gets three volume mounts: the project repository (read-write), system prompts (read-only), and SSH keys for SLURM access (read-only). It can read and write project files, SSH to the SLURM headnode, and access the internet for package downloads. It cannot modify files outside the workspace, access the host filesystem, or persist state between sessions.
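
A sketch of how a session container could be launched with those three mounts; the runtime detection, image tag, and mount paths are illustrative assumptions:

# Sketch: launch an ephemeral session container (illustrative image tag and paths).
import shutil
import subprocess

def launch_session(project_dir: str, prompts_dir: str, ssh_dir: str, role: str) -> int:
    """Start a one-shot container for a single role; the container is removed on exit."""
    runtime = next((r for r in ("docker", "podman") if shutil.which(r)), None)
    if runtime is None:
        raise RuntimeError("no supported container runtime found")
    cmd = [
        runtime, "run", "--rm",
        "-v", f"{project_dir}:/workspace:rw",    # project repository (read-write)
        "-v", f"{prompts_dir}:/prompts:ro",      # system prompts (read-only)
        "-v", f"{ssh_dir}:/home/agent/.ssh:ro",  # SSH keys for SLURM access (read-only)
        "-e", f"REPLAB_ROLE={role}",
        "replication-lab:latest",                # assumed image tag
    ]
    return subprocess.run(cmd).returncode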

CLI tool

The human interface is a Python dispatcher with four commands:

replab init

Initialize a project: create directory structure, copy the paper, set up the git repo and state file.

replab run

Start the orchestrator loop. Can run to the next checkpoint, as a background daemon, or with a session limit.

replab status

Display current phase, budget usage, pending SLURM jobs, and activity timeline. Supports live-updating watch mode.

replab approve

Record human approval at a checkpoint (for example, replab approve --phase plan or replab approve --phase final) so the orchestrator can resume.

replab init \
  --paper ./feng-2024-concept-binding.pdf \
  --code https://github.com/feng/binding-ids \
  --scope "Replicate core binding ID mechanism (Sections 3-4)" \
  --compute-budget "50 GPU-hours" \
  --involvement "checkpoints-only"

Document generation

Iterative documents (replication plan, phase reports, reviews, tactical workplan) use Markdown rendered to PDF via Pandoc — fast to produce, easy to diff in git, easy for AI agents to write correctly. The final replication report uses LaTeX for archival quality, publication-ready figures, and proper bibliography management. This gives the “all documentation produces PDFs” property without forcing the RA to fight with LaTeX during iterative work.
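
For the iterative documents, the rendering step amounts to a standard Pandoc invocation; a minimal sketch with placeholder filenames:

# Sketch: render an iterative Markdown report to PDF with Pandoc (placeholder filenames).
import subprocess

def render_report(md_path: str, pdf_path: str) -> None:
    subprocess.run(
        ["pandoc", md_path, "-o", pdf_path, "--pdf-engine=pdflatex"],
        check=True,
    )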

State file

The .replication-state.toml file is the single source of truth for project state. It tracks the current phase and role, completed phases, pending SLURM jobs with their IDs and status, budget allocation and usage (tokens and GPU-hours), checkpoint configuration, and whether the project is awaiting human approval.

[project]
name = "feng-2024-concept-binding"
status = "executing"

[execution]
current_phase = "phase3-baseline"
current_role = "ra"
phases_completed = ["tactical_workplan", "phase1-setup", "phase2-data"]

[pending_jobs]
job_87654 = { phase = "phase3-baseline", status = "RUNNING",
              partition = "gpu_requeue" }

[budgets]
total_token_budget = 500000
tokens_used = 127000
total_compute_hours = 50.0
compute_hours_used = 15.0
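
A minimal sketch of reading the state file and checking budgets, using the standard-library tomllib (Python 3.11+) and the field names from the example above:

# Sketch: load .replication-state.toml and check remaining budget.
import tomllib

def load_state(path: str = ".replication-state.toml") -> dict:
    with open(path, "rb") as f:
        return tomllib.load(f)

def budget_exceeded(state: dict) -> bool:
    """True if either the token or GPU-hour budget has been exhausted."""
    b = state["budgets"]
    return (b["tokens_used"] >= b["total_token_budget"]
            or b["compute_hours_used"] >= b["total_compute_hours"])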

Four papers, four peer reviews

The Replication Laboratory has completed four end-to-end replications, each producing a structured final report and an independent peer review. The results span the full range of outcomes — from strong confirmation to significant non-replication — demonstrating that the system produces honest assessments rather than confirmatory exercises.

Grokking Progress Measures

Nanda et al. (2023), ICLR 2023. Modular addition grokking, Fourier multiplication algorithm, and progress measures. Substantially replicated. MLP projection FVE=0.987 (exceeds paper’s 0.93–0.98). The core mechanistic finding — that transformers learn a Fourier multiplication algorithm for modular addition — confirmed beyond reasonable doubt across 5 seeds.

Peer review: ACCEPT (B+). Minor quantitative deviations in WL rank-approximation residuals. Critical bug finding: Kaiming initialization prevents grokking entirely; the paper’s specific initialization is essential.

SCOTUS Prediction

Katz, Bommarito & Blackman (2017), PLOS ONE. Time-evolving random forest for Supreme Court prediction, 246,775 votes across 200 years. Partially replicated. Case-level accuracy 70.27% vs paper’s 70.2% (near-exact match). Justice-level accuracy 69.80% vs 71.9% (−2.1pp, outside tolerance).

Peer review: MAJOR REVISIONS (B−). Data pipeline rated A−. The gap is attributed to a simplified random forest (a fixed 100 trees rather than the paper’s growing warm-start approach); the growing forest is the paper’s core methodological contribution and was not tested in this replication.

Consensus Game

Jacob et al. (2024), ICLR 2024. Equilibrium-Ranking decoding on LLaMA-7B across five QA benchmarks. Partially replicated. ARC benchmarks confirmed (+4.6pp, +11.8pp improvement). MMLU headline result did not replicate (32.0% vs paper’s 39.9%). GSM8K showed zero contrastive signal.

Peer review: MAJOR REVISIONS (C+). Notable contribution: mathematical proof that the paper’s footnote-2 normalization cancels for MI and D rankings. ~43 GPU-hours.

Nash Equilibrium

Nash (1950), PNAS. Computational verification of the existence theorem for equilibrium points in finite games, across 116 game instances. Largely verified for 2-player and zero-sum games. N-player verification weaker (96.7%).

Peer review: MAJOR REVISIONS (B−). Strong 2-player results. Red flags: one failed n-player game with unsupported “relaxed search” claim; results compilation script contains hardcoded verdicts not derived from data.

Honest outcomes

Three of four replications received MAJOR REVISIONS from the independent reviewer. The system is not designed to produce confirmations — it is designed to produce accurate assessments. When results don’t replicate, or when the replication process itself has gaps, the Reviewer catches it. The Nash review, for instance, identified a likely fabricated claim about a “relaxed grid search” finding an approximate equilibrium with no supporting code or data.

Status and next steps

The v0.1 proof of concept is complete: four papers replicated end-to-end with full Advisor/RA/Reviewer pipelines, producing structured reports and independent peer reviews. The infrastructure — container system, CLI, orchestrator, three-role prompt suite — is validated and operational.

v0.1 — Proof of concept (complete)

Four replications completed across game theory, ML interpretability, NLP, and legal prediction. Three-role system validated. The independent reviewer phase caught genuine problems in all four.

v0.2 — Novel replications

Apply the lab to papers not yet replicated by others. Improved crash recovery, phase quality gates, enhanced live monitoring, and lessons-learned feedback into prompts. Target: April–May 2026.

v0.3 — The library

Public-facing components: web interface, community comments via GitHub Issues, progress dashboard, searchable catalog of replication attempts, RSS/Atom feed. Target: Summer 2026.

v0.4 — Interchange integration

Potential migration to Interchange coordination infrastructure for richer agent primitives, standardized observability, and cross-project analytics. Timeline depends on Interchange maturity.

The Replication Laboratory is simultaneously a tool that produces valuable replication attempts, a research platform for studying AI research agents, and a public good accumulating knowledge about what replicates and what doesn’t.