AI-assisted research replication at scale — three specialized agents collaborate through infrastructure-enforced separation to systematically reproduce published findings.
The Replication Laboratory orchestrates three AI roles — an Advisor (strategic oversight), a Research Assistant (tactical execution), and an independent Reviewer (skeptical peer review) — across ephemeral container sessions, producing archival-quality documentation as an operational side effect. Every decision, code change, and intermediate result is captured in a git history that is the project’s memory. Four paper replications have been completed to date.
The scientific community faces a well-documented replication crisis. Studies across psychology, medicine, economics, and machine learning have found that a substantial fraction of published findings cannot be reproduced by independent researchers. Most failures stem not from fraud but from undocumented implementation details, missing preprocessing steps, unreported hyperparameter choices, hardware-dependent numerical precision, and data access barriers.
Traditional academic incentives actively discourage replication work. Graduate students need novel publications, not confirmatory studies. Journals prefer exciting new results over careful replications. The result: published claims accumulate without systematic verification.
AI-assisted replication changes the equation in several ways:
Compressing a six-month replication to a few days changes the cost-benefit calculation. What was career poison becomes tractable background work.
Every decision, intermediate result, and code change is captured in git history and structured reports. Failed replications are as valuable as successful ones.
The strategic/tactical split mirrors how humans think about replication. “Should we replicate this claim?” is a different question from “How do we run this code on our cluster?”
A library of structured replication attempts with full provenance — including failures — is genuinely valuable to science.
The key architectural decision: every interaction between roles occurs in a fresh Docker container with fresh context. This enforces genuine cognitive separation — the Advisor reviewing a phase report truly does not have access to the RA’s context from producing it. This isn’t enforced by prompt engineering (“pretend you don’t know”) but by infrastructure — the context literally does not exist in the new container.
This solves multiple problems simultaneously:
The Advisor reviewing a phase report truly does not have the RA’s execution context. There is no context to leak because it doesn’t exist in the new container.
Each container runs Claude Code with full permissions, safely, because the container is jailed with controlled volume mounts and network access. The agent can do whatever it needs within its sandbox.
The git diff between container start and container death is a complete record of what that session did. Nothing leaks between sessions except what is explicitly committed.
Container lifecycle is phase lifecycle. No need for complex in-session state management or hand-wavy “pretend you’re a new agent” prompting.
The project’s git repository is the only persistent state. It serves as long-term memory (each session bootstraps by reading the state file and recent reports), communication channel (the Advisor and RA “talk” by writing documents and committing them), audit trail (every commit is tagged with role, phase, and timestamp), and handoff mechanism (the state file records who has the baton).
a8f3d2c [advisor] Final replication report approved
e7c4b1a [advisor] Synthesis: Final report generated
9b2e5f3 [ra] Phase 4: Comparison analysis complete
7d1c8a4 [advisor] Phase 4 review: APPROVED
5e9f2b1 [ra] Phase 3: Baseline model training complete
3a7d4c2 [advisor] Phase 3 review: APPROVED - results within tolerance
1f8e6b3 [ra] Phase 2: Data preprocessing complete
4c2a9d1 [advisor] Phase 2 review: APPROVED
8e5b3f2 [ra] Phase 1: Environment setup complete
2d7f1c4 [advisor] Tactical workplan created
6b9e4a3 [human] Plan approved: proceeding with execution
9c1d8f5 [advisor] Replication plan created
7a4e2b1 [init] Project initialized: feng-2024-concept-binding
The system prompts are the highest-leverage component — carefully designed instructions that encode all architectural discipline and behavioral patterns. Each role has a distinct persona, responsibilities, decision authority, and workflow. Critically, each runs in its own ephemeral container — the separation is not a prompt instruction but a hard infrastructure boundary.
Strategic oversight. Plans replications, reviews phase reports, enforces scope discipline, makes calls on scientific interpretation, and synthesizes the final report. Never executes code — reads reports, diffs, and the state file. Its prompt states: “If it’s not in the phase report, you don’t know it happened.”
Tactical execution. Writes code, runs experiments, submits SLURM jobs, collects results, and documents everything in phase reports. Decides implementation details but escalates scope questions to the Advisor. Up to 3 retry attempts on failures before escalating.
Independent peer review. Sees the full body of work for the first time after the Advisor/RA team finishes. Reads the paper, examines code and results, cross-references claims against evidence, and writes a structured peer review with a verdict: ACCEPT, MAJOR REVISIONS, or REJECT.
The Advisor’s central job is preventing scope creep. Common failure modes: the RA finds an interesting discrepancy and wants to investigate 17 variations; the RA notices additional claims beyond stated scope; the RA gets distracted by tooling improvements. The Advisor catches these in review and redirects focus, logging tangential findings in an appendix for future work rather than pursuing them.
The human participates asynchronously by dropping comment files into comments/ at any time. At each phase boundary, the Advisor checks for new comments and incorporates relevant ones. The human uses the same file-mediated protocol as the agents — a third voice in the conversation, not a supervisor peering over the agent's shoulder.
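A sketch of the Advisor-side comment check, assuming Markdown comment files; the function name and the processed-set bookkeeping are hypothetical (the real system could persist that list in the state file):

```python
from pathlib import Path

def new_comments(project_dir: str, processed: set[str]) -> list[Path]:
    """Return comment files the Advisor has not yet incorporated.

    `processed` holds filenames handled at earlier phase boundaries
    (hypothetical bookkeeping, not the shipped implementation).
    """
    comments_dir = Path(project_dir) / "comments"
    if not comments_dir.is_dir():
        return []
    # Sorted so comments are incorporated in a deterministic order.
    return sorted(p for p in comments_dir.glob("*.md") if p.name not in processed)
```

At each phase boundary, the Advisor session would call this once, read the new files, and record them as processed before committing.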
A replication project moves through a well-defined state machine. Each transition is a container session; each session reads the current state, does its work, commits results, and updates the state file for the next session.
replab init --paper ./paper.pdf --scope "..."
replab approve --phase plan
replab approve --phase final

The orchestrator enforces configurable human checkpoints — phase transitions where execution pauses for approval. The default for v0.1: plan approval and final report review require human oversight; all intermediate phase transitions (RA → Advisor review, Advisor → RA execution) are autonomous.
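The checkpoint logic can be sketched as a small function over the phase sequence. The phase names echo the commit history shown earlier, but the exact list and the function itself are illustrative, not the shipped implementation:

```python
# Phase sequence; names mirror the commit log shown earlier (illustrative).
PHASES = [
    "plan", "tactical_workplan",
    "phase1-setup", "phase2-data", "phase3-baseline", "phase4-comparison",
    "synthesis", "final",
]

# v0.1 default: only plan approval and final-report review involve a human.
HUMAN_CHECKPOINTS = {"plan", "final"}

def advance(completed_phase: str) -> dict:
    """Decide what happens after a phase completes: the next phase,
    and whether the orchestrator must pause for human approval."""
    i = PHASES.index(completed_phase)
    nxt = PHASES[i + 1] if i + 1 < len(PHASES) else None
    return {"next_phase": nxt,
            "awaiting_human": completed_phase in HUMAN_CHECKPOINTS}
```

Everything not in HUMAN_CHECKPOINTS proceeds autonomously, which is what makes the RA → Advisor → RA loop run unattended between the two human gates.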
Research replication often requires long-running GPU jobs. The RA never waits for SLURM jobs — it uses a fire-and-forget pattern: submit the batch script via SSH to the headnode, record the job ID in the state file, commit, and exit. The orchestrator polls sacct every 60 seconds. When the job completes (or fails), it launches a new RA session with that information. If the job fails, the RA decides how to handle it: resubmit with different resources, debug, or escalate.
Because all state lives in .replication-state.toml, the orchestrator can be killed and restarted without losing progress. Its SLURM watcher is about 50 lines. It doesn't need to be clever — just reliable.
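A watcher of roughly that shape might look like the following; the sacct field list and the helper names are assumptions, not the actual code:

```python
import subprocess
import time

# States in which the job is still in the queue or running.
ACTIVE_STATES = ("PENDING", "RUNNING", "REQUEUED", "SUSPENDED")

def parse_state(sacct_output: str) -> str:
    """First whitespace-separated token of sacct's State column."""
    toks = sacct_output.split()
    return toks[0] if toks else "UNKNOWN"

def sacct_state(job_id: str) -> str:
    """Ask SLURM for the job's current state (-X: allocation only)."""
    out = subprocess.run(
        ["sacct", "-j", job_id, "--format=State", "--noheader", "-X"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_state(out)

def watch(job_id: str, poll_seconds: int = 60) -> str:
    """Poll until the job leaves the queue; return its final state."""
    while True:
        state = sacct_state(job_id)
        if state not in ACTIVE_STATES:
            return state  # orchestrator launches a fresh RA session with this
        time.sleep(poll_seconds)
```

The returned terminal state (COMPLETED, FAILED, TIMEOUT, ...) is exactly the information the next RA session bootstraps with.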
Ubuntu 24.04 LTS base with Claude Code, Pandoc + LaTeX toolchain (texlive, pdflatex), the Python scientific stack (numpy, pandas, scipy, matplotlib, scikit-learn, PyTorch, transformers), and standard build tools. The build script detects the available container runtime and supports Docker, Podman, and Singularity/Apptainer for HPC environments.
At runtime, the container gets three volume mounts: the project repository (read-write), system prompts (read-only), and SSH keys for SLURM access (read-only). It can read and write project files, SSH to the SLURM headnode, and access the internet for package downloads. It cannot modify files outside the workspace, access the host filesystem, or persist state between sessions.
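The three mounts could be assembled like this; the image name, container paths, and helper function are hypothetical — only the read-only/read-write split and the ephemeral --rm behavior come from the description above:

```python
def docker_run_args(project_dir: str, prompts_dir: str, ssh_dir: str,
                    image: str = "replab:latest") -> list[str]:
    """Build a docker run invocation for one ephemeral session.

    Three mounts as described: project repo read-write, system prompts
    read-only, SSH keys read-only. --rm guarantees nothing persists
    after the session exits.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{project_dir}:/workspace:rw",
        "-v", f"{prompts_dir}:/prompts:ro",
        "-v", f"{ssh_dir}:/root/.ssh:ro",
        "-w", "/workspace",
        image,
    ]
```

A Podman or Apptainer backend would build a different argument list from the same three paths, which is presumably where the runtime-detection in the build script pays off.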
The human interface is a Python dispatcher with four commands:

Initialize a project: create the directory structure, copy the paper, set up the git repo and state file.

Start the orchestrator loop. Can run to the next checkpoint, as a background daemon, or with a session limit.

Display current phase, budget usage, pending SLURM jobs, and activity timeline. Supports live-updating watch mode.

Record human approval at a checkpoint (replab approve --phase plan, replab approve --phase final), unblocking the orchestrator.
replab init \
--paper ./feng-2024-concept-binding.pdf \
--code https://github.com/feng/binding-ids \
--scope "Replicate core binding ID mechanism (Sections 3-4)" \
--compute-budget "50 GPU-hours" \
--involvement "checkpoints-only"
Iterative documents (replication plan, phase reports, reviews, tactical workplan) use Markdown rendered to PDF via Pandoc — fast to produce, easy to diff in git, easy for AI agents to write correctly. The final replication report uses LaTeX for archival quality, publication-ready figures, and proper bibliography management. This gives the “all documentation produces PDFs” property without forcing the RA to fight with LaTeX during iterative work.
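A minimal sketch of the two document paths; the command builders, output names, and the pdflatex engine choice are assumptions beyond what the text states:

```python
def pandoc_cmd(md_path: str, pdf_path: str) -> list[str]:
    """Markdown -> PDF for iterative documents (phase reports, reviews)."""
    return ["pandoc", md_path, "-o", pdf_path, "--pdf-engine=pdflatex"]

def latex_cmd(tex_path: str) -> list[str]:
    """LaTeX compile for the archival final report."""
    return ["pdflatex", "-interaction=nonstopmode", tex_path]
```

Keeping both behind tiny builders like these means the RA only ever writes Markdown during iteration, and the LaTeX toolchain is touched once, at synthesis time.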
The .replication-state.toml file is the single source of
truth for project state. It tracks the current phase and role, completed
phases, pending SLURM jobs with their IDs and status, budget allocation
and usage (tokens and GPU-hours), checkpoint configuration, and whether
the project is awaiting human approval.
[project]
name = "feng-2024-concept-binding"
status = "executing"
[execution]
current_phase = "phase3-baseline"
current_role = "ra"
phases_completed = ["tactical_workplan", "phase1-setup", "phase2-data"]
[pending_jobs]
job_87654 = { phase = "phase3-baseline", status = "RUNNING", partition = "gpu_requeue" }
[budgets]
total_token_budget = 500000
tokens_used = 127000
total_compute_hours = 50.0
compute_hours_used = 15.0
The Replication Laboratory has completed four end-to-end replications, each producing a structured final report and an independent peer review. The results span the full range of outcomes — from strong confirmation to significant non-replication — demonstrating that the system produces honest assessments rather than confirmatory exercises.
Nanda et al. (2023), ICLR 2023. Modular addition grokking, Fourier multiplication algorithm, and progress measures. Substantially replicated. MLP projection FVE=0.987 (exceeds paper’s 0.93–0.98). The core mechanistic finding — that transformers learn a Fourier multiplication algorithm for modular addition — confirmed beyond reasonable doubt across 5 seeds.
Peer review: ACCEPT (B+). Minor quantitative deviations in WL rank-approximation residuals. Critical bug finding: Kaiming initialization prevents grokking entirely; the paper’s specific initialization is essential.
Katz, Bommarito & Blackman (2017), PLOS ONE. Time-evolving random forest for Supreme Court prediction, 246,775 votes across 200 years. Partially replicated. Case-level accuracy 70.27% vs paper’s 70.2% (near-exact match). Justice-level accuracy 69.80% vs 71.9% (−2.1pp, outside tolerance).
Peer review: MAJOR REVISIONS (B−). Data pipeline rated A−. Gap attributed to simplified random forest (fixed 100 trees vs paper’s growing warm-start approach, the core methodological contribution, which was not tested).
Jacob et al. (2024), ICLR 2024. Equilibrium-Ranking decoding on LLaMA-7B across five QA benchmarks. Partially replicated. ARC benchmarks confirmed (+4.6pp, +11.8pp improvement). MMLU headline result did not replicate (32.0% vs paper’s 39.9%). GSM8K showed zero contrastive signal.
Peer review: MAJOR REVISIONS (C+). Notable contribution: mathematical proof that the paper’s footnote-2 normalization cancels for MI and D rankings. ~43 GPU-hours.
Nash (1950), PNAS. Computational verification of the existence theorem for equilibrium points in finite games, across 116 game instances. Largely verified for 2-player and zero-sum games. N-player verification weaker (96.7%).
Peer review: MAJOR REVISIONS (B−). Strong 2-player results. Red flags: one failed n-player game with unsupported “relaxed search” claim; results compilation script contains hardcoded verdicts not derived from data.
The v0.1 proof of concept is complete: four papers replicated end-to-end with full Advisor/RA/Reviewer pipelines, producing structured reports and independent peer reviews. The infrastructure — container system, CLI, orchestrator, three-role prompt suite — is validated and operational.
Four replications completed across game theory, ML interpretability, NLP, and legal prediction. Three-role system validated. Independent reviewer phase catches genuine problems in all four.
Apply the lab to papers not yet replicated by others. Improved crash recovery, phase quality gates, enhanced live monitoring, and lessons-learned feedback into prompts. Target: April–May 2026.
Public-facing components: web interface, community comments via GitHub Issues, progress dashboard, searchable catalog of replication attempts, RSS/Atom feed. Target: Summer 2026.
Potential migration to Interchange coordination infrastructure for richer agent primitives, standardized observability, and cross-project analytics. Timeline depends on Interchange maturity.