cooperate.social

Panopticon

Steerable, observable inference for open-weight language models

An instrumented inference server that makes the forward pass transparent and controllable. Steer model behavior with learned activation vectors. Observe internal representations in real time. Measure alignment at every token.

Why build this

Language models are opaque by design. Standard inference APIs return text — and nothing else. The internal representations that determine what the model says and how it says it remain hidden behind billions of parameters.

Panopticon treats the forward pass as a sequence of observable, interceptable stages. Researchers specify what to observe, how to intervene, and what conditions should halt generation — all via a JSON protocol. The control loop runs server-side at generation speed; clients receive structured telemetry.

The name references Jeremy Bentham's 1791 architectural concept: a structure designed so that a central observer can inspect all occupants simultaneously. We apply the same principle to AI systems — comprehensive inspection of activations, attention patterns, and behavioral alignment at every step of generation.

The goal is infrastructure for oversight, not just inference. Every token produced carries a full telemetry record: log-probabilities, entropy, probe scores, steering modulation, and hidden-state norms across all layers.

Steer

Inject learned activation vectors at specific transformer layers to shift model behavior along interpretable dimensions — formality, creativity, sycophancy, and more.

Observe

Capture hidden-state norms, attention entropy, and probe scores at every token. Stream telemetry to clients in real time for live visualization.

Control

Adaptive steering uses a closed-loop feedback controller that reads alignment probes and modulates injection strength — preventing coherence breakdown while maintaining behavioral shift.

Three-layer design

Panopticon is built as an inference engine that runs a manual autoregressive loop, an instrumentation plane that installs hooks into the transformer, and a transport layer that delivers results over stdio or HTTP.

Client (JSON) ──────────▶ ┌──────────────────────┐
                          │ Transport Layer      │
                          │ stdio (JSONL) or     │
                          │ HTTP (SSE streaming) │
                          └──────────┬───────────┘
                                     │
                          ┌──────────▼───────────┐
                          │ Instrumentation      │
                          │ ┌─ Steering hooks    │
                          │ ├─ Probe hooks       │
                          │ ├─ Hidden capture    │
                          │ └─ Attention capture │
                          └──────────┬───────────┘
                                     │
                          ┌──────────▼───────────┐
                          │ Inference Engine     │
                          │ Manual AR loop       │
                          │ Token-level access   │
                          │ DynamicCache KV      │
                          └──────────────────────┘

Inference Engine

Unlike standard inference libraries, Panopticon does not call model.generate(). Instead, it runs a manual autoregressive loop: one forward pass per token, with direct access to pre-sampling logits, hidden states, and attention weights at every step.

This gives the instrumentation layer complete control. Each forward pass populates a KV cache, and sampling parameters (temperature, top-k, top-p) are applied explicitly.
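The control flow can be sketched with a toy stand-in for the forward pass. Here `toy_forward` is a fabricated function (the real engine calls the transformer and maintains a DynamicCache); what the sketch shows is the loop structure itself: one forward pass per token, pre-sampling logits exposed at every step, and sampling parameters applied explicitly.

```python
import numpy as np

def toy_forward(token_id, vocab=8):
    # Fabricated stand-in for the transformer forward pass: logits that
    # strongly favor (token_id + 1), so the loop's behavior is predictable.
    logits = np.full(vocab, -5.0)
    logits[(token_id + 1) % vocab] = 5.0
    return logits

def sample(logits, rng, temperature=0.7, top_k=2):
    # Sampling parameters are applied explicitly, not inside a generate() call.
    logits = logits / temperature
    cutoff = np.sort(logits)[-top_k]
    logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def generate(prompt_token, max_new_tokens=5, seed=0):
    rng = np.random.default_rng(seed)
    tokens = [prompt_token]
    for _ in range(max_new_tokens):
        logits = toy_forward(tokens[-1])   # one forward pass per token
        # <- pre-sampling logits (and, in the real engine, hidden states
        #    and attention weights) are visible here, before sampling
        tokens.append(sample(logits, rng))
    return tokens
```

Everything a standard generate() hides — logit interception, per-token hooks, early halting — hangs off the marked line inside the loop.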

Instrumentation Plane

PyTorch forward hooks are installed on specific decoder layers. Hook ordering is critical: steering hooks register first, observation hooks second. This guarantees that telemetry captures post-intervention activations.

All state is per-request. Each generation call creates its own steering and hook managers, ensuring complete isolation between concurrent requests.
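The ordering guarantee can be demonstrated with plain PyTorch forward hooks; the tiny `nn.Linear` below stands in for a decoder layer. A forward hook that returns a tensor replaces the layer's output for later hooks, so an observation hook registered second sees post-steering activations.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4, bias=False)   # stand-in for a decoder layer
captured = {}

def steering_hook(module, inputs, output):
    # Registered first: returning a tensor replaces the layer's output
    # for subsequent hooks and for the rest of the forward pass.
    return output + 1.0

def observation_hook(module, inputs, output):
    # Registered second, so it records the post-intervention activations.
    captured["h"] = output.detach().clone()

h1 = layer.register_forward_hook(steering_hook)
h2 = layer.register_forward_hook(observation_hook)

y = layer(torch.zeros(1, 4))          # raw output is zero; hook adds 1.0
h1.remove(); h2.remove()              # per-request cleanup for isolation
```

Registering in the opposite order would silently record pre-steering activations — the telemetry would no longer describe what the model actually computed.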

Design decision The manual autoregressive loop adds implementation complexity but is non-negotiable: it is the only way to intercept every token's hidden states, apply per-token steering feedback, and measure alignment before the next token is generated.

Learned directions in activation space

Steering vectors are directions in a model's hidden-state space that correspond to interpretable behavioral dimensions. They are extracted via Contrastive Activation Addition (CAA) and applied at runtime by perturbing activations at a chosen transformer layer.

Extraction pipeline

  1. Contrastive generation

    For a target concept (e.g. formality), the model is run on diverse prompts with polar system messages: one eliciting maximally formal responses, the other maximally informal. Hidden states are captured at the extraction layer.

  2. Direction computation

    The mean difference between positive and negative activations yields a direction vector. This is unit-normalized and saved as the concept's steering vector.

  3. Two-vector system

    Each concept produces a probe vector (for measurement, typically at a downstream layer where the concept is most separable) and an inject vector (for intervention, at an earlier layer where perturbations propagate most effectively). Decoupling these enables better steering fidelity.

Injection modes

Constant mode

Constant injection

  hₗ ← hₗ + α · ‖hₗ‖ · v̂

A fixed perturbation is added to the hidden state at the injection layer on every token after the prompt. The magnitude scales with the residual stream norm, keeping the intervention proportional to the model's own activation scale. Simple and predictable, but can overwhelm the model at high alpha values.
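The constant-mode update is one line; a minimal sketch in numpy (the real hook operates on a torch tensor inside the forward pass):

```python
import numpy as np

def constant_inject(h, v_hat, alpha):
    # Fixed perturbation, scaled by the residual-stream norm ‖h‖ so the
    # intervention stays proportional to the model's own activation scale.
    return h + alpha * np.linalg.norm(h) * v_hat
```

For example, with h = [3, 4] (norm 5), a unit vector along the first axis, and α = 0.2, the perturbation has magnitude 1.0 regardless of the model's absolute activation scale.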

Adaptive mode

Probe-gated injection

  deficit = max(0, threshold − probe) / sensitivity
  modulation = clamp(1 − gain · deficit, m_min, 3.0)
  hₗ ← hₗ + α · ‖hₗ‖ · modulation · v̂

A closed-loop controller. A probe at a downstream layer measures alignment via cosine similarity. When the model already exhibits the target behavior, steering backs off. When it drifts, steering increases. An EMA filter prevents oscillation. This allows much higher alpha values without coherence breakdown.
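A sketch of the controller step, implementing the formulas above as written; the parameter names (`gain`, `m_min`) follow the equations, and the EMA coefficient is an assumed value, not the calibrated one:

```python
import numpy as np

def ema(prev, value, beta=0.9):
    # Smooth probe readings across tokens to prevent controller oscillation.
    return beta * prev + (1 - beta) * value

def modulation(probe, threshold, sensitivity, gain, m_min):
    # Deficit-gated feedback coefficient, per the formula above.
    deficit = max(0.0, threshold - probe) / sensitivity
    return float(np.clip(1.0 - gain * deficit, m_min, 3.0))

def adaptive_inject(h, v_hat, alpha, mod):
    # Same norm-scaled perturbation as constant mode, times the feedback term.
    return h + alpha * np.linalg.norm(h) * mod * v_hat
```

When the probe reading meets the threshold, the deficit vanishes and the modulation sits at 1.0; large deficits drive the modulation toward the floor m_min.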

Available concepts

Vectors are model-specific. The current library includes 24 calibrated concepts for Qwen 2.5 14B:

agreeableness
analytical thinking
assertiveness
competitiveness
confidence
cooperation
creativity
curiosity
directness
empathy
evil
formality
humor
idealism
independence
nihilism
optimism
power seeking
pragmatism
refusal
risk aversion
skepticism
sycophancy
verbosity

New concepts can be extracted from any open-weight model in minutes using contrastive prompt pairs. The calibration pipeline automatically discovers optimal layers, alpha ranges, and quality grades.

Per-token telemetry

Every token produced by Panopticon carries a full telemetry record. Observation is not an afterthought — it is the primary design goal. The inference engine exists to make the forward pass legible.

Token-level metrics

Log-probability — confidence in the sampled token.

Entropy — Shannon entropy over the full vocabulary. High entropy signals uncertainty; low entropy signals confident or repetitive generation.

Top-k alternatives — the tokens the model nearly chose, with their probabilities.

Probe score — cosine similarity between the hidden state and the steering vector at the probe layer. A direct measure of behavioral alignment.

Modulation — the feedback coefficient from the adaptive controller. Reveals when and why steering intensifies or relaxes.

Effective alpha — actual perturbation magnitude after norm scaling and modulation.
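The first three metrics fall directly out of the pre-sampling logits. A minimal numpy sketch (the dict keys are illustrative, not Panopticon's exact telemetry schema):

```python
import numpy as np

def token_telemetry(logits, sampled_id, k=3):
    # Softmax over the full vocabulary.
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    # Shannon entropy in nats; epsilon guards log(0).
    entropy = float(-(probs * np.log(probs + 1e-12)).sum())
    top = np.argsort(probs)[::-1][:k]
    return {
        "logprob": float(np.log(probs[sampled_id])),   # confidence in sampled token
        "entropy": entropy,                            # uncertainty over vocabulary
        "top_k": [(int(i), float(probs[i])) for i in top],
    }
```

Uniform logits over a vocabulary of size V give entropy log V and log-probability log(1/V) — a useful sanity check for the instrumentation.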

Passive probes

Panopticon can measure alignment with multiple concepts simultaneously, even when steering only one (or none). Passive probes stack their direction vectors into a matrix and compute all cosine similarities in a single batched operation per token:

Passive probe (batched)

  scores = M · ĥ,  where M ∈ ℝ^(k×d) stacks the k probe vectors

This enables live dashboards showing how a model's behavioral profile shifts across dimensions in real time — watching sycophancy, formality, and creativity respond simultaneously as a single steering vector is applied.
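The batched computation is a single matrix–vector product per token; a minimal sketch:

```python
import numpy as np

def passive_probe_scores(h, probe_matrix):
    # probe_matrix: (k, d) stack of unit probe vectors; h: (d,) hidden state.
    # One matmul yields the cosine similarity with all k concepts at once.
    h_hat = h / np.linalg.norm(h)
    return probe_matrix @ h_hat

# With orthonormal probes, scores recover the unit hidden state's components.
M = np.eye(3, 4)                        # 3 probe vectors in a 4-dim space
scores = passive_probe_scores(np.array([3.0, 0.0, 4.0, 0.0]), M)
```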

Hidden-state norms

L2 norms of the hidden state at every layer, at every token. Captures the activation magnitude trajectory through the model — useful for detecting when steering perturbations are proportionate to the model's own scale.

Output format All metrics are written as JSONL (one object per token) for offline analysis, and streamed as Server-Sent Events for real-time visualization. A typical 512-token generation produces approximately 1.6 MB of structured telemetry.

Psychometric profiling

Extracting a steering vector is straightforward. Knowing how to use it is not. Which transformer layer should it be injected at? Which layer gives the most reliable probe readings? At what intensity does the steering become perceptible? At what intensity does it cause incoherence? How does the concept interact with the model's existing behavioral tendencies?

These questions cannot be answered from the vector alone. They require a systematic characterization of the model's response to each concept across its full layer depth and steering intensity range. Panopticon's calibration pipeline performs this characterization automatically, generating a complete psychometric profile for every concept in the vector library. The full pipeline has been run for Qwen 2.5 at 7B, 14B, and 72B parameter scales.

Calibration is the most computationally expensive step in the pipeline — and the most important. Without it, steering is guesswork. With it, every concept comes with validated operating parameters, a quality grade, and a measured therapeutic window.

The calibration pipeline has two major stages: an intrinsic stage that characterizes activations inside the network, and an extrinsic stage that measures whether steering produces a behaviorally detectable effect in the model's output. The distinction is critical: a concept can show strong separability in the activation space (high d′) while producing no perceptible behavioral change in generated text. Only the extrinsic evaluation — the psychometric sweep — answers the question that matters.

Intrinsic calibration

Behavioral baseline

The pipeline begins by establishing how the model behaves without any intervention. Twenty-four diverse prompts — spanning opinions, advice, explanation, creative writing, and social interaction — are run with a neutral system message. At every decoder layer, the pipeline records hidden-state L2 norms and cosine similarities with each concept's probe vector.

This produces a baseline behavioral fingerprint: how much each concept is naturally present in unsteered generation, and how activation magnitudes grow through the model's layers. The per-layer norm statistics also yield a norm scaling factor that ensures steering perturbations remain proportional to the model's own activation scale — so that α = 1.0 corresponds to one standard deviation of natural variation, regardless of model size. Typical activation norms range from ~60–70 for 7B models to ~200–350 for 72B, so this normalization is essential for consistent cross-model behavior.

Dual norm scale Calibration produces two separate scaling factors. The probe scale (norm_scale_factor) is optimized for internal separability — it produces perturbations that probes detect with high d′. But these perturbations are typically only ~5% of the hidden-state norm: invisible to external observers. The behavioral steer scale (norm_scale_steer) is 6× larger, producing ~32% perturbation — enough for clear behavioral change without incoherence. This 6× multiplier was determined empirically through the psychometric sweeps described below, by comparing probe detection thresholds to human/LLM judge detection thresholds. At runtime, injection hooks use the steer scale while probes use pure cosine similarity with no norm scaling.

Layer separability sweep

For each concept, the same 24 prompts are run twice more: once with a system message maximizing the concept (“Respond in a way that is extremely [concept]. Exaggerate this quality as much as possible.”) and once with the opposite (“Respond in a way that is the complete opposite of [concept]. Avoid this quality entirely.”). Hidden states are captured at every layer.

At each layer, the pipeline computes two measures of how well positive and negative activations can be distinguished:

Fisher discriminant (d′)

Activations are projected onto the mean-difference direction between positive and negative samples. The Fisher discriminant measures the separation between the two projected distributions, normalized by their pooled variance. A d′ > 2.0 indicates strong separability.

Cross-validated accuracy

Six-fold cross-validation with a logistic classifier on the projected axis. Provides a robust, unbiased estimate of how reliably the layer can distinguish the concept's presence from its absence. An accuracy above 0.85 confirms strong signal.
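The Fisher discriminant step can be sketched as follows; synthetic activations stand in for the captured hidden states, and the cross-validated classifier is omitted for brevity:

```python
import numpy as np

def fisher_dprime(pos, neg):
    # Project onto the mean-difference direction, then measure the separation
    # of the two projected distributions in pooled-standard-deviation units.
    direction = pos.mean(axis=0) - neg.mean(axis=0)
    direction /= np.linalg.norm(direction)
    p, n = pos @ direction, neg @ direction
    pooled = np.sqrt((p.var(ddof=1) + n.var(ddof=1)) / 2)
    return float(abs(p.mean() - n.mean()) / pooled)

# Synthetic stand-ins for hidden states under polar system prompts.
rng = np.random.default_rng(0)
pos = rng.normal(+1.5, 1.0, size=(24, 16))
neg = rng.normal(-1.5, 1.0, size=(24, 16))
d_prime = fisher_dprime(pos, neg)   # well separated, so d' comes out > 2.0
```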

The sweep produces a per-layer separability curve for each concept: d′ and classification accuracy plotted across all decoder layers. These curves reveal where in the network each behavioral concept emerges and becomes measurable. The optimal probe layer is the peak of the d′ curve. The optimal inject layer is chosen earlier in the network (with a minimum gap of two layers), where perturbations propagate forward most effectively through the remaining layers before reaching the probe.

Why per-concept layers matter The optimal layers vary significantly across concepts and models. For Qwen 2.5 14B, curiosity peaks at layer 40 (near the end) while empathy peaks at layer 24 (just past the midpoint). Using a single hardcoded layer for all concepts — as most steering implementations do — leaves substantial performance on the table.
Known limitation: observability ≠ controllability The current calibration selects injection layers by maximizing probe d′. This optimizes for where concepts are most readable, not where injection is most effective. In 72B models, d′ peaks at 60–90% depth, leaving too few downstream layers for the perturbation to propagate into logits. Empirically, injecting at 44–56% depth in 72B produces much stronger behavioral effects — verbosity, for example, goes from unsteerable at the d′-optimal layer to reliably controllable at the empirically optimal one. This is a known limitation of the v1 intrinsic approach; the extrinsic psychometric sweep captures the discrepancy, and a future iteration may use behavioral judges directly for layer selection.

Quality grading and orthogonality

Each concept receives a quality grade based on its peak separability: strong (d′ > 2.0 or accuracy > 0.85), moderate (d′ > 1.0 or accuracy > 0.70), or weak. A cross-concept orthogonality matrix measures the cosine similarity between every pair of probe vectors. High overlap (cosine > 0.6) warns that two concepts may interfere when steered simultaneously — for example, agreeableness and cooperation show a cosine of 0.86 in Qwen 2.5, meaning steering one will significantly affect the other.

Extrinsic calibration: the psychometric sweep

Intrinsic metrics measure what happens inside the network. But the question that matters for steering is different: can a reader tell? A concept can show a large d′ in the activation space while producing no perceptible change in the model's output. Conversely, a moderate activation shift might produce a dramatic behavioral effect. The only credible evaluation of steering is to test whether its effects are detectable in the generated text.

Separability is not steerability. The psychometric sweep measures what actually matters: the probability that an observer can detect the behavioral shift at a given steering intensity, without encountering incoherence.

Blind paired comparison

For each concept at each alpha value, the pipeline generates paired responses: one with steering applied, one without. These are presented to a judge in randomized order (A/B), with no indication of which is steered. The judge makes two assessments:

Coherence check

Is either response incoherent, garbled, or not meaningful text? The judge reports BOTH_OK, A_BAD, B_BAD, or BOTH_BAD. If the steered response is incoherent, the trial is scored as a degradation event regardless of any behavioral shift.

Steering detection

If both responses are coherent, which one demonstrates more of the target concept? For negative alphas (reverse steering), the question is inverted: which demonstrates less. The judge may also report UNSURE if neither response is distinguishable.

The A/B ordering is deterministically randomized per trial (seeded by a hash of the concept, alpha, and trial number) to prevent any positional bias. This protocol was validated by comparing three independent judges (two local models and Claude Sonnet), achieving 79–86% inter-judge agreement on definite verdicts.
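One way to implement the seeded per-trial ordering — the hash recipe below is illustrative, since the source says only that the seed hashes the concept, alpha, and trial number:

```python
import hashlib

def ab_order(concept, alpha, trial):
    # Deterministic randomization: hash the trial condition so the ordering
    # is reproducible across reruns but carries no positional bias.
    key = f"{concept}:{alpha}:{trial}".encode()
    steered_first = hashlib.sha256(key).digest()[0] % 2 == 0
    return ("steered", "baseline") if steered_first else ("baseline", "steered")

# Across many trials the steered response lands in slot A about half the time.
firsts = [ab_order("formality", 1.5, t)[0] for t in range(200)]
```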

Psychometric curves

Aggregating across trials at each alpha produces two curves per concept:

Detection probability  P_eff(α) = fraction of coherent trials where the judge correctly identified the steered response

A score of 0.5 is chance (the judge is guessing). Scores above 0.75 indicate reliable detection — the steering is producing a clearly perceptible behavioral shift.

Degradation probability  P_deg(α) = fraction of trials where the steered response was judged incoherent

Rises sharply above a concept-specific threshold. Below that threshold, the model remains coherent; above it, steering has overwhelmed the model's language capacity.

The intersection of these two curves defines the therapeutic window: the alpha range where P_eff > 0.75 (steering is reliably detectable) and P_deg < 0.05 (coherence is preserved). This window is the primary output of calibration and directly determines the recommended alpha for each concept.
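Extracting the window from the two curves is a simple filter over the sampled alpha grid; a sketch using the thresholds quoted above:

```python
def therapeutic_window(alphas, p_eff, p_deg, eff_min=0.75, deg_max=0.05):
    # Keep alphas where steering is reliably detectable and coherence holds,
    # then report the range they span (None if the window is empty).
    ok = [a for a, e, d in zip(alphas, p_eff, p_deg) if e > eff_min and d < deg_max]
    return (min(ok), max(ok)) if ok else None

# Toy curves: detection rises with alpha, degradation kicks in above 1.5.
alphas = [0.5, 1.0, 1.5, 2.0, 2.5]
p_eff  = [0.55, 0.80, 0.90, 0.95, 0.97]
p_deg  = [0.00, 0.00, 0.02, 0.10, 0.40]
window = therapeutic_window(alphas, p_eff, p_deg)
```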

Scale of the evaluation

The full production sweep covers all 24 concepts across 15 alpha values (including negative alphas for reverse steering and zero as a null control), with 20 trials per condition. At each of three model scales — 7B, 14B, and 72B parameters — this requires 7,200 response pairs (14,400 generations) and 7,200 judge evaluations per model. The full calibration across all three models involves over 43,000 generations and over 21,000 judge evaluations.

The evaluation is sharded across GPU nodes via SLURM, with a coordinator managing generation and judging jobs in parallel. Results are cached per phase with checkpoint/resume support, since the full sweep requires substantial GPU hours.

Cross-model calibration

The full calibration pipeline has been completed for three Qwen 2.5 model scales, revealing how steering characteristics change with model size:

Model                       Layers   Hidden dim   Probe scale   Steer scale   Concepts   Intrinsic time
Qwen 2.5‑7B‑Instruct        28       3,584        0.0099        0.0595        24         ~12 min
Qwen 2.5‑14B‑Instruct‑AWQ   48       5,120        0.0067        0.0403        24         ~24 min
Qwen 2.5‑72B‑Instruct‑AWQ   80       8,192        ~0.001        ~0.007        24         ~50 min

Intrinsic calibration times on A100-80GB. Larger models have smaller norm scale values because their activation norms are proportionally larger — the scaling ensures comparable effective perturbation magnitudes across model sizes. The extrinsic psychometric sweep adds substantially more compute on top of these figures, depending on the number of alpha values and trials per condition.

Checkpointing and preemption resilience

The full calibration pipeline — intrinsic plus extrinsic — can run for many GPU-hours at 72B scale. On shared HPC infrastructure with SLURM job scheduling, preemption can interrupt a job at any point. Panopticon's calibration supports checkpoint/resume: each phase saves its output as JSON upon completion, and on restart, completed phases are loaded from checkpoint and skipped. This makes calibration robust to the gpu_requeue partition model, where jobs are regularly preempted and rescheduled.

Prompt screening

An additional investigation revealed that several of the 24 standard extraction prompts pre-saturate certain concept directions — the model's unsteered response already maximally expresses the concept, creating a ceiling effect that masks the steering signal. A prompt screening phase identifies concept-neutral alternatives for each vector, improving both extraction quality and evaluation sensitivity.

Output: the model profile

Both stages produce a single model_profile.json: a comprehensive psychometric profile of the model. For each concept, it contains the optimal probe and inject layers, d′ curves across all layers, quality grade, baseline statistics, the psychometric operating range, adaptive controller parameters, and the full orthogonality matrix. For the model globally, it contains per-layer activation norm statistics and the norm scaling factors.

At runtime, Panopticon loads this profile alongside the steering vectors. Parameters are resolved in a strict precedence chain: explicit per-request configuration wins over the calibration profile, which in turn overrides heuristic defaults (e.g., 0.65× depth for probe, 0.6× for inject). Every parameter that calibration discovers — layers, sensitivity, modulation floor, both norm scales, alpha range — becomes the default for that concept. In practice the profile eliminates manual tuning entirely.
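The precedence chain reduces to an ordered lookup; a sketch in which the function and key names are illustrative, not Panopticon's actual API:

```python
def resolve_param(name, request_cfg, profile, defaults):
    # Strict precedence: explicit request > calibration profile > heuristic default.
    for source in (request_cfg, profile, defaults):
        if source and source.get(name) is not None:
            return source[name]
    return None

profile  = {"inject_layer": 28, "alpha": 1.2}          # from model_profile.json
defaults = {"inject_layer": 24, "alpha": 1.0, "m_min": 0.1}  # heuristics
```

A request that pins alpha explicitly wins; a parameter absent from both request and profile falls through to the heuristic default.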

OpenAI-compatible, with extensions

Panopticon exposes an OpenAI-compatible chat completions API, extended with a steering field. Existing clients work unchanged for uninstrumented requests; steering is purely additive.
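A hypothetical request body: `model` and `messages` follow the standard OpenAI schema, but the field names inside `steering` (`concept`, `alpha`, `mode`) are assumptions for illustration, not Panopticon's documented schema.

```python
import json

request = {
    "model": "qwen2.5-14b-instruct",   # placeholder model name
    "messages": [{"role": "user", "content": "Explain KV caching."}],
    # Additive extension: omit "steering" and this is a plain OpenAI request.
    "steering": {"concept": "formality", "alpha": 1.5, "mode": "adaptive"},
}
body = json.dumps(request)             # POST to /v1/chat/completions
```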

Endpoints

POST /v1/chat/completions — standard chat with optional steering and telemetry.

GET /v1/vectors — list loaded vectors with metadata, quality grades, and alpha ranges.

GET /v1/calibration — return the full model profile.

GET /health — model status, loaded vectors, calibration readiness.

POST /v1/vectors/extract — extract a new steering vector from contrastive prompt pairs.

Transport

stdio — JSONL on stdin/stdout. Best for interactive research. SSH provides authentication and encryption. The model loads on connect and unloads on disconnect.

HTTP — FastAPI with Server-Sent Events for streaming. Best for persistent deployment on GPU clusters. Designed to run as a SLURM batch job with SSH tunnels for secure remote access.

Both transports share the same engine and steering logic. The choice is purely operational.

HPC deployment

Panopticon is designed for deployment on scientific computing infrastructure. A launcher manages the full lifecycle: SLURM job submission, model loading, SSH tunnel establishment, and health monitoring. The system automatically reconnects when tunnels drop and resubmits jobs when allocations expire.

Launcher (control plane)
│
├── Submit SLURM job ──────▶ GPU node
│                            ├── Load model + vectors
│                            ├── Run calibration
│                            └── Start HTTP server (localhost:9400)
│
├── Establish SSH tunnel ──▶ localhost:9400 ◀── GPU node:9400
│
├── Health monitor ────────▶ GET /health (every 30s)
│   └── On failure: reconnect tunnel or resubmit job
│
└── Web server ────────────▶ Client browsers (port 7860)

Interactive steering

The Panopticon Steering Demo provides an interactive interface for exploring steered generation. A dual-pane view shows unsteered (baseline) and steered responses side by side. Sparkline visualizations display real-time probe scores across all loaded concepts.

Browser interface

Select any concept and adjust its weight with a slider. Both panes receive the same prompt; only the right pane applies steering. Passive probes run on both sides, so you can compare the behavioral fingerprints of steered and unsteered generation in real time.

The interface supports three modes: side-by-side comparison, conversation (single-pane voice chat with live steering), and looping (the model generates one sentence at a time on a fixed prompt while the operator adjusts steering between iterations).

Hardware control

A MIDI controller bridge connects a Novation Launchpad to the steering server. The 8×8 grid maps concepts to rows, with columns indicating probe intensity (during generation) or weight (when setting steering). Scene buttons cycle through weight presets. LED colors match the browser's concept palette.

Voice mode enables spoken conversation with the steered model: speech-to-text on the controller, sentence-streaming TTS with telemetry synchronized to audio playback, and real-time probe visualization on the pad's LED grid.