Designing and Running a Study

A study is a research question asked on top of one or more scenarios. It defines:

- the hypotheses you want to test
- the conditions (what you vary between runs)
- the scenarios you run them on
- how many replicate seeds to use

Studies live in experiments/studies/<study_name>/ and are version-controlled. The scenario is the stage; the study is the experiment.

Shortcut: If you are using a coding agent (Claude Code, Cursor, etc.), you can type /new-study to be guided through this process interactively.


Concepts

Scenario — A shared social world (agents, backend, event). Reusable across studies.

Hypothesis — A falsifiable claim about what will happen if you vary something. Example: "Larger LLMs produce more stylistically diverse posts."

Condition — One value of the independent variable. A hypothesis has 2+ conditions. Example: sim.llm.name=gpt-4o-mini and sim.llm.name=gpt-4o.

Run — One simulation of one (condition x scenario x seed) combination.

Evaluation — A script that reads a completed run and produces metrics (eval.json).
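
To make the definition of a run concrete, the sketch below enumerates one run per (condition x scenario x seed) combination. The condition, scenario, and seed values are illustrative placeholders, and the enumeration order is an assumption rather than the runner's actual scheduling.

```python
from itertools import product

# Illustrative values only; a real study declares these in study.yaml.
conditions = ["rich", "thin"]
scenarios = ["neighborhood_forum", "hobby_collective"]
seeds = [42, 43, 44]

# One run per (condition, scenario, seed) combination.
runs = list(product(conditions, scenarios, seeds))

print(len(runs))  # 2 conditions * 2 scenarios * 3 seeds = 12
print(runs[0])    # ('rich', 'neighborhood_forum', 42)
```

The total grows multiplicatively, so adding one scenario or one extra seed adds a full slice of runs for every condition.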


Step 1 — Create the study directory

experiments/studies/my_study/
  study.yaml      ← you write this
  eval.py         ← you write this (or use built-in presets)
  notebook.ipynb  ← you create this after runs complete

Step 2 — Write study.yaml

This is the single source of truth. It describes your research question and hypotheses, and maps each condition to a concrete run configuration.

schema_version: 1

study:
  name: my_study
  question: >-
    Does a richer agent persona produce more stylistically distinct posts?
  scenarios:
    - neighborhood_forum
    - hobby_collective
  run_defaults:
    config_path: scenarios/{scenario}/conf
    seed_start: 42
    seed_repeats: 3          # runs seeds 42, 43, 44 for each condition x scenario
    overrides:
      num_steps: 10

hypotheses:
  h1_persona_richness:
    statement: >-
      Agents with detailed backstory context produce more stylistically
      diverse posts than agents with minimal context.
    independent_variable: persona
    prediction: >-
      Rich persona condition will show higher inter-agent distinctiveness
      across both scenarios.
    status: testing
    conditions:
      rich:
        overrides:
          agents: default       # uses scenarios/<name>/conf/agents/default.yaml
      thin:
        overrides:
          agents: thin          # uses scenarios/<name>/conf/agents/thin.yaml

Key fields:

| Field | What it does |
| --- | --- |
| study.scenarios | List of scenarios to run each condition on |
| run_defaults.seed_start + seed_repeats | Expands to N consecutive seeds per run |
| hypotheses.<id>.conditions.<name>.overrides | Hydra CLI overrides for this condition |
| hypotheses.<id>.status | testing / supported / refuted / inconclusive |

The overrides values are passed directly to uv run silisocs as CLI overrides (e.g. agents=thin becomes --config-path ... agents=thin).
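
As a rough illustration of how one study entry could expand into concrete run settings, the sketch below turns seed_start and seed_repeats into consecutive seeds and formats an overrides mapping as Hydra-style key=value strings. The helper names expand_seeds and to_cli_overrides are hypothetical, and letting condition overrides take precedence over run_defaults is an assumption about the runner, not documented behavior.

```python
def expand_seeds(seed_start: int, seed_repeats: int) -> list[int]:
    """Return seed_repeats consecutive seeds starting at seed_start."""
    return list(range(seed_start, seed_start + seed_repeats))


def to_cli_overrides(defaults: dict, condition: dict) -> list[str]:
    """Merge two override mappings and format them as key=value strings.

    Assumption: condition-level overrides win over run_defaults overrides.
    """
    merged = {**defaults, **condition}
    return [f"{key}={value}" for key, value in merged.items()]


seeds = expand_seeds(seed_start=42, seed_repeats=3)
args = to_cli_overrides({"num_steps": 10}, {"agents": "thin"})

print(seeds)  # [42, 43, 44]
print(args)   # ['num_steps=10', 'agents=thin']
```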


Step 3 — Write eval.py (or use built-in presets)

eval.py reads a completed run directory and writes eval.json with your metrics.

Minimal interface:

uv run python experiments/studies/my_study/eval.py \
    --run-dir outputs/neighborhood_forum_experiment/2026-05-01T10-00-00 \
    --output  path/to/eval.json

To use the built-in action and probe metrics without a custom script, add to study.yaml:

evaluations:
  - id: action_metrics
    preset: builtin.action_metrics_detailed
  - id: probe_metrics
    preset: builtin.probe_metrics_detailed

If you need custom metrics (e.g. lexical diversity, inter-agent distinctiveness), write eval.py. See docs/study_schema.md for the required output format.
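
A minimal sketch of a custom eval.py that accepts the --run-dir and --output flags and computes one lexical diversity metric (type-token ratio) might look like the following. The posts.jsonl file name, its "text" field, and the flat {"lexical_diversity": ...} output are all hypothetical placeholders; the real run-directory layout and the required eval.json format are defined by the project and docs/study_schema.md.

```python
import argparse
import json
from pathlib import Path


def type_token_ratio(texts: list[str]) -> float:
    """Unique tokens divided by total tokens across all texts (0.0 if empty)."""
    tokens = [tok.lower() for text in texts for tok in text.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def load_posts(run_dir: Path) -> list[str]:
    """Read post texts from a HYPOTHETICAL posts.jsonl, one {"text": ...} per line."""
    with (run_dir / "posts.jsonl").open(encoding="utf-8") as handle:
        return [json.loads(line)["text"] for line in handle if line.strip()]


def build_parser() -> argparse.ArgumentParser:
    """Declare the two flags used by the documented eval.py interface."""
    parser = argparse.ArgumentParser(description="Evaluate one completed run.")
    parser.add_argument("--run-dir", type=Path, required=True)
    parser.add_argument("--output", type=Path, required=True)
    return parser


def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    metrics = {"lexical_diversity": type_token_ratio(load_posts(args.run_dir))}
    args.output.parent.mkdir(parents=True, exist_ok=True)
    # Placeholder layout; check docs/study_schema.md for the required format.
    args.output.write_text(json.dumps(metrics, indent=2), encoding="utf-8")
```

Calling main() under an if __name__ == "__main__": guard at the bottom of the file would make the script invocable with the flags shown in the minimal interface.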


Step 4 — Run the study

# Plan: preview what will be run without executing
uv run python -m experiments.run_study \
    --study experiments/studies/my_study plan

# Run all conditions × scenarios × seeds
uv run python -m experiments.run_study \
    --study experiments/studies/my_study run

# Run only one hypothesis
uv run python -m experiments.run_study \
    --study experiments/studies/my_study run \
    --only-hypothesis h1_persona_richness

The runner writes deterministic run records and organized artifacts. To reuse existing outputs instead of running a condition again, set execution.mode: reuse_existing and list the prior output paths under reuse.runs in study.yaml.

Outputs land in:

experiments/studies/my_study/runs/h1_persona_richness/persona=rich/neighborhood_forum/seed_42/run/
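
The example path appears to follow the pattern runs/<hypothesis>/<independent_variable>=<condition>/<scenario>/seed_<seed>/run/. The sketch below rebuilds it with pathlib; the pattern is inferred from the single example above, and run_output_dir is a hypothetical helper, not part of the runner.

```python
from pathlib import Path


def run_output_dir(study_dir, hypothesis, variable, condition, scenario, seed) -> Path:
    """Build the run directory for one (condition, scenario, seed) combination.

    Assumption: the third segment is <independent_variable>=<condition>.
    """
    return (
        Path(study_dir)
        / "runs"
        / hypothesis
        / f"{variable}={condition}"
        / scenario
        / f"seed_{seed}"
        / "run"
    )


path = run_output_dir(
    "experiments/studies/my_study",
    "h1_persona_richness",
    "persona",
    "rich",
    "neighborhood_forum",
    42,
)
print(path.as_posix())
```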


Step 5 — Analyse results

After runs complete, open or create experiments/studies/my_study/notebook.ipynb.

The standard notebook structure has 9 sections:

1. Title + setup (load study.yaml and all eval.json files)
2. Study overview (hypothesis table, condition summary)
3. Key metrics explained
4. Headline comparison (grouped bar chart)
5. Full metric profile (radar chart)
6. Scenario consistency (faceted by scenario)
7. Per-agent distributions (strip plots)
8. Behavioral breakdown (action type counts)
9. Takeaways (key findings, limitations, next steps)

See docs/study_schema.md for the full notebook conventions.
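
For the setup section of the notebook, one possible way to gather every eval.json and compare conditions is sketched below using only the standard library. It assumes that each eval.json sits somewhere under its seed directory in the runs/ tree and that metrics form a flat mapping of numbers; both are assumptions to check against docs/study_schema.md. All function names are hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def parse_run_path(relative: Path) -> dict:
    """Split <hypothesis>/<variable>=<condition>/<scenario>/seed_<n>/... into fields."""
    hypothesis, cond_part, scenario, seed_part = relative.parts[:4]
    variable, _, condition = cond_part.partition("=")
    return {
        "hypothesis": hypothesis,
        "variable": variable,
        "condition": condition,
        "scenario": scenario,
        "seed": int(seed_part.removeprefix("seed_")),
    }


def collect_evals(study_dir: Path) -> list[dict]:
    """Load every eval.json under runs/ and attach the fields parsed from its path."""
    runs_root = study_dir / "runs"
    rows = []
    for eval_file in sorted(runs_root.rglob("eval.json")):
        row = parse_run_path(eval_file.relative_to(runs_root))
        row["metrics"] = json.loads(eval_file.read_text(encoding="utf-8"))
        rows.append(row)
    return rows


def mean_by_condition(rows: list[dict], metric: str) -> dict[str, float]:
    """Average one metric per condition across scenarios and seeds."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["condition"]].append(row["metrics"][metric])
    return {condition: mean(values) for condition, values in grouped.items()}
```

The per-condition means returned by mean_by_condition are the kind of numbers a headline comparison chart would plot, while the individual rows keep scenario and seed available for the faceted views.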


Step 6 — Record findings and add follow-up hypotheses

When a hypothesis is complete, update its status and add a finding:

h1_persona_richness:
  status: supported
  finding: >-
    Rich persona condition showed 2.4× higher inter-agent distinctiveness
    than thin condition across both scenarios.

To add a follow-up hypothesis motivated by this finding:

h2_model_capacity:
  follows_from: h1_persona_richness
  motivation: >-
    H1 confirmed persona richness drives diversity. H2 asks whether model
    capacity amplifies or dampens this effect.
  statement: ...
  independent_variable: model
  status: testing
  conditions:
    gpt4o-mini:
      overrides: {llm.name: gpt-4o-mini}
    gpt4o:
      overrides: {llm.name: gpt-4o}

Reusing runs across hypotheses

If a condition from an earlier hypothesis serves as the control for a later one, reference the same run paths in both condition entries. The organizer handles duplicate paths — the run is not re-executed, just re-linked.
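
The idea of executing a shared run once and linking it to several conditions can be illustrated as follows. This is only a sketch of the deduplication concept, not the organizer's implementation, and the run paths and condition pairings are made-up placeholders.

```python
from collections import defaultdict

# (hypothesis, condition) -> run paths; all values are illustrative.
condition_runs = {
    ("h1_persona_richness", "rich"): ["runs/a/seed_42", "runs/a/seed_43"],
    ("h2_model_capacity", "gpt4o-mini"): ["runs/a/seed_42", "runs/a/seed_43"],
    ("h2_model_capacity", "gpt4o"): ["runs/b/seed_42"],
}

# Invert the mapping: each run path points at every condition that uses it.
links = defaultdict(list)
for key, paths in condition_runs.items():
    for path in paths:
        links[path].append(key)

to_execute = list(links)  # each unique path appears exactly once
shared = {path: users for path, users in links.items() if len(users) > 1}

print(len(to_execute))  # 3 unique runs for 5 condition references
print(sorted(shared))   # ['runs/a/seed_42', 'runs/a/seed_43']
```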


Where to look next

  • Full study.yaml schema: docs/study_schema.md — all fields, file formats, eval.json spec
  • Scenario design: docs/scenario_guide.md — how to build the scenario you study
  • Existing studies: experiments/studies/style_diversity/ — a working example
  • Run study CLI help: uv run python -m experiments.run_study --help