Junkyard Dog

Sniff out open pain datasets, fetch them, and wire them together with code.

An agentic challenge: search for open pain-science datasets, inspect their schemas, download them, and harmonize them into one analyzable table.

Point the agent at the open web, have it fetch real data, and guide it as it stitches several messy tables into one you can actually analyze.

35 min Advanced · Claude Code / Codex teams of 3-5 Agentic CLI

The goal

The Goal Produce one tidy table that combines self-reported pain ratings with basic demographics (age, sex) from at least two independent open datasets, with a column marking which dataset each row came from. Every pain rating must be on a common 0-10 scale, and you must be able to point to the exact source file and license for every value. We are not telling you which datasets, which scale conversions, or which tools to chain. That is the hunt.

Why it matters

Pain has no biomarker, so the field runs on self-report, and that self-report is scattered across a hundred incompatible spreadsheets: one study codes sex as 1/2, another as M/F; one measures pain on a 0-100 VAS, the next on a 0-10 NRS, the next on the McGill. Half the labor of any meta-analysis, secondary analysis, or “let me just check whether this replicates” is the unglamorous reconciliation work, and it is exactly the kind of multi-step, schema-wrangling task that agentic tools handle well. If the AI can find the data, read the data dictionary, and write the harmonization code while you supervise, a job that used to take days can fit into an afternoon. Build this skill and you stop being blocked on “I wish someone had already merged these.”

Run of show

0:00–0:05 · Challenge introduction (5 min)
0:05–0:20 · Work in your group (15 min)
0:20–0:22 · Post your best prompt (2 min)
0:22–0:32 · Share & debrief (10 min)
0:32–0:35 · Reset (3 min)

Bad prompt to better prompt

Weak prompt

Find some open pain datasets and merge them into one table.

The agent grabs the first two CSVs it can find, concatenates columns with mismatched names, silently drops rows where a join key is missing, leaves three different pain scales side by side as if they were comparable, and reports “Merged successfully.” You have no idea where any number came from, the scales are incommensurable, and the “merge” is unusable.

Strong prompt

You are a research data engineer assembling a harmonized pain dataset. GOAL: one tidy table with columns [subject_id, dataset, age, sex, pain_0_10]. Constraints: (1) use at least two independent open datasets that include self-reported pain plus age and sex; (2) before writing any merge code, fetch and quote each dataset’s data dictionary / codebook, and show me the raw column names, units, and coding for sex and pain; (3) STOP and show me your proposed mapping table (source column -> target column, plus any scale conversion formula, e.g. VAS_0_100/10) and wait for my approval before transforming anything; (4) convert all pain measures to a 0-10 scale and state the formula in a comment; (5) keep a source_file and license column on every row; (6) do not drop rows silently — report counts before and after each step and list anything you discarded and why. Work step by step and pause at the approval gate.

It wins because it names the target schema, forces the agent to read the codebook before coding, inserts a human approval gate at the riskiest moment (the variable mapping), demands provenance and row-count accounting, and bans the silent data loss that makes the bad version dangerous.

Prompting moves to try

Decompose the hunt. Make the agent split the job into explicit stages — search -> shortlist with license check -> fetch -> inspect schema -> propose mapping -> transform -> verify — and report at the end of each. Long agentic runs drift; named checkpoints keep it honest.
Role prompting. “You are a research data engineer / a reproducibility auditor” pulls in different behavior than the default helpful assistant — more provenance, more skepticism, more assert statements.
Adversarial self-evaluation. After it produces the merged table, ask: “Now act as a hostile reviewer. List the three most likely ways this harmonization is wrong, and give a 0-100 confidence that each pain value is correctly scaled.” Watch the confidence drop, then chase the cheap fixes.
Insert an approval gate. Tell it to STOP and show the proposed source-to-target mapping before transforming. The mapping is where 90% of harmonization bugs live.
Ask it to improve your prompt. “Before you start, rewrite my request to remove ambiguity and list the assumptions you’re about to make.” Cheap, and it surfaces the silent decisions (which join key? what about duplicate subjects?) before they cost you.
Demand a verification artifact. Ask for a short Markdown provenance log alongside the table, so the work is auditable after the agent’s context is gone.

Starter materials

Where to sniff (open-data hunting grounds)

OpenPain (openpain.org) — chronic pain neuroimaging + clinical/behavioral data from Apkarian lab studies.
OpenNeuro — BIDS-formatted neuroimaging datasets; search “pain”, “CPM”, “thermal”, “fibromyalgia”; participants.tsv files carry age/sex/pain.
NIH dbGaP / NIDA Data Share — clinical trials incl. opioid and chronic pain studies (note: controlled-access tiers exist; use only what is genuinely open).
UK Data Service / ICPSR — survey datasets with chronic pain and demographic items.
Zenodo, Figshare, OSF, Dryad — supplementary data from published pain papers; filter by license (look for CC0 / CC-BY).
PhysioNet — physiological + some pain-related signal datasets (credentialing required for some).
Government health surveys — NHANES, NHIS include validated pain items plus rich demographics, fully open.

Target schema (paste this to the agent)

subject_id : string, unique within dataset dataset : string, short label e.g. “nhanes_2017” or “openneuro_ds00XXXX” age : numeric, years sex : categorical, normalized to {“female”,“male”,“other/unknown”} pain_0_10 : numeric, 0-10 NRS; all source scales converted to this source_file: string, filename or URL the row came from license : string, e.g. “CC0-1.0”, “CC-BY-4.0”, “NHANES public domain”

A deliberately messy source table (Dataset A — paste as-is)

This is the kind of thing the agent will meet in the wild. Hand it this and ask it to map it into the target schema.

ptid	yrs	gender	vas_pain	notes
A-01	34	1	72	“moderate-severe”
A-02	51	2	0	no pain today
A-03	28	1	95	flare
A-04	63	2	40
A-05	45	9	58	gender not reported

Codebook for Dataset A: gender coded 1 = male, 2 = female, 9 = missing. vas_pain is a 0-100 visual analog scale. License: CC-BY-4.0.

A second, differently-messy source (Dataset B — paste as-is)

id	age_y	sex	nrs	scale_max
B100	39	F	6	10
B101	22	M	9	10
B102	57	F	2	10
B103	48	U	7	10

Codebook for Dataset B: sex coded F/M/U. nrs is a 0-10 numeric rating scale (scale_max confirms). License: CC0-1.0.

The trap: Dataset A’s pain is 0-100 and must be divided by 10; A’s sex coding (1=male) is the reverse of what people often assume; both have a missing/unknown sex category that must map to “other/unknown”, not be dropped.

Data-provenance & verification checklist (the deliverable’s report card)

Every dataset has a recorded license, and that license permits reuse and redistribution of derived tables.
A source-to-target mapping table exists, showing each original column and its transformation.
Every scale conversion formula is written down (e.g. pain_0_10 = vas_pain / 10).
Row counts are reported before and after each step; any dropped rows are listed with a reason.
No silent missing-data handling — 9, U, blank, and NA are explicitly accounted for.
pain_0_10 values are all within [0, 10] (spot-check the min/max).
sex contains only the three allowed categories.
Each row carries a working source_file reference you could re-download from.
The agent’s confidence claims were checked against the actual data, not taken on faith.

Debrief questions

Where did the agent get something subtly wrong — a flipped sex code, a scale left unconverted, a row quietly dropped — and would you have caught it without the checklist?
Which tool found better data, and was that about the model or about how the team framed the search?
What did the approval gate (showing the mapping before transforming) save you from, if anything?
How confident are you that you could hand this table to a collaborator and have them trust every number? What’s still missing?
When the agent claimed high confidence, was it calibrated? Where was it most overconfident?

Level up

Add a third dataset with a genuinely different pain instrument (e.g. McGill Pain Questionnaire or PROMIS Pain Interference) and have the agent justify, in writing, whether and how it can be mapped to 0-10 at all — sometimes the honest answer is “it can’t, keep it separate.”
Make it reproducible: have the agent emit a single script plus a requirements/environment file and a README that re-downloads everything from scratch, so a stranger could rebuild your table.
Write the data statement: ask the agent to draft the “Data Availability” and “Data harmonization” paragraphs you’d put in a methods section, citing each source and license correctly.

Ethics

Responsible use Only fetch and combine data that is genuinely open, and read the license before you redistribute anything — CC-BY needs attribution, some “public” datasets forbid re-sharing, and controlled-access tiers (much of dbGaP) are off-limits without an approved application. Never feed restricted, identifiable, or credentialed data into a cloud agent. Harmonizing across studies can re-identify people even when each source was de-identified, so resist combining quasi-identifiers beyond what your goal needs. And treat the agent’s harmonization as a draft under audit, not as truth: a confidently merged table with a flipped scale is more dangerous than no table at all, because it looks finished.

← Challenge menu AI Toolkit →