Junkyard Dog
Sniff out open pain datasets, fetch them, and wire them together with code.
Point the agent at the open web, have it fetch real data, and guide it as it stitches several messy tables into one you can actually analyze.
The goal
The Goal Produce one tidy table that combines self-reported pain ratings with basic demographics (age, sex) from at least two independent open datasets, with a column marking which dataset each row came from. Every pain rating must be on a common 0-10 scale, and you must be able to point to the exact source file and license for every value. We are not telling you which datasets, which scale conversions, or which tools to chain. That is the hunt.
Why it matters
Pain has no biomarker, so the field runs on self-report, and that self-report is scattered across a hundred incompatible spreadsheets: one study codes sex as 1/2, another as M/F; one measures pain on a 0-100 VAS, the next on a 0-10 NRS, the next on the McGill. Half the labor of any meta-analysis, secondary analysis, or “let me just check whether this replicates” is the unglamorous reconciliation work, and it is exactly the kind of multi-step, schema-wrangling task that agentic tools handle well. If the AI can find the data, read the data dictionary, and write the harmonization code while you supervise, a job that used to take days can fit into an afternoon. Build this skill and you stop being blocked on “I wish someone had already merged these.”
Run of show
- 0:00–0:05 · Challenge introduction (5 min)
- 0:05–0:20 · Work in your group (15 min)
- 0:20–0:22 · Post your best prompt (2 min)
- 0:22–0:32 · Share & debrief (10 min)
- 0:32–0:35 · Reset (3 min)
Bad prompt to better prompt
The agent grabs the first two CSVs it can find, concatenates columns with mismatched names, silently drops rows where a join key is missing, leaves three different pain scales side by side as if they were comparable, and reports “Merged successfully.” You have no idea where any number came from, the scales are incommensurable, and the “merge” is unusable.
source_file and license column on every row; (6) do not drop rows silently — report counts before and after each step and list anything you discarded and why. Work step by step and pause at the approval gate.
It wins because it names the target schema, forces the agent to read the codebook before coding, inserts a human approval gate at the riskiest moment (the variable mapping), demands provenance and row-count accounting, and bans the silent data loss that makes the bad version dangerous.
Prompting moves to try
- Decompose the hunt. Make the agent split the job into explicit stages — search -> shortlist with license check -> fetch -> inspect schema -> propose mapping -> transform -> verify — and report at the end of each. Long agentic runs drift; named checkpoints keep it honest.
- Role prompting. “You are a research data engineer / a reproducibility auditor” pulls in different behavior than the default helpful assistant — more provenance, more skepticism, more
assertstatements. - Adversarial self-evaluation. After it produces the merged table, ask: “Now act as a hostile reviewer. List the three most likely ways this harmonization is wrong, and give a 0-100 confidence that each pain value is correctly scaled.” Watch the confidence drop, then chase the cheap fixes.
- Insert an approval gate. Tell it to STOP and show the proposed source-to-target mapping before transforming. The mapping is where 90% of harmonization bugs live.
- Ask it to improve your prompt. “Before you start, rewrite my request to remove ambiguity and list the assumptions you’re about to make.” Cheap, and it surfaces the silent decisions (which join key? what about duplicate subjects?) before they cost you.
- Demand a verification artifact. Ask for a short Markdown provenance log alongside the table, so the work is auditable after the agent’s context is gone.
Starter materials
Where to sniff (open-data hunting grounds)
- OpenPain (openpain.org) — chronic pain neuroimaging + clinical/behavioral data from Apkarian lab studies.
- OpenNeuro — BIDS-formatted neuroimaging datasets; search “pain”, “CPM”, “thermal”, “fibromyalgia”; participants.tsv files carry age/sex/pain.
- NIH dbGaP / NIDA Data Share — clinical trials incl. opioid and chronic pain studies (note: controlled-access tiers exist; use only what is genuinely open).
- UK Data Service / ICPSR — survey datasets with chronic pain and demographic items.
- Zenodo, Figshare, OSF, Dryad — supplementary data from published pain papers; filter by license (look for CC0 / CC-BY).
- PhysioNet — physiological + some pain-related signal datasets (credentialing required for some).
- Government health surveys — NHANES, NHIS include validated pain items plus rich demographics, fully open.
Target schema (paste this to the agent)
A deliberately messy source table (Dataset A — paste as-is)
This is the kind of thing the agent will meet in the wild. Hand it this and ask it to map it into the target schema.
| ptid | yrs | gender | vas_pain | notes |
|---|---|---|---|---|
| A-01 | 34 | 1 | 72 | “moderate-severe” |
| A-02 | 51 | 2 | 0 | no pain today |
| A-03 | 28 | 1 | 95 | flare |
| A-04 | 63 | 2 | 40 | |
| A-05 | 45 | 9 | 58 | gender not reported |
Codebook for Dataset A: gender coded 1 = male, 2 = female, 9 = missing. vas_pain is a 0-100 visual analog scale. License: CC-BY-4.0.
A second, differently-messy source (Dataset B — paste as-is)
| id | age_y | sex | nrs | scale_max |
|---|---|---|---|---|
| B100 | 39 | F | 6 | 10 |
| B101 | 22 | M | 9 | 10 |
| B102 | 57 | F | 2 | 10 |
| B103 | 48 | U | 7 | 10 |
Codebook for Dataset B: sex coded F/M/U. nrs is a 0-10 numeric rating scale (scale_max confirms). License: CC0-1.0.
The trap: Dataset A’s pain is 0-100 and must be divided by 10; A’s sex coding (1=male) is the reverse of what people often assume; both have a missing/unknown sex category that must map to “other/unknown”, not be dropped.
Data-provenance & verification checklist (the deliverable’s report card)
Debrief questions
- Where did the agent get something subtly wrong — a flipped sex code, a scale left unconverted, a row quietly dropped — and would you have caught it without the checklist?
- Which tool found better data, and was that about the model or about how the team framed the search?
- What did the approval gate (showing the mapping before transforming) save you from, if anything?
- How confident are you that you could hand this table to a collaborator and have them trust every number? What’s still missing?
- When the agent claimed high confidence, was it calibrated? Where was it most overconfident?
Level up
- Add a third dataset with a genuinely different pain instrument (e.g. McGill Pain Questionnaire or PROMIS Pain Interference) and have the agent justify, in writing, whether and how it can be mapped to 0-10 at all — sometimes the honest answer is “it can’t, keep it separate.”
- Make it reproducible: have the agent emit a single script plus a
requirements/environment file and a README that re-downloads everything from scratch, so a stranger could rebuild your table. - Write the data statement: ask the agent to draft the “Data Availability” and “Data harmonization” paragraphs you’d put in a methods section, citing each source and license correctly.
Ethics
Responsible use Only fetch and combine data that is genuinely open, and read the license before you redistribute anything — CC-BY needs attribution, some “public” datasets forbid re-sharing, and controlled-access tiers (much of dbGaP) are off-limits without an approved application. Never feed restricted, identifiable, or credentialed data into a cloud agent. Harmonizing across studies can re-identify people even when each source was de-identified, so resist combining quasi-identifiers beyond what your goal needs. And treat the agent’s harmonization as a draft under audit, not as truth: a confidently merged table with a flipped scale is more dangerous than no table at all, because it looks finished.