IRB Speed Round

Draft a risks & benefits section fast — and catch the AI when it bluffs.

From a brief study description, draft a risks/benefits section — then hunt down the fabricated regs and invented numbers the AI slipped in.

You have a study, a deadline, and an AI that will readily cite a federal regulation that does not exist — your job is to draft fast and then catch what it gets wrong.

35 min · Intermediate · Teams of 3-5 · Any chat AI

The goal

Produce a tight, IRB-ready Risks & Benefits section for the study described below and, alongside it, a verification log that flags every factual claim the AI made that you could not confirm. A “winning” team turns in copy that a real IRB reviewer would nod at and a list of the bluffs they caught. We are deliberately not telling you how to prompt for either one. That is the puzzle.

Why it matters

Human-subjects protocols are where good science meets institutional reality. Every pain study you ever run — a thermal QST battery, a cuff-pressure paradigm, an opioid challenge, an fMRI scan in people who cannot lie flat without flaring — needs a risks-and-benefits section that is honest, specific, and correctly grounded in regulation. AI can get you from blank page to credible draft in minutes, which is genuinely useful when an IRB submission is due Friday.

It is also one of the clearest places to watch an LLM hallucinate with full confidence. It will invent regulatory citations (a plausible-looking “45 CFR” subpart that does not say what it claims), assign false numeric risk rates (“first-degree burns occur in approximately 2.3% of QST sessions”), and confidently mis-describe SAR limits or contrast-agent risks for a scan that does not even use contrast. If you cannot catch that in a low-stakes workshop, you are less likely to catch it in your grant, your methods section, or your consent form. This challenge builds the habit that matters most: draft fast, then verify before you trust.

Run of show

0:00–0:05 · Challenge introduction (5 min)
0:05–0:20 · Work in your group (15 min)
0:20–0:22 · Post your best prompt (2 min)
0:22–0:32 · Share & debrief (10 min)
0:32–0:35 · Reset (3 min)

Bad prompt to better prompt

Weak prompt

Write the risks and benefits section for my fMRI pain study and cite the relevant regulations.

The output reads fine and is almost entirely untrustworthy. It produces generic boilerplate (“participants may experience mild discomfort”), invents precise-sounding risk percentages with no source, and drops in official-looking citations like “per 45 CFR 46.404, thermal stimuli are limited to 50 degrees C”. That section number is real, but it belongs to Subpart D (research involving children) and says nothing about thermal limits. Worst of all, the output never tells you which claims it made up, so you cannot tell the real risks from the confabulated ones.

Stronger prompt

You are an experienced IRB analyst at a US academic medical center. Below is a study description. Draft a Risks & Benefits section in two columns: (1) the section text, and (2) a parallel verification log.

For EVERY factual claim — each risk, each probability or severity rating, and each regulatory or guideline citation — add a tag in the log: [VERIFIED] only if it is common, uncontroversial knowledge you are highly confident about; [CHECK] if it is plausible but I must confirm it against a primary source; [UNCERTAIN/POSSIBLY FABRICATED] if you are inferring or guessing. Do NOT invent specific numeric incidence rates; if you state one, mark it [CHECK] and name exactly what document I should look it up in.

After the draft, list the 5 claims most likely to be wrong and tell me how to verify each. Then rate your overall confidence 0-100 and explain the score.

STUDY: [paste the study description]

It works because it forces the model to separate prose it can fluently generate from facts it actually knows, bans the most dangerous behavior (made-up incidence rates) unless explicitly flagged, and ends with adversarial self-critique plus a self-rated confidence score. The AI becomes both drafter and its own first reviewer, though its tags and scores are leads for your own checking, not verification.
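Once the model returns a tagged log, a few lines of Python can tally the tags and surface claims that slipped through with no tag at all. The following is a minimal sketch, assuming the log comes back as plain text with one claim per line; the function names and the sample log are hypothetical and invented for illustration.

```python
import re
from collections import Counter

# Matches the three tags requested in the stronger prompt.
TAG_PATTERN = re.compile(r"\[(VERIFIED|CHECK|UNCERTAIN/POSSIBLY FABRICATED)\]")

def tally_tags(log_text: str) -> Counter:
    """Count how often each verification tag appears in the model's log."""
    return Counter(TAG_PATTERN.findall(log_text))

def untagged_lines(log_text: str) -> list[str]:
    """Return non-empty log lines that carry no tag at all."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if line.strip() and not TAG_PATTERN.search(line)
    ]

# Hypothetical log excerpt, invented for illustration only.
sample_log = """\
Heat stimuli may cause transient redness [VERIFIED]
Cold-water immersion raises blood pressure briefly [CHECK]
Burn incidence is 2.3% per session [UNCERTAIN/POSSIBLY FABRICATED]
MRI noise can be uncomfortable
"""

counts = tally_tags(sample_log)
missing = untagged_lines(sample_log)
print(dict(counts))
print("Untagged claims:", missing)
```

An untagged line is worth as much attention as an [UNCERTAIN] one: it means the model skipped the instruction for that claim. A [VERIFIED] tag is still only the model's own opinion.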

Prompting moves to try

Decompose the goal. Split the task into “draft the prose” and “audit the facts” as two passes (or two columns). Quality jumps when the model is not doing both at once.
Role / identity prompting. “You are an IRB analyst” vs. “you are the skeptical reviewer who rejects unsupported claims” produce very different drafts. Run both and diff them.
Adversarial self-evaluation. After the draft, ask: “Act as a hostile IRB reviewer. List every claim you cannot defend and every citation you are not certain is real. Score your own confidence 0-100 per claim.” Watch the confidence collapse on the fabricated bits.
Bait the bluff. Explicitly ask for “the exact CFR subpart and a one-line quote of what it says.” Fabrications get more obvious when the model has to produce a verbatim quote it does not have.
Ask it to improve your prompt. “Rewrite my prompt so the output is harder to hallucinate in. What did you change and why?” Then run the upgraded version.
Force a source map. Require a final line: “Which of these claims would survive a Google Scholar / official .gov check, and which would not?”
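The "run both and diff them" move can be done by eye, but a line-level diff makes the differences hard to miss. This is a rough sketch using Python's standard `difflib`; the two drafts are hypothetical one-claim-per-line excerpts, and `diff_drafts` is an illustrative name, not part of any tool.

```python
import difflib

def diff_drafts(analyst: str, skeptic: str) -> list[str]:
    """Return unified-diff lines between two drafts, compared line by line."""
    return list(
        difflib.unified_diff(
            analyst.splitlines(),
            skeptic.splitlines(),
            fromfile="analyst",
            tofile="skeptic",
            lineterm="",
        )
    )

# Hypothetical drafts from the two role prompts, invented for illustration.
analyst_draft = (
    "Heat stimuli may cause brief redness.\n"
    "Burns occur in 2.3% of sessions."
)
skeptic_draft = (
    "Heat stimuli may cause brief redness.\n"
    "Burn incidence: no sourced rate available."
)

changes = diff_drafts(analyst_draft, skeptic_draft)

# Keep only changed content lines, skipping the "---"/"+++" file headers.
# (A draft line that itself begins with "--" or "++" would be skipped too.)
removed = [l[1:] for l in changes if l.startswith("-") and not l.startswith("---")]
added = [l[1:] for l in changes if l.startswith("+") and not l.startswith("+++")]

print("Only in analyst draft:", removed)
print("Only in skeptic draft:", added)
```

Claims that appear only under the friendlier role are good candidates for the verification log, since the skeptical role declined to repeat them.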

Starter materials

Paste the study description below into your AI. The detail is deliberate — enough to draft from, with several traps where a careless model will fabricate.

Study description (paste this)

Title: Central sensitization signatures in chronic low back pain: a combined QST, thermal pain, and fMRI study.

Population: 60 adults (ages 18-65) with chronic low back pain (≥3 months, average intensity ≥4/10) and 30 pain-free controls. Excludes pregnancy, active substance use disorder, MRI contraindications (pacemaker, ferromagnetic implants, claustrophobia), and current opioid use above 30 morphine milligram equivalents/day.

Session 1 (≈2 hr, behavioral lab): Quantitative sensory testing (QST) per a standardized battery — thermal detection and pain thresholds via a contact thermode on the forearm and lower back (cutoff 50 degrees C), pressure pain thresholds via handheld algometer, and a brief conditioned pain modulation task using a cold-water immersion conditioning stimulus (hand in ~10 degrees C water for up to 60 s, participant may withdraw at any time).

Session 2 (≈1 hr, 3T MRI): Structural and functional MRI. During the scan, participants receive individually calibrated noxious heat stimuli (target 5-6/10 pain) to the lower leg via an MRI-compatible thermode, and rate pain on a visual analog scale. No contrast agent is used. Standard 3T scanner with manufacturer-standard SAR limits.

Compensation: 75 USD per session. Participants may stop at any time without penalty.

Data: De-identified; neuroimaging stored on an institutional secure server; a coded key links to a separate access-controlled file.

Verification log template (paste this)

Have the AI fill this in, one row per factual claim:
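The template itself does not appear in this copy of the page. As a stand-in, the sketch below generates a blank log as CSV that can be pasted into a chat or opened in a spreadsheet. Every column name here is an assumption chosen to match the tags and checks described in this challenge, not the original template.

```python
import csv
import io

# Assumed column names; the original template's columns are not shown here.
COLUMNS = [
    "claim",
    "claim_type",               # e.g. risk, incidence rate, citation
    "ai_tag",                   # VERIFIED / CHECK / UNCERTAIN
    "ai_confidence_0_100",
    "primary_source_to_check",
    "verified_by_hand",         # yes / no / could not find
    "notes",
]

def blank_log(n_rows: int = 10) -> str:
    """Build a CSV string with a header row and n_rows empty rows."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(COLUMNS)
    for _ in range(n_rows):
        writer.writerow([""] * len(COLUMNS))
    return buf.getvalue()

template = blank_log(3)
print(template)
```

Keeping "ai_tag" and "verified_by_hand" as separate columns is deliberate: the first records what the model said about itself, the second records what a person actually confirmed.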

Planted traps: see how many your team catches

These are the kinds of bluffs models commonly produce on this exact study. Do not show participants until the debrief.

Invented incidence numbers. e.g. “thermal QST causes burns in ~1-3% of sessions” — there is no such established rate; a 50 degrees C cutoff thermode is designed to prevent tissue damage.
Fabricated or misapplied citations. e.g. “per 45 CFR 46.404…” (that section sits in Subpart D, which covers research involving children, not thermal limits), or a made-up “FDA guideline” on cold-pressor duration.
Phantom risks. Listing gadolinium / contrast-agent risks, or IV-line risks — this study uses no contrast and no IV.
Wrong regulatory framing. Calling cold-water immersion or calibrated heat “greater than minimal risk” or “minimal risk” without justification, or citing a specific SAR number it cannot source.
Overstated direct benefit. Claiming participants will receive diagnostic or therapeutic benefit from a research scan, which they generally will not.
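For facilitators, several of the traps above leave surface patterns that a simple scan can pick up: a percentage, a CFR citation, or a contrast or IV term in a study that uses neither. The sketch below is a rough first-pass filter under those assumptions. The patterns, the function names, and the sample draft are all hypothetical, and a flag only means "look this up", never "this is wrong".

```python
import re

# Percentages such as "2.3%" or ranges such as "1-3%".
PERCENT = re.compile(r"\d+(?:\.\d+)?(?:\s*-\s*\d+(?:\.\d+)?)?\s*%")
# Citations shaped like "45 CFR 46.404" or "21 CFR 50".
CFR = re.compile(r"\b\d+\s+CFR\s+\d+(?:\.\d+)?\b")
# Terms that should not appear for a no-contrast, no-IV study.
# Expect false positives (e.g. "in contrast to").
PHANTOM = re.compile(r"\b(?:gadolinium|contrast|intravenous|IV)\b", re.IGNORECASE)

def flag_line(line: str) -> list[str]:
    """Return the reasons, if any, that a line needs manual verification."""
    flags = []
    if PERCENT.search(line):
        flags.append("numeric rate")
    if CFR.search(line):
        flags.append("regulatory citation")
    if PHANTOM.search(line):
        flags.append("possible phantom risk")
    return flags

def scan_draft(draft: str) -> dict[str, list[str]]:
    """Map each flagged line of the draft to its list of flags."""
    results = {}
    for line in draft.splitlines():
        flags = flag_line(line)
        if flags:
            results[line.strip()] = flags
    return results

# Hypothetical AI output, built from the trap examples in this section.
sample_draft = (
    "Per 45 CFR 46.404, thermal stimuli are limited to 50 degrees C.\n"
    "Burns occur in ~1-3% of sessions.\n"
    "Gadolinium may cause allergic reactions.\n"
    "Heat may cause brief discomfort."
)

flagged = scan_draft(sample_draft)
for text, reasons in flagged.items():
    print(reasons, "->", text)
```

A scan like this cannot catch wrong regulatory framing or overstated benefit, which depend on meaning rather than surface form; those still need a human reader.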

Debrief questions

Which claims did the AI mark high-confidence that turned out to be fabricated? What did “confident and wrong” look like — and would you have caught it without the verification log?
Of the planted traps, which did your team catch and which slipped through? What made the missed ones convincing?
Did adding a role (“hostile reviewer”) or a self-scoring step actually change what the AI was willing to flag, or just change the tone?
Where is the line between a risk the AI can reasonably state (discomfort from heat) and a number it must never invent (a burn incidence rate)?
If you had 60 more seconds before submitting this to a real IRB, what is the single highest-value thing you would verify by hand?

Level up

Cross-examine two models. Run the same prompt on two different AIs and have a third pass adjudicate: “Here are two risk sections. Which claims do they disagree on, and which disagreement signals a likely hallucination?”
Build a reusable verification harness. Turn your best prompt into a saved template that ends with a calibration table and a “claims to verify” list — something you could reuse for your next real protocol’s first draft.
Trace one citation to ground truth. Pick the single most official-sounding citation the AI produced and actually find the primary source (or prove it does not exist). Report back what the regulation really says.
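One way to start on the reusable harness is to store the prompt as a template with a slot for the study description. The sketch below uses Python's standard `string.Template`; the wording is condensed from the stronger prompt earlier in this challenge, and the names `HARNESS` and `build_prompt` are hypothetical.

```python
from string import Template

# Condensed from the stronger prompt; adapt the wording to your own best version.
HARNESS = Template("""\
You are an experienced IRB analyst at a US academic medical center.
Draft a Risks & Benefits section for the study below, with a parallel verification log.
Tag every factual claim [VERIFIED], [CHECK], or [UNCERTAIN/POSSIBLY FABRICATED].
Do NOT invent numeric incidence rates.

STUDY:
$study

End with:
1. A calibration table: one row per claim with a 0-100 confidence score.
2. A "claims to verify" list of the $n_claims claims most likely to be wrong, with how to verify each.
""")

def build_prompt(study: str, n_claims: int = 5) -> str:
    """Fill the saved template with a study description."""
    if not study.strip():
        raise ValueError("study description is empty")
    return HARNESS.substitute(study=study.strip(), n_claims=n_claims)

# Hypothetical one-line study description, for illustration only.
prompt = build_prompt("No contrast agent is used.")
print(prompt)
```

Saving the template in a file next to your protocol drafts means the verification steps travel with the prompt instead of being retyped, and forgotten, under deadline.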

Ethics

Responsible use

AI can draft a first pass of a risks/benefits section, but you are accountable for every word submitted to an IRB. Never submit AI-generated regulatory citations, incidence rates, or risk language without verifying each against a primary source: fabricated regs in a protocol are a real integrity problem, not a typo. Use AI on your own draft protocols only; do not paste collaborators’ unpublished study designs into tools without permission, and never use it to process or fast-track someone else’s confidential submission you have been assigned to review. The verification log is not busywork; it is the part you sign your name to.
