Hypothesis Roulette

A surprising null result walks in. Spin up five testable mechanisms.

A core live challenge: hand the AI a baffling pain-science null result and race to generate five testable mechanistic hypotheses, then stress-test them for feasibility and design a discriminating experiment.

The data refused to cooperate. Now turn one stubborn null into five hypotheses worth testing.

35 min · Intermediate · Teams of 3-5 · Any chat AI · Recommended core challenge

The goal

Take one surprising null result and walk out with five genuinely different, testable mechanistic hypotheses that could explain it, plus one discriminating experiment (your team’s favorite) that would tell two of those hypotheses apart. The room then votes: which hypothesis is the most creative, and which is the most feasible? (They are rarely the same one, and that gap is the whole lesson.)

Why it matters

Null results are often where a project either stalls or pivots. A flat line in your data is not a dead end — it is a fork with several unmarked roads, and the skill is seeing them quickly before settling on the obvious one. Reviewers, PIs, and grant panels reward the researcher who can say “here are the competing explanations, and here is the experiment that adjudicates between them.”

That generative, divergent step is where a chat AI works well as a thinking partner: it has no attachment to your preferred theory, it will propose mechanisms you might hesitate to raise, and — if you ask it to — it will critique its own ideas on feasibility. You stay the scientist of record; it keeps the idea funnel full. By the end you will have practiced the move that turns a disappointing result into a follow-up aim.

Run of show

0:00–0:05 · Challenge introduction (5 min)
0:05–0:20 · Work in your group (15 min)
0:20–0:22 · Post your best prompt (2 min)
0:22–0:32 · Share & debrief (10 min)
0:32–0:35 · Reset (3 min)

Bad prompt to better prompt

Weak prompt

My pain study didn’t find a significant effect. What are some hypotheses for why?

The AI doesn’t know your design, so it hedges: “maybe your sample was too small, maybe there was measurement error, maybe the effect doesn’t exist.” All true, all useless. You get statistical excuses, not mechanisms — and nothing you could actually test next week.

Stronger prompt

You are a skeptical pain-neuroscience PI on my thesis committee. Below is a surprising NULL result with full methods. Assume the null is REAL and not a power problem — the study was well-powered. Generate 5 mechanistic hypotheses that could each independently explain why the expected effect was absent. Make them genuinely distinct (different levels: molecular, circuit, psychological, methodological-but-substantive, contextual). For each: (1) one-sentence mechanism, (2) the key prediction it makes that the others don’t, (3) a feasibility rating 1-5 for a grad student with a small budget, and (4) your confidence it’s the true explanation (low/med/high) with one sentence of why. Then pick the two hypotheses that are hardest to tell apart and propose ONE experiment that would discriminate between them.

[paste the study brief]

It assigns a role (skeptic), removes the easy out (forbids “underpowered”), forces divergent hypotheses across levels of analysis, and builds in self-critique (feasibility + confidence) and the discriminating-experiment payoff. You get a comparison table you can argue with, not a shrug.

Prompting moves to try

Decompose the creative goal. Don’t ask for “hypotheses” in one breath. Ask first for mechanisms at five different levels of analysis (molecular, circuit, systems/network, psychological/cognitive, social-contextual). Forcing distinct levels kills the “five flavors of the same idea” problem.
Role-play a hostile committee. “You are a reviewer who thinks my whole effect was a Type I error to begin with” surfaces sharper, less polite hypotheses than a neutral assistant. Swap roles (clinician vs. computational modeler) and compare what each notices.
Forbid the lazy answers. Explicitly ban “underpowered,” “noisy measurement,” and “the effect just isn’t real” unless the AI can specify the substantive version (e.g., “floor effect on the 0-10 NRS because baseline pain was already low”).
Adversarial self-evaluation. After it lists the five, say: “Now act as the harshest possible critic. Score each hypothesis 1-5 on feasibility AND plausibility, flag the one that’s secretly untestable, and tell me which two you’re least confident you can distinguish.” Calibrated doubt is more useful than confident filler.
Demand a discriminating experiment. The deliverable isn’t five ideas — it’s the one study that splits two of them. “Design the cheapest experiment whose result would be different under H2 vs H4, and state the predicted result for each.”
Ask it to improve your prompt. “Before answering, rewrite my prompt to get a more rigorous, more creative set of hypotheses, then answer your improved version.” Steal the upgrade for the next round.
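
The moves above can be folded into one reusable template. What follows is a minimal Python sketch, assuming the team pastes the resulting text into a chat AI by hand; `build_prompt`, `LEVELS`, and `BANNED` are hypothetical names introduced here for illustration and are not part of any tool.

```python
from typing import Sequence

# Hypothetical helper: assembles the "stronger prompt" from reusable parts.
# It only builds text; nothing here calls an AI service.

LEVELS = ("molecular", "circuit", "systems/network",
          "psychological/cognitive", "social-contextual")

BANNED = ("underpowered", "noisy measurement", "the effect just isn't real")


def build_prompt(
    brief: str,
    role: str = "a skeptical pain-neuroscience PI on my thesis committee",
    levels: Sequence[str] = LEVELS,
    banned: Sequence[str] = BANNED,
) -> str:
    """Return a prompt applying the role, level, and ban moves to one study brief."""
    level_lines = "\n".join(
        f"- H{i}: {level} level" for i, level in enumerate(levels, start=1)
    )
    banned_text = ", ".join(f'"{b}"' for b in banned)
    return (
        f"You are {role}.\n"
        "Below is a surprising NULL result with full methods. "
        "Assume the null is REAL and the study was well-powered.\n\n"
        f"Generate {len(levels)} mechanistic hypotheses, one per level of analysis:\n"
        f"{level_lines}\n\n"
        f"Do not answer with {banned_text} unless you state the substantive version.\n\n"
        "For each hypothesis give: (1) a one-sentence mechanism, "
        "(2) the key prediction the others do not make, "
        "(3) feasibility 1-5 for a grad student with a small budget, "
        "(4) your confidence (low/med/high) with one sentence of why.\n\n"
        "Then pick the two hypotheses that are hardest to tell apart and propose "
        "ONE experiment that would discriminate between them, stating the "
        "predicted result under each.\n\n"
        f"STUDY BRIEF:\n{brief.strip()}\n"
    )


if __name__ == "__main__":
    print(build_prompt("[paste the study brief]"))
```

Keeping the role, levels, and banned answers as parameters makes it easy to swap one move at a time (for example, a clinician role instead of the PI) and compare what changes in the output.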

Starter materials

Hand each team this brief. It has enough method detail to actually reason about, and the null is both realistic and genuinely surprising.

The Null Result — “The Placebo That Wasn’t”

Background. Conditioned placebo analgesia is one of the most reliable phenomena in pain research: pair an inert cream with a covertly lowered stimulus, and people later report less pain from that cream even at full stimulus intensity. Our lab set out to replicate and extend this in a within-subject heat-pain paradigm, then test whether the effect transfers to a novel body site.

Sample. N = 64 healthy adults (34 F, 30 M; ages 19-41). Power analysis targeted d = 0.45 at 90% power; final sample exceeded that target. No attrition issues; data quality checks passed.

Design & methods.

Stimuli: Contact thermode on the left volar forearm. Individually calibrated so “control” and “placebo” skin sites felt identical at baseline (both rated ~50/100 on a visual analog scale, VAS).
Conditioning (Day 1): Two visually distinct creams (“active analgesic” vs. “control”) applied to two adjacent forearm patches. Unknown to participants, the temperature under the “active” cream was covertly reduced by 2.5 °C during 16 conditioning trials, so it genuinely hurt less. Standard, well-validated deception.
Test (Day 1, same session): Temperatures equalized — both patches now delivered the identical, full-intensity stimulus. Participants rated pain on the VAS. Expected: lower pain under the “active” cream (the placebo effect).
Transfer test: Same creams applied to the right forearm (novel site), equalized full stimulus, rated.
Manipulation checks: Post-study, 89% of participants believed the “active” cream was a real analgesic (the deception held); expectancy ratings were high and did not differ from a prior cohort that showed robust placebo analgesia.

The surprising result. Placebo analgesia at the original conditioned site was essentially absent: mean VAS difference (control − placebo) = 0.8 points (95% CI: −1.6 to 3.2), Bayes factor favoring the null. The transfer site showed the same flat pattern. Yet expectancy was intact, conditioning trials confirmed participants noticed the relief during learning, and the same lab’s previous study (different cohort, nearly identical protocol) found a robust 14-point effect.

Quirks worth noting. The only procedural differences from the previous (successful) study were that conditioning and test happened in a single continuous session rather than across two days, and that the lab had moved to a new room with the thermostat set noticeably cooler (ambient ~19 °C vs. the old room’s ~23 °C).
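
Facilitator note: teams sometimes ask whether the brief’s numbers really rule out a power problem. The sketch below is a rough back-of-envelope check, not the lab’s actual analysis. It assumes a two-sided alpha of 0.05, a paired (within-subject) comparison, and a normal approximation to the t distribution; none of these is stated in the brief. The function names are hypothetical.

```python
import math
from statistics import NormalDist

# Figures taken from the brief.
N = 64
MEAN_DIFF = 0.8            # control minus placebo, VAS points
CI_LOW, CI_HIGH = -1.6, 3.2
TARGET_D = 0.45
TARGET_POWER = 0.90
PRIOR_EFFECT = 14.0        # VAS points in the earlier cohort

# ASSUMPTIONS (not in the brief): two-sided alpha = 0.05, paired design,
# normal approximation instead of the exact t distribution.
ALPHA = 0.05
z = NormalDist().inv_cdf


def required_n(d: float, power: float, alpha: float = ALPHA) -> int:
    """Approximate sample size for a paired test on standardized difference d."""
    return math.ceil(((z(1 - alpha / 2) + z(power)) / d) ** 2)


def implied_sd(ci_low: float, ci_high: float, n: int, alpha: float = ALPHA) -> float:
    """SD of the paired differences implied by the width of the CI."""
    standard_error = (ci_high - ci_low) / 2 / z(1 - alpha / 2)
    return standard_error * math.sqrt(n)


n_needed = required_n(TARGET_D, TARGET_POWER)
sd_diff = implied_sd(CI_LOW, CI_HIGH, N)
observed_d = MEAN_DIFF / sd_diff

print(f"Approximate n needed: {n_needed} (brief reports N = {N})")
print(f"Implied SD of differences: {sd_diff:.1f} VAS points")
print(f"Observed standardized difference: {observed_d:.2f}")
print(f"CI upper bound below prior effect: {CI_HIGH < PRIOR_EFFECT}")
```

Under these assumptions the approximation asks for roughly 52 participants, which is consistent with the brief’s statement that N = 64 exceeded the target, and the observed standardized difference comes out near 0.08. The upper CI bound of 3.2 points also sits well below the earlier cohort’s 14-point effect.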

Team scoresheet — fill one row per hypothesis

#	Mechanism (1 sentence)	Level (molecular / circuit / psych / contextual / methodological)	Unique prediction	Feasibility 1-5	AI’s confidence (L/M/H)
H1
H2
H3
H4
H5

Team’s favorite discriminating experiment (one paragraph): which two hypotheses does it separate, what’s the cheapest design, and what result is predicted under each?
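
Teams that prefer to keep the scoresheet in a script can mirror its columns directly. This is a minimal sketch with hypothetical names (`Hypothesis`, `validate`, `tally`). The check for repeated levels assumes the team wants one hypothesis per level of analysis, which the scoresheet itself does not require.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Hypothesis:
    label: str              # "H1" .. "H5"
    mechanism: str          # one sentence
    level: str              # molecular / circuit / psych / contextual / methodological
    unique_prediction: str  # what this hypothesis predicts that the others do not
    feasibility: int        # 1-5
    ai_confidence: str      # "L", "M", or "H"


def validate(sheet: list[Hypothesis]) -> list[str]:
    """Return a list of problems; an empty list means the sheet is complete."""
    problems = []
    if len(sheet) != 5:
        problems.append(f"expected 5 hypotheses, got {len(sheet)}")
    for h in sheet:
        if not 1 <= h.feasibility <= 5:
            problems.append(f"{h.label}: feasibility must be 1-5")
        if h.ai_confidence not in {"L", "M", "H"}:
            problems.append(f"{h.label}: confidence must be L, M, or H")
        if not h.unique_prediction.strip():
            problems.append(f"{h.label}: missing unique prediction")
    # Assumption: one hypothesis per level, to avoid five flavors of one idea.
    level_counts = Counter(h.level for h in sheet)
    for level, count in level_counts.items():
        if count > 1:
            problems.append(f"level '{level}' used more than once")
    return problems


def tally(votes: list[str]) -> str:
    """Return the label with the most votes (the first one seen wins a tie)."""
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    # Placeholder votes for illustration only, not real workshop data.
    creative_votes = ["H4", "H2", "H4"]
    feasible_votes = ["H1", "H1", "H3"]
    print("Most creative:", tally(creative_votes))
    print("Most feasible:", tally(feasible_votes))
```

Running `validate` before the room votes catches empty prediction cells early, since a hypothesis without a unique prediction cannot feed a discriminating experiment.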

Debrief questions

Which hypothesis got the “most creative” vote, and which got “most feasible” — and why is creativity so often inversely related to feasibility here?
Did the AI ever sneak a statistical excuse back in disguised as a mechanism? How did you catch it, and what does that tell you about reading its output critically?
Look at that ambient-temperature / single-session quirk. Did your team’s hypotheses converge on it, or did the AI surface a mechanism none of you had considered? Which is more useful?
For your discriminating experiment: would the predicted-results-under-each-hypothesis table actually convince a reviewer, or could a third hypothesis explain both outcomes?
Where did the AI sound most confident, and was that confidence earned by the evidence in the brief — or just fluent prose?
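
The third-hypothesis question can be made concrete by writing down what every hypothesis predicts for the proposed experiment, not only the two being compared. Below is a minimal sketch with hypothetical function names; the outcome labels in the example are placeholders, not predictions derived from the brief.

```python
from itertools import combinations


def separated_pairs(predictions: dict[str, str]) -> list[tuple[str, str]]:
    """Pairs of hypotheses whose predicted outcomes differ for this experiment."""
    return [
        (a, b)
        for a, b in combinations(predictions, 2)
        if predictions[a] != predictions[b]
    ]


def rivals(predictions: dict[str, str], pair: tuple[str, str]) -> list[str]:
    """Hypotheses outside the pair that predict the same outcome as either member."""
    pair_outcomes = {predictions[h] for h in pair}
    return [
        h for h, outcome in predictions.items()
        if h not in pair and outcome in pair_outcomes
    ]


if __name__ == "__main__":
    # Placeholder predictions for one candidate experiment (illustrative only).
    predictions = {
        "H1": "still flat",
        "H2": "effect returns",
        "H3": "still flat",
        "H4": "still flat",
        "H5": "effect returns",
    }
    print(("H2", "H4") in separated_pairs(predictions))
    print(rivals(predictions, ("H2", "H4")))
```

If `rivals` returns anything, the experiment separates the chosen pair but cannot single out either member on its own, which is the gap a reviewer would press on.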

Level up

Pre-register on the spot. Have the AI draft a one-paragraph pre-registration for your discriminating experiment: directional prediction, primary outcome, and the result that would falsify your favored hypothesis. If it can’t be falsified, it isn’t done.
Adversarial steelman. Feed your five hypotheses to a second AI (or a fresh chat) and ask it to find the single hypothesis that, if true, would make the other four irrelevant — then defend or dismantle that claim with your team.
Cost the experiment. Ask the AI for a rough resource estimate (participants, hours, equipment, IRB considerations) for your discriminating study, then sanity-check every number it gives you. Treat each figure as a claim to verify, not a fact.
