Rapid Review

A 2-page evidence brief with real references, built in minutes.

Use search-grounded AI to answer a hard pain-neuroscience question, then verify every citation before you trust it.

A real question, a real answer, real references — and no hallucinated citations to walk back in front of your PI.

35 min · Advanced · Teams of 3-5 · Research-capable AI

The goal

Produce a tight, 2-page evidence brief that answers a thorny pain-neuroscience question — clear bottom line, the 5-8 most load-bearing papers, and a short “what’s still uncertain” section — fast enough that you’d actually do it before a journal club or a grant paragraph. Every claim should be traceable to a source that really exists and really says what you claim it says.

That’s the whole goal. How you get there — which tool, which prompt, which order — is up to your team.

Why it matters

This is one of the most common AI tasks in academic life, and one that people routinely do badly. You need to get oriented in an unfamiliar literature in an afternoon: a reviewer asks about a mechanism you half-know, a collaborator floats a hypothesis, a grant needs a “significance” paragraph with citations by Friday. A generic chatbot will readily write you a fluent, confident, well-formatted review studded with references that do not exist.

The trap isn’t using AI to read the literature. The trap is trusting the bibliography. The skill that separates a researcher from a chatbot is the verification loop: search-grounded retrieval, then human checking of every citation. Get that loop right and it becomes a reliable part of your workflow. Get it wrong and you’ve put a phantom paper into a manuscript that goes to peer review.

Run of show

0:00–0:05 · Challenge introduction (5 min)
0:05–0:20 · Work in your group (15 min)
0:20–0:22 · Post your best prompt (2 min)
0:22–0:32 · Share & debrief (10 min)
0:32–0:35 · Reset (3 min)

Bad prompt to better prompt

Weak prompt

Write a review with references about whether opioids change the brain.

Why it disappoints: no scope, no audience, no time window, no tool grounding, and — fatally — no instruction to cite real, verifiable sources. A plain chatbot answers from memory and invents plausible-looking references (right-sounding authors, real journals, fabricated DOIs). It reads great and is partly fiction.

Stronger prompt

You are a pain neuroscientist writing a 2-page evidence brief for a graduate journal club. Use web search and cite only sources you can retrieve right now; do not cite from memory.

Question: Does chronic opioid use cause remodeling of opioid receptors and related reward/pain circuits in the human brain, and which of those changes are reversible after cessation?

Decompose into: (1) receptor-level changes (mu-opioid receptor density/availability, e.g. PET findings), (2) circuit/structural changes (reward, descending modulation), (3) reversibility with abstinence. For each sub-question give the 2-3 strongest sources, prefer human studies and recent systematic reviews/meta-analyses, and note effect direction and sample size.

Output: a bottom-line answer (3 sentences), the three sub-sections, an “Uncertainties & conflicting findings” box, and a numbered reference list with title, journal, year, and DOI/PMID for each. After the list, rate your confidence (0-100%) that EACH reference is real and correctly described, and flag any you are unsure about.

Why it works: it sets a role and audience, forces search grounding (“don’t cite from memory”), decomposes the question so retrieval is targeted, demands verifiable identifiers, and ends with an adversarial self-check that surfaces the references most likely to be hallucinated — exactly the ones your team should verify first.
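If your team wants to reuse the stronger prompt across the question set, one option is to assemble it from its parts. The sketch below is only an illustration: the function name build_brief_prompt and its parameters are hypothetical, and the wording is copied from the stronger prompt above rather than from any tool's required format.

```python
def build_brief_prompt(role, audience, question, sub_questions, sources_per_sub="2-3"):
    """Assemble a search-grounded evidence-brief prompt from its parts.

    Hypothetical helper: the structure mirrors the stronger prompt
    (role + grounding, question, decomposition, output + self-check).
    """
    # Number the sub-questions so retrieval can be targeted one at a time.
    numbered = ", ".join(f"({i}) {sq}" for i, sq in enumerate(sub_questions, start=1))
    parts = [
        # Role, audience, and the grounding instruction.
        f"You are {role} writing a 2-page evidence brief for {audience}. "
        "Use web search and cite only sources you can retrieve right now; "
        "do not cite from memory.",
        f"Question: {question}",
        # Decomposition plus the evidence preferences.
        f"Decompose into: {numbered}. For each sub-question give the "
        f"{sources_per_sub} strongest sources, prefer human studies and recent "
        "systematic reviews/meta-analyses, and note effect direction and sample size.",
        # Output format, verifiable identifiers, and the self-check.
        'Output: a bottom-line answer (3 sentences), one sub-section per sub-question, '
        'an "Uncertainties & conflicting findings" box, and a numbered reference list '
        "with title, journal, year, and DOI/PMID for each. After the list, rate your "
        "confidence (0-100%) that EACH reference is real and correctly described, "
        "and flag any you are unsure about.",
    ]
    return "\n\n".join(parts)


prompt = build_brief_prompt(
    role="a pain neuroscientist",
    audience="a graduate journal club",
    question="Which opioid-induced changes in receptors and circuits reverse "
             "with sustained abstinence, and which appear persistent?",
    sub_questions=[
        "receptor-level changes",
        "circuit/structural changes",
        "reversibility with abstinence",
    ],
)
print(prompt)
```

Keeping the grounding and self-check paragraphs fixed while swapping only the question and sub-questions makes it easier to compare results across teams.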

Prompting moves to try

Decompose before you retrieve. Split the big question into receptor-level, circuit-level, and reversibility sub-questions. Narrow sub-questions return real, specific papers; broad ones invite hand-wavy synthesis.
Assign an identity. “You are a systematic-review methodologist” or “you are a skeptical NIH study-section reviewer” changes what counts as good evidence and pushes the model toward meta-analyses and primary human studies over blog-grade claims.
Force the grounding. Explicitly say “use web search, cite only sources you can open right now, do not cite from memory, include DOI/PMID.” On search-capable tools this is the difference between retrieval and confabulation.
Make it grade its own bibliography. Ask the AI to assign each reference a confidence score that it is real and correctly summarized, and to flag the shakiest ones. The flagged refs are your verification to-do list.
Adversarial second pass. “Now argue the opposite conclusion using the same literature — what would a critic cite?” This catches cherry-picking and surfaces the conflicting findings that make a brief honest.
Ask it to improve your prompt. “Before answering, rewrite my prompt to get a more rigorous, better-sourced brief, then answer the improved version.” Cheap, and it often adds constraints you forgot.
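The self-grading move can be turned into a simple work queue. The sketch below assumes you have already copied the model's self-rated confidence for each reference into a dictionary by hand; the function name verification_order and the labels ref1 to ref4 are hypothetical. It only decides the order of checking: every reference still goes through the full checklist.

```python
def verification_order(self_ratings, flagged=()):
    """Return reference labels ordered shakiest-first.

    self_ratings: dict mapping a reference label to the model's
        self-rated confidence (0-100) that the reference is real
        and correctly described.
    flagged: labels the model explicitly said it was unsure about.
    """
    flagged = set(flagged)
    # Flagged references sort first (False < True), then lowest confidence.
    return sorted(
        self_ratings,
        key=lambda ref: (ref not in flagged, self_ratings[ref]),
    )


ratings = {"ref1": 95, "ref2": 60, "ref3": 90, "ref4": 75}
print(verification_order(ratings, flagged=["ref3"]))
# prints ['ref3', 'ref2', 'ref4', 'ref1']
```

A high self-rating is not evidence that a reference is real, so treat the ordering as triage and nothing more.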

Starter materials

Question set (pick one per team)

Q1 — Receptor remodeling. Does chronic opioid use cause measurable remodeling of mu-opioid receptors (density/availability) in the human brain, and over what timescale?
Q2 — Behavioral consequences. Does opioid-induced neural remodeling drive adverse psychological/behavioral outcomes (anhedonia, hyperalgesia, affective dysregulation)?
Q3 — Reversibility. Which opioid-induced changes in receptors and circuits reverse with sustained abstinence, and which appear persistent?
Q4 — Bonus. Is opioid-induced hyperalgesia a distinct phenomenon from tolerance and physical dependence, mechanistically and clinically?

Citation-verification checklist (run on EVERY reference)

Paste this into the shared doc and mark each box. A reference is “verified” only when all of checks 1-5 pass; check 6 applies to your conclusions as a whole rather than to any single reference.

Verify every citation

1. It exists. Search the exact title in PubMed / Google Scholar / the journal site. A real hit, not a near-match.
2. The IDs resolve. The DOI opens the actual paper; the PMID matches the same title and authors. (Hallucinated DOIs 404 or point to a different paper; this is the fastest tell.)
3. Authors + year + journal match what the AI wrote. Mismatched year or journal = treat as fabricated until proven otherwise.
4. It actually says what you claim. Open the abstract. Does it support the specific point you’re citing it for, in the right direction? (AIs often cite real papers for claims they don’t make.)
5. It’s the right kind of evidence. Human vs. rodent, sample size, study design. A 12-rat study should not be cited as if it settled a human question.
6. Triangulate the load-bearing claims. Your top 2-3 conclusions should each rest on more than one verified source.

Tally for the debrief: __ refs proposed · __ verified · __ fabricated/wrong · __ real-but-misdescribed.
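If a teammate prefers to keep the checklist in a script rather than a shared doc, a minimal sketch might look like the following. The class CitationCheck, its field names, and the bucketing rule are assumptions made for illustration: a reference counts as verified only when the first five checks all pass, and a paper that exists with resolving IDs and matching metadata but fails on content or evidence type is counted as real-but-misdescribed. The humans still do the checking; the code only records the results and fills in the tally.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CitationCheck:
    """One reference and the outcome of the five per-reference checks."""
    label: str
    exists: bool = False             # exact title found, not a near-match
    ids_resolve: bool = False        # DOI/PMID point to this same paper
    metadata_matches: bool = False   # authors, year, journal as the AI wrote
    claim_supported: bool = False    # abstract supports the cited point
    right_evidence: bool = False     # design/species/sample size fit the claim

    def verified(self):
        # Verified only when all five per-reference checks pass.
        return all([self.exists, self.ids_resolve, self.metadata_matches,
                    self.claim_supported, self.right_evidence])

    def bucket(self):
        # Assumed rule for sorting failures into the debrief categories.
        if self.verified():
            return "verified"
        if self.exists and self.ids_resolve and self.metadata_matches:
            return "real-but-misdescribed"
        return "fabricated/wrong"


def tally(checks):
    """Count references per debrief category."""
    counts = Counter(c.bucket() for c in checks)
    return {
        "proposed": len(checks),
        "verified": counts["verified"],
        "fabricated/wrong": counts["fabricated/wrong"],
        "real-but-misdescribed": counts["real-but-misdescribed"],
    }


checks = [
    CitationCheck("ref1", True, True, True, True, True),
    CitationCheck("ref2", True, True, True, False, True),
    CitationCheck("ref3"),  # nothing checked out
]
print(tally(checks))
```

Defaulting every check to False means a reference nobody has looked at yet is never counted as verified by accident.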

Brief template (drop your verified findings into this)

EVIDENCE BRIEF: [question]
Prepared by: [team] · Tool used: [Perplexity / Gemini / other] · Date: [ ]

BOTTOM LINE (3 sentences max):

WHAT THE EVIDENCE SHOWS
- Sub-question 1: [finding] [refs]
- Sub-question 2: [finding] [refs]
- Sub-question 3: [finding] [refs]

UNCERTAINTIES & CONFLICTING FINDINGS:

REFERENCES (verified only)
1. Authors. Title. Journal Year. DOI/PMID. [VERIFIED]
…

VERIFICATION LOG: __ proposed / __ verified / __ rejected

Debrief questions

Which tool surfaced the most real references — and which produced the most confident-sounding fake ones? What did the fakes have in common?
Of the references your tool proposed, what fraction survived the full checklist? Where did most failures happen — nonexistent papers, broken DOIs, or real-but-misdescribed?
Did the adversarial self-scoring actually flag the bad references, or did it miss them (or flag good ones)? Would you trust it as a triage step?
Did decomposing the question change the quality of retrieval versus asking it all at once?
Would you put this brief in front of your PI as-is? What’s the minimum extra work before you’d attach it to a grant?

Level up

Two-tool cross-check. Run the same question through a second search-grounded tool and keep only references both tools independently surface and that you can verify. Compare the bottom lines.
Build a reusable verifier prompt. Write a prompt that takes any AI-generated reference list and returns a per-citation verdict (exists / IDs resolve / claim supported), then test it on your own brief.
Find the conflict. Push the tool to locate a paper whose findings contradict your bottom line, verify it, and revise the “Uncertainties” box. A brief that acknowledges its weak spots is the more credible one.
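For the two-tool cross-check, the comparison step can be sketched as a set intersection keyed on DOI. The function names normalize_doi and cross_check are hypothetical, and the DOIs in the example are placeholders, not real papers. The sketch assumes each tool's reference list has been reduced to its DOI strings and that your team has separately recorded which DOIs passed the checklist.

```python
def normalize_doi(doi):
    """Reduce a DOI string to a bare, lowercase form for comparison.

    DOIs are case-insensitive, and tools may print them as URLs or
    with a "doi:" prefix, so strip those before comparing.
    """
    doi = doi.strip().lower()
    prefixes = (
        "https://doi.org/", "http://doi.org/",
        "https://dx.doi.org/", "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def cross_check(tool_a_dois, tool_b_dois, verified_dois):
    """Keep only DOIs that both tools surfaced AND that passed verification."""
    a = {normalize_doi(d) for d in tool_a_dois}
    b = {normalize_doi(d) for d in tool_b_dois}
    v = {normalize_doi(d) for d in verified_dois}
    return sorted(a & b & v)


# Placeholder DOIs for illustration only.
tool_a = ["https://doi.org/10.5555/EXAMPLE-A", "10.5555/example-b"]
tool_b = ["doi:10.5555/example-a", "10.5555/example-c"]
verified = ["10.5555/example-a", "10.5555/example-b"]
print(cross_check(tool_a, tool_b, verified))
# prints ['10.5555/example-a']
```

Matching on DOI rather than title avoids false mismatches from small differences in how each tool prints a title, but it will miss references that one tool cites by PMID only.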
