Reviewer 4

Stand up a multi-agent grant clinic that critiques your R01 like a study section.

Use agentic AI to run a full mock peer review of your own grant draft — scores, missing citations, study-section fit, and a candid line edit.

You wrote an R01. Now convene a study section that is always available, never territorial, and tells you exactly where your aims are most likely to be challenged — before the real one does.

35 min · Advanced · Claude Code / Codex · Teams of 3-5 · Agentic CLI

The goal

Take a real grant draft (the instructor’s sample, or your own) and walk away with an impact score, a written critique, and a prioritized revision list that you actually trust. Trust is the hard part: the output should tell you where it is confident, where it is guessing, and which fixes will move the score the most. The bar to aim for is a result you could hand to your PI and have them take seriously.

Why it matters

Study section is a black box that decides whether your science happens. Most trainees see it exactly once per submission, months too late, through three terse paragraphs from “Reviewer 4” — the one who clearly read a different grant. You cannot rehearse a panel. But you can simulate one.

A well-orchestrated agent doesn’t just proofread. It role-plays a skeptical immunologist who doubts your pain model translates, surfaces the 2023 paper you should have cited, notices that Aim 2 quietly depends on Aim 1 succeeding, and flags that “central sensitization” is doing too much load-bearing work in your Significance section. That is the feedback loop that turns a fundable idea into a funded one — and it is exactly the loop a busy mentor rarely has time to run line by line.

Run of show

0:00–0:05 · Challenge introduction (5 min)
0:05–0:20 · Work in your group (15 min)
0:20–0:22 · Post your best prompt (2 min)
0:22–0:32 · Share & debrief (10 min)
0:32–0:35 · Reset (3 min)

Bad prompt to better prompt

Weak prompt

Here is my grant. Review it and tell me if it’s good.

The model returns a polite book report: “This is a well-written proposal on an important topic.” It praises your enthusiasm, suggests you “add more detail,” and assigns no score. It has no persona, no rubric, no incentive to be hard on you — so it is reflexively nice. You learn nothing a friendly labmate wouldn’t tell you over coffee.

Strong prompt

You are a permanent member of an NIH study section reviewing this R01 under the NINDS portfolio. Read SOUL.md for your persona and TASKS.md for the rubric. Score Significance, Investigators, Innovation, Approach, and Environment each 1-9 (1=exceptional, 9=poor), then give an Overall Impact score. Be the harshest fair reviewer on the panel: assume the applicant is talented and the bar is the top 10th percentile. For every criticism, cite the exact sentence or aim it attacks. End with: (a) your three most score-limiting concerns, ranked, and (b) a confidence rating (low/med/high) for each, plus what additional info would raise your confidence.

It works because it assigns an identity with a stake, a scoring rubric the model must fill in, an adversarial standard (top decile, harshest-but-fair), a demand for evidence-anchored critique, and a self-calibration step so you know which complaints to take seriously.

Prompting moves to try

Decompose the review into agents. Don’t ask for “a review” — ask for five named reviewers (a mechanism person, a clinician, a statistician, a translational skeptic, a generalist) who each score independently, then a Chair who reconciles them into one impact score. Disagreement between agents is signal.
Role with a real address. “You are a permanent member of the Somatosensory & Pain Systems (SPS) study section” beats “you are a reviewer.” Ask it to name the most likely standing members and what each tends to care about, so you can pre-empt their hobbyhorses.
Adversarial self-evaluation. After the critique, have the model grade its own review: “Which of your criticisms are airtight vs. speculative? Rate your confidence 1-5 and flag anything you may have hallucinated, especially citations.” Then verify the flagged ones yourself.
Citation gap hunt. Ask specifically for “directly relevant work from the last 3 years that this draft fails to cite, and why a reviewer would expect it.” Treat every suggested reference as a lead to verify, not a fact.
Ask the AI to improve your prompt. “Before you start, rewrite my instructions to get a more rigorous review, then proceed with your improved version.” Models are often better at specifying the job than you are.
Separate diagnosis from surgery. Run critique first; only after you trust it, ask for line edits and a revised draft. Mixing them produces a rewrite that papers over the real problems.
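If the reviewer agents return their scores in a structured form, the Chair step from the first move can be checked mechanically before any prose synthesis. The sketch below is one way to do that, under stated assumptions: the `Review` class, the `reconcile` function, the use of the median, and the `spread_threshold` cutoff are all illustrative choices, not part of the challenge materials and not how a real study section combines scores.

```python
from dataclasses import dataclass
from statistics import median

# Criteria named in the strong prompt; 1 = exceptional, 9 = poor.
CRITERIA = ("Significance", "Investigators", "Innovation", "Approach", "Environment")


@dataclass
class Review:
    reviewer: str   # role label such as "statistician", not a real person
    scores: dict    # criterion name -> integer 1..9
    overall: int    # Overall Impact, 1..9


def validate(review: Review) -> None:
    """Raise ValueError if a score is missing or outside the 1-9 scale."""
    for name in CRITERIA:
        if name not in review.scores:
            raise ValueError(f"{review.reviewer}: missing score for {name}")
    for value in list(review.scores.values()) + [review.overall]:
        if not isinstance(value, int) or not 1 <= value <= 9:
            raise ValueError(f"{review.reviewer}: score {value!r} is not an integer 1-9")


def reconcile(reviews: list, spread_threshold: int = 2) -> dict:
    """Summarize a panel: median overall score plus criteria where reviewers split.

    spread_threshold is a hypothetical cutoff: a criterion is flagged when the
    gap between the best and worst score is at least this large.
    """
    for review in reviews:
        validate(review)
    summary = {
        "overall_median": median(r.overall for r in reviews),
        "disagreements": [],
    }
    for name in CRITERIA:
        values = [r.scores[name] for r in reviews]
        if max(values) - min(values) >= spread_threshold:
            summary["disagreements"].append((name, min(values), max(values)))
    return summary
```

The flagged criteria are the ones worth reading side by side: a wide spread usually means two reviewers read the same paragraph differently, which is the signal this move is trying to surface.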

Starter materials

Drop these two files in your working folder, point the agent at them, then feed it the grant. Built for Claude Code or Codex, but the contents work in any chat.
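For the "any chat" route, where there is no agent to read files from disk, the three pieces can be pasted as one labeled prompt. The following is a minimal sketch of that assembly. The function names, the `=== ... ===` delimiters, and the default grant file name `grant.md` are assumptions made for illustration; only `SOUL.md` and `TASKS.md` come from this page.

```python
from pathlib import Path


def build_prompt(persona: str, tasks: str, grant: str) -> str:
    """Join persona, checklist, and grant into one labeled prompt string."""
    sections = [
        ("SOUL.md (reviewer persona)", persona),
        ("TASKS.md (run order)", tasks),
        ("GRANT UNDER REVIEW", grant),
    ]
    # Labeled delimiters keep the three sources from blurring together.
    return "\n\n".join(f"=== {title} ===\n{body.strip()}" for title, body in sections)


def load_and_build(folder: str, grant_file: str = "grant.md") -> str:
    """Read the three files from one folder; grant_file is a hypothetical name."""
    root = Path(folder)
    return build_prompt(
        (root / "SOUL.md").read_text(encoding="utf-8"),
        (root / "TASKS.md").read_text(encoding="utf-8"),
        (root / grant_file).read_text(encoding="utf-8"),
    )
```

Calling `load_and_build(".")` from the working folder returns a single string that can be copied into a chat window. Keeping the grant last means the persona and rubric are read before the text they apply to.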

SOUL.md — the reviewer persona

SOUL.md — Reviewer persona

You are MORGAN REYES, MD/PhD, a permanent member of an NIH study section that reviews pain neuroscience R01s (think Somatosensory & Pain Systems). 18 years funded; two R01s and a U19. You have reviewed ~400 grants.

VOICE: Direct, dry, occasionally witty, never cruel. You respect the applicant’s time and intelligence. You assume the person is smart, so you do not explain basics — you pressure-test logic.

STANDARD: The payline is brutal. A “good idea” scores a 4. To get a fundable score the grant must be important AND feasible AND de-risked AND clearly written. You reward de-risking (preliminary data, alternatives, power) and punish hand-waving.

HABITS:
- You read the Aims page first and decide if you care. If Aim 2 depends on Aim 1, you say so.
- You hunt for the rate-limiting experiment and ask what happens if it fails.
- You are allergic to jargon used as proof (“central sensitization explains this”) and to clinical-significance claims unsupported by effect sizes.
- You always anchor a criticism to a specific sentence or figure.
- You distinguish what you KNOW from what you SUSPECT, and you say which is which.

RED LINES: You never fabricate citations or data. If you are unsure a reference exists, you say “verify this” rather than asserting it.

TASKS.md — the orchestration checklist

TASKS.md — Grant clinic run order

Work through these as separate, labeled sections. Do 1-3 before any rewriting.

1. FULL REVIEW. Adopt SOUL.md. Score Significance, Investigators, Innovation, Approach, Environment (1-9 each) and give an Overall Impact (1-9). Write a Strengths / Weaknesses critique per criterion, anchored to specific text.
2. REVIEW THE REVIEW. Re-read your own critique. Rate confidence (low/med/high) for each major point. Flag anything speculative or possibly hallucinated. List what info would raise your confidence.
3. STUDY SECTION FIT. Name the 1-2 best-fitting NIH study sections for this proposal and 3-5 standing-member archetypes likely to be assigned, with what each will scrutinize. (Verify section names; do not invent specific people.)
4. MISSING CITATIONS. List directly relevant work from the last ~3 years the draft should cite, with one line on why each matters. Mark each “verify.”
5. CONCEPT RISK. Flag terms/assumptions reviewers may not share or may object to (e.g., model translatability, biomarker validity). For each, give a one-sentence clarification to add.
6. LOGICAL FLOW. Map the argument from Significance to Aims. Note gaps, circularity, and aim dependencies. Suggest a reorder if it helps.
7. LINE EDIT. Copyedit the Specific Aims page line by line for clarity and concision. Show before/after.
8. (FOR FUN) REVISE + COMPLY. Produce a tightened Aims page addressing the top 3 concerns, and check formatting compliance (margins, font, page limits, required sections) against current NIH rules.
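Separately from the TASKS.md file itself: steps 3 and 4 ask the agent to mark uncertain items "verify", so once a run finishes those lines can be pulled into a checklist to work through by hand. This is a rough sketch that assumes the agent wrote each flagged item on its own line and used the literal word "verify". The function name and the `[ ]` checkbox format are illustrative, and the sketch only collects the items; it does not check whether any reference exists.

```python
import re


def verification_checklist(review_text: str) -> list:
    """Return every line of the review that contains the marker word 'verify'."""
    items = []
    for line in review_text.splitlines():
        stripped = line.strip()
        # Whole-word, case-insensitive match: catches "(verify)" and "VERIFY",
        # but not "verified".
        if re.search(r"\bverify\b", stripped, flags=re.IGNORECASE):
            items.append(f"[ ] {stripped}")
    return items
```

An empty checklist is not proof the review is clean. It may mean the agent did not follow the marking instruction, so spot-check a few unflagged citations as well.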

Sample target (use if you don’t have your own). Paste this stub as the grant under review — it has deliberate flaws for the agents to find.

SPECIFIC AIMS (draft). Chronic low back pain (cLBP) affects 80M U.S. adults and current opioids fail most of them. We have discovered that microglial TREM2 signaling drives central sensitization. We hypothesize that blocking TREM2 will reverse chronic pain. Aim 1: Using our novel mouse cLBP model, we will show TREM2 knockout reduces mechanical allodynia. Aim 2: We will use the validated biomarker from Aim 1 to stratify human cLBP patients in a pilot trial and show our TREM2 inhibitor works. Aim 3: We will perform fMRI to prove central sensitization is reduced. This work will cure chronic pain. Preliminary data: n=4 mice show a trend (p=0.08).

Debrief questions

Which criticism from the agent would actually change how you’d revise — and which was generic filler you could have written yourself? What made the difference in the prompt?
The model rated its own confidence. Where was it overconfident, and how would you catch that without already being an expert?
Did the multi-agent panel surface disagreements a single reviewer would have missed? Was the synthesis honest about them or did it smooth them over?
Every suggested missing citation is a hypothesis. What is your verification workflow before you’d add one to a real grant?
Where did the persona (SOUL.md) clearly change the output versus a default model? Which trait mattered most — the harsh standard, the evidence-anchoring, or the “say what you don’t know” rule?

Level up

Run a rebuttal loop. Have a second agent play the applicant writing the Introduction-to-Resubmission, responding to Reviewer Morgan’s critique. Then have Morgan re-score. Did your responses actually move the number?
Calibrate against ground truth. If you have a previously reviewed grant with real summary-statement scores, run the clinic blind and compare. How close was the simulated impact score, and where did it diverge?
Build a reusable panel. Turn SOUL.md and TASKS.md into a slash command / saved project so any draft can be dropped in and reviewed in one shot — your personal pre-submission study section.
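For the calibration exercise, the comparison itself is simple arithmetic: subtract the real score from the simulated one for each criterion and look at the average size of the gap. Here is a minimal sketch, assuming both sets of scores are on the same 1-9 scale and keyed by the same criterion names. The function name and the dictionary layout are hypothetical.

```python
def calibration_report(simulated: dict, actual: dict) -> dict:
    """Compare simulated and real scores criterion by criterion.

    Both inputs map criterion name -> score on the same scale (assumed 1-9,
    where a higher number is worse). A positive difference therefore means
    the simulated panel was harsher than the real one.
    """
    shared = [name for name in simulated if name in actual]
    if not shared:
        raise ValueError("no criteria in common")
    differences = {name: simulated[name] - actual[name] for name in shared}
    mean_abs = sum(abs(d) for d in differences.values()) / len(differences)
    return {"differences": differences, "mean_abs_difference": mean_abs}
```

A single grant gives one data point, so treat the result as a prompt for discussion, not a validation of the panel. The sign pattern (consistently harsher or consistently kinder) tends to be more informative than the average.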

Ethics

Responsible use: Never run this on a grant you have been assigned to review. NIH, CIHR, and journal peer review are confidential. Uploading someone else’s submission to an AI tool breaches that confidentiality, may violate your reviewer agreement, and can constitute research misconduct, even if you only use the AI to “help organize your thoughts.” This challenge is for improving your own drafts (or shared teaching samples) only. Two more rules: (1) Treat every AI-suggested citation, statistic, and study-section name as unverified until you confirm it; models hallucinate references with great confidence. (2) The judgment stays yours. Use the panel to find blind spots, not to outsource the thinking that makes the science yours.
