Picture This

Rescue an ugly data table and turn it into the most informative figure in the room.

A core live challenge: prompt an AI to turn a messy pain-science data table into a genuinely informative figure — right chart type, error bars, colorblind-safe palette, honest labels.
[Figure: Pain relief by group (95% CI). Y-axis: ΔVAS (cm). Groups: Drug, Placebo.]

You inherit a messy data table. Your job: get an AI to make it understandable, not just pretty.

Recommended core challenge · 35 min · Beginner-friendly · Teams of 3-5 · Any chat AI

The goal

Starting from the messy table in Starter materials, produce the single most informative figure a reader could glance at and immediately understand. “Informative” means: the right chart type for this question, honest treatment of the missing-value codes, error/uncertainty shown, a colorblind-safe palette, real axis labels with units, and a title that states the finding. A figure that is gorgeous but misleading loses. We are deliberately not telling you which chart or which tool sequence to use; that discovery is the point.

Why it matters

Much of peer review is really an argument about figures. Reviewers often form an opinion of a study in the few seconds it takes to read Figure 2 — before they read a word of the methods. A clear figure of a VAS pain score across treatment arms can carry a paper; a 3D pie chart with no error bars can undermine one. And the messy table you are about to meet is not a strawman: real pain datasets arrive with -999 missing codes, mixed units, cryptic column names, and a free-text “notes” column. Learning to steer an AI from that raw input to an honest, publishable figure is a skill you will use on your next manuscript, grant figure, or journal-club slide.

Run of show

  • 0:00–0:05 · Challenge introduction (5 min)
  • 0:05–0:20 · Work in your group (15 min)
  • 0:20–0:22 · Post your best prompt (2 min)
  • 0:22–0:32 · Share & debrief (10 min)
  • 0:32–0:35 · Reset (3 min)

Bad prompt to better prompt

Weak prompt
Make a nice graph from this data.

Why it disappoints: the AI guesses everything. It plots the -999 missing codes as real values (each arm has one, so both group means are dragged into impossible negative numbers), invents a chart type, drops the units, defaults to a red/green palette that roughly 8% of men can’t distinguish, and titles it “Chart.” It looks tidy and is quietly wrong.

Strong prompt
You are a data-viz editor for a pain-medicine journal. Here is a messy trial table (pasted below). First, tell me back the data dictionary as YOU interpret it — what each column means, the units, and how you’ll handle the missing-value code -999 — and STOP for my confirmation before plotting. Then propose the single most informative chart to compare post-treatment pain between the Drug and Placebo groups, and explain in one line why that chart type beats the alternatives. Build it with: group means with 95% confidence intervals, individual participant points jittered behind the bars, a colorblind-safe palette (Okabe-Ito), axis labels with units, and a title that states the actual finding. Use cm for VAS throughout and convert any rows recorded in mm. Flag any row you had to drop and say why.

Why it works: it assigns a role, forces the AI to surface its assumptions before committing (catching the -999 and the mm/cm mix), specifies the comparison rather than “a graph,” and pins down every element that separates an honest figure from a pretty one — uncertainty, raw points, accessible color, units, and a title that earns its place.
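To check the “group means with 95% confidence intervals” that the strong prompt asks for, the arithmetic can be redone by hand. The sketch below is illustrative only. It assumes the change scores (week 6 minus baseline, in cm) have already been derived from the starter table, with the two -999 rows excluded and the mm rows divided by 10. The names CHANGE_CM, T_CRIT_95, and mean_ci95 are made up for this example, the t critical values are standard two-sided 95% table entries, and the two Okabe-Ito colors are just one reasonable pairing.

```python
from math import sqrt
from statistics import mean, stdev

# Assumed input: change in VAS (cm), week 6 minus baseline, after cleaning.
CHANGE_CM = {
    "Drug": [-3.1, -4.2, -2.9, -4.8, -3.7, -3.4, -3.4],
    "Placebo": [-0.5, -1.4, -0.4, -0.2, -0.6, -0.2, -0.6],
}

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}

# Two colors from the Okabe-Ito palette (blue and orange).
OKABE_ITO = {"Drug": "#0072B2", "Placebo": "#E69F00"}


def mean_ci95(values):
    """Return (mean, lower, upper) for a t-based 95% confidence interval."""
    n = len(values)
    m = mean(values)
    sem = stdev(values) / sqrt(n)      # standard error of the mean
    half = T_CRIT_95[n - 1] * sem      # CI half-width = t * SEM
    return m, m - half, m + half


for group, values in CHANGE_CM.items():
    m, lo, hi = mean_ci95(values)
    print(f"{group}: {m:.2f} cm (95% CI {lo:.2f} to {hi:.2f}), color {OKABE_ITO[group]}")
```

With seven participants per arm the CI half-width is about 2.4 times the SEM, which is why a figure that labels SEM bars as confidence intervals looks far more certain than the data allow.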

Prompting moves to try

  • Decompose the goal. Split “make a figure” into stages — (1) infer the data dictionary, (2) clean missing codes and units, (3) choose the chart, (4) render it — and make the AI pause after stage 1 for your sign-off. Most bad figures are bad because step 1 was skipped.
  • Role / identity prompting. “You are a reviewer for Pain” or “You are the figure editor at Nature Neuroscience” pulls in conventions (CIs not SEMs, accessible palettes, no chartjunk) you’d otherwise have to enumerate.
  • Adversarial self-evaluation. After the first draft: “Critique this figure as a harsh Reviewer 2. List three ways it could mislead a reader, then rate your confidence (0-100%) that the missing data were handled correctly.” Make the AI attack its own work before you do.
  • Ask the AI to improve your prompt. “Here’s my prompt and the figure it produced. Rewrite my prompt so the next attempt is more informative and harder to misread.” Let the AI upgrade its own instructions.
  • Force the chart-type defense. “Give me three chart types for this comparison, rank them for informativeness (not beauty), and justify the winner.” Bar-with-CI vs. raincloud vs. paired-slope is a real decision — make it explicit.
  • Demand a units & exclusions log. “List every transformation you applied and every row you excluded, with the reason.” That log is most of your methods paragraph.

Starter materials

Paste this entire block into your AI. It is deliberately messy — mixed column names, two units of measurement, a -999 missing code, a stray text entry, and inconsistent group labels. Cleaning it up is part of the challenge.

Study: randomized double-blind trial of a novel mu-opioid-sparing analgesic vs placebo for chronic low-back pain. Outcome = change in pain from baseline to week 6. Higher pain score = worse. Some sites recorded VAS in cm (0-10), one site logged in mm (0-100). Missing/withdrawn = -999.

subj_ID,Grp,age_yrs,sex,VAS_base,VAS_wk6,unit,opioid_MME_day,notes
P001,drg,54,F,7.2,4.1,cm,0,
P002,Placebo,61,M,6.8,-999,cm,15,withdrew wk3
P003,DRG,47,f,8.1,3.9,cm,0,
P004,pbo,39,M,5.9,5.4,cm,20,
P005,drug,66,F,7.7,4.8,cm,5,“felt better, mostly”
P006,Placebo,52,m,90,76,mm,30,
P007,drg,58,F,6.4,-999,cm,0,lost to follow-up
P008,pbo,44,M,7.0,6.6,cm,25,
P009,Drug,71,F,81,33,mm,0,
P010,placebo,49,M,6.1,5.9,cm,18,
P011,drg,63,f,7.9,4.2,cm,0,
P012,PBO,55,M,6.7,6.1,cm,22,
P013,drug,41,F,8.4,5.0,cm,10,
P014,Placebo,68,M,70,68,mm,28,
P015,drg,50,F,7.1,3.7,cm,0,protocol deviation?
P016,pbo,57,M,6.9,6.3,cm,24,

Landmines hidden on purpose (don’t tell your AI — see if it catches them):

  • Group labels are a mess: drg / DRG / drug / Drug all mean Drug; pbo / PBO / Placebo / placebo all mean Placebo.
  • Two units: rows P006, P009, P014 are in mm (0-100); everything else is cm (0-10). They must be reconciled before plotting.
  • -999 is missing data, not a pain score of negative 999.
  • The real outcome is a change score (VAS_wk6 - VAS_base), more negative = more relief — not the raw week-6 value.
  • opioid_MME_day (morphine milligram equivalents) is a tempting secondary variable for a “level up.”
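To judge whether an AI defused these landmines correctly, it helps to have a reference answer. The following is a minimal sketch in plain Python, not an official solution. The function name clean, the label map GROUPS, and the decision to exclude (rather than impute) rows carrying -999 are assumptions made for illustration. It embeds its own copy of the starter table.

```python
import csv
import io

RAW = """subj_ID,Grp,age_yrs,sex,VAS_base,VAS_wk6,unit,opioid_MME_day,notes
P001,drg,54,F,7.2,4.1,cm,0,
P002,Placebo,61,M,6.8,-999,cm,15,withdrew wk3
P003,DRG,47,f,8.1,3.9,cm,0,
P004,pbo,39,M,5.9,5.4,cm,20,
P005,drug,66,F,7.7,4.8,cm,5,“felt better, mostly”
P006,Placebo,52,m,90,76,mm,30,
P007,drg,58,F,6.4,-999,cm,0,lost to follow-up
P008,pbo,44,M,7.0,6.6,cm,25,
P009,Drug,71,F,81,33,mm,0,
P010,placebo,49,M,6.1,5.9,cm,18,
P011,drg,63,f,7.9,4.2,cm,0,
P012,PBO,55,M,6.7,6.1,cm,22,
P013,drug,41,F,8.4,5.0,cm,10,
P014,Placebo,68,M,70,68,mm,28,
P015,drg,50,F,7.1,3.7,cm,0,protocol deviation?
P016,pbo,57,M,6.9,6.3,cm,24,
"""

MISSING = -999.0
# Assumed mapping from lowercased raw labels to canonical group names.
GROUPS = {"drg": "Drug", "drug": "Drug", "pbo": "Placebo", "placebo": "Placebo"}


def clean(raw):
    """Return (kept rows, log of exclusions and transformations)."""
    kept, log = [], []
    reader = csv.reader(io.StringIO(raw))
    next(reader)  # skip the header row
    for fields in reader:
        if not fields:
            continue
        subj, grp, _age, _sex, base, wk6, unit, mme = fields[:8]
        # Curly quotes are not CSV quote characters, so a comma inside a
        # note splits it into extra fields; re-join everything after column 8.
        notes = ",".join(fields[8:]).strip()
        group = GROUPS[grp.strip().lower()]
        base, wk6 = float(base), float(wk6)
        if MISSING in (base, wk6):
            log.append((subj, f"excluded: missing code -999 ({notes or 'no note'})"))
            continue
        if unit.strip().lower() == "mm":
            base, wk6 = base / 10, wk6 / 10
            log.append((subj, "converted mm to cm (divided by 10)"))
        kept.append({"id": subj, "group": group,
                     "change_cm": round(wk6 - base, 2), "mme": float(mme)})
    return kept, log


kept, log = clean(RAW)
for group in ("Drug", "Placebo"):
    changes = [row["change_cm"] for row in kept if row["group"] == group]
    print(group, "n =", len(changes), "mean change (cm) =", round(sum(changes) / len(changes), 2))
for subj, reason in log:
    print(subj, reason)
```

The missing-code check runs before the unit conversion on purpose: dividing first would turn a -999 recorded in an mm row into -99.9 and let it slip past the test. The log doubles as the units and exclusions record that the debrief asks for.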

Quick scoring rubric for the debrief (1 point each, 6 max):

  • Correct chart type for a two-group comparison with uncertainty
  • Missing-value code handled (excluded or imputed, and stated)
  • Units reconciled to a single scale
  • Colorblind-safe palette
  • Axis labels with units + finding-stating title
  • Change score used, not raw week-6 pain

Debrief questions

  • Which single instruction moved the figure the most — and which group’s prompt got there fastest?
  • Did your AI catch the -999 and the mm/cm mix on its own, or only after you forced it to state assumptions? What does that tell you about trusting silent output?
  • Bar-with-CI, raincloud, or paired-slope: which chart type best honors this design, and did any AI argue for one you hadn’t considered?
  • Where did the AI confidently do something wrong? How would a reader have ever known?
  • If you handed this figure to Reviewer 2 right now, what’s the first thing they’d attack?

Level up

  • Add a covariate honestly. Ask for a second panel relating baseline opioid use (opioid_MME_day) to relief — without implying causation the design can’t support.
  • Reproducibility test. Have the AI emit the actual plotting code (R/ggplot2 or Python/matplotlib) and a one-paragraph figure legend, then re-run it to confirm the figure matches the description.
  • Stress-test for honesty. Ask the AI to produce the most misleading defensible version of the same figure (truncated y-axis, SEM disguised as CI, dropped withdrawals), then explain how a reader would spot each trick. Knowing the common distortions makes you a better reviewer.
