Reviewing scientific papers with AI: Part 4
Why expert human peer review won't be going away any time soon.
In the previous three posts on this topic, I introduced the idea of AI-assisted peer review and the concept of meta-review using multiple LLMs, and described a new tool for performing LLM meta-review. In this post I will show an example of the tool in use, and compare its results to those from human reviewers on the same paper.
Please read the entire thing to the end, but here is a TL;DR to help frame things: AI tools are very good at creating simulacra of peer reviews, but in this example at least, they were not very good at surfacing the deeper issues that the human reviewers identified.
Automated review of our preprint
In the spirit of dogfooding, I downloaded a recent preprint from our lab and ran the review tool on it:
❯ mkdir -p papers
❯ wget https://www.biorxiv.org/content/10.1101/2025.01.14.633053v2.full.pdf -P papers
❯ ai-peer-review review papers/2025.01.14.633053v2.full.pdf
Output will be saved to: papers/2025.01.14.633053v2.full
Processing paper: papers/2025.01.14.633053v2.full.pdf
Selected models: gpt4-o1, gpt4-o3-mini, claude-3.7-sonnet, gemini-2.5-pro, deepseek-r1, llama-4-maverick
Processing models: gpt4-o1, gpt4-o3-mini, claude-3.7-sonnet, gemini-2.5-pro, deepseek-r1, llama-4-maverick
Review from gpt4-o1 saved to papers/2025.01.14.633053v2.full/review_gpt4-o1.md
Review from gpt4-o3-mini saved to papers/2025.01.14.633053v2.full/review_gpt4-o3-mini.md
Review from claude-3.7-sonnet saved to papers/2025.01.14.633053v2.full/review_claude-3.7-sonnet.md
Review from gemini-2.5-pro saved to papers/2025.01.14.633053v2.full/review_gemini-2.5-pro.md
Review from deepseek-r1 saved to papers/2025.01.14.633053v2.full/review_deepseek-r1.md
Review from llama-4-maverick saved to papers/2025.01.14.633053v2.full/review_llama-4-maverick.md
Generating meta-review...
Meta-review saved to papers/2025.01.14.633053v2.full/meta_review.md
Concerns table saved to papers/2025.01.14.633053v2.full/concerns_table.csv
All results saved to papers/2025.01.14.633053v2.full/results.json
Here is the meta-review. Note that the models are referred to by codenames during the actual meta-review process (to avoid biases for or against particular models in the meta-review), but I have replaced the codenames with the model names here to make it more readable. The codename-to-model mapping is stored in the results.json file that is saved alongside the reviews, in a folder named after the input file; a short sketch of recovering the mapping is shown below.
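If you want to de-anonymize the codenames yourself, something like the following works. This is a minimal sketch: the exact structure of results.json (the "model_mapping" key used here) is an assumption for illustration, so check the actual file before relying on it.

```python
import json
from pathlib import Path

# The output folder is named after the input file.
outdir = Path("papers/2025.01.14.633053v2.full")

# Load the saved results; the "model_mapping" key (codename -> model name)
# is a hypothetical structure used here for illustration.
results = json.loads((outdir / "results.json").read_text())
mapping = results["model_mapping"]

# Substitute codenames with model names in the meta-review text.
meta_review = (outdir / "meta_review.md").read_text()
for codename, model_name in mapping.items():
    meta_review = meta_review.replace(codename, model_name)

(outdir / "meta_review_named.md").write_text(meta_review)
```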
Meta-Review of "Unintended bias in the pursuit of collinearity solutions in fMRI analysis"
This meta-review synthesizes the feedback from six reviewers (gpt4-o1, gpt4-o3-mini, claude-3.7-sonnet, gemini-2.5-pro, deepseek-r1, and llama-4-maverick) on the research article "Unintended bias in the pursuit of collinearity solutions in fMRI analysis." The paper investigates biases arising from common fMRI modeling strategies aimed at reducing collinearity, particularly in the context of the Monetary Incentive Delay (MID) task from the ABCD study. It proposes a "Saturated" model as a more robust alternative.
Common Points Raised Across Reviewers:
Overall, the reviewers found the study to be timely, important, and well-executed, addressing a significant methodological challenge in fMRI analysis. There was general agreement on several strengths:
Importance of the Problem: All reviewers acknowledged the critical nature of addressing collinearity and potential biases in fMRI modeling (gpt4-o1, gpt4-o3-mini, claude-3.7-sonnet, gemini-2.5-pro, deepseek-r1, llama-4-maverick).
Proposed Solution (Saturated Model): The "Saturated" model, which includes all task components and their durations, was generally seen as a principled and promising approach to minimizing bias (gpt4-o1, gpt4-o3-mini, claude-3.7-sonnet, gemini-2.5-pro, deepseek-r1, llama-4-maverick).
Methodology: The use of both simulations and real ABCD data to demonstrate the issues and the proposed solution was commended (gpt4-o1, claude-3.7-sonnet, gemini-2.5-pro, deepseek-r1).
Novel Contribution (cVIF): The introduction of the "contrast VIF" (cVIF) was highlighted as a valuable methodological contribution for assessing collinearity's impact on specific contrasts of interest (gpt4-o1, claude-3.7-sonnet, gemini-2.5-pro, llama-4-maverick).
Despite these strengths, several common concerns were also raised:
Generalizability of Findings: A frequent point of concern was the study's primary focus on the MID task within the ABCD dataset. Reviewers gpt4-o1, gpt4-o3-mini, and deepseek-r1 questioned how readily the findings and the Saturated model could be generalized to other fMRI tasks, designs, and datasets with different characteristics.
Realism of Simulation Noise Model: Reviewers gpt4-o1, claude-3.7-sonnet, and deepseek-r1 pointed out that the simulations assumed independent Gaussian noise, which is not fully representative of fMRI data's temporal autocorrelation. This simplification could affect the generalizability of simulation results.
Saturated Model Complexity and Power: The increased complexity of the Saturated model, with its larger number of regressors, led to concerns about potential overfitting (gpt4-o1) and statistical power or efficiency trade-offs (gpt4-o1, gemini-2.5-pro, deepseek-r1). While the authors suggest modest efficiency differences, a more thorough discussion was desired.
Assumption of Constant Neuronal Activation: The Saturated model's use of boxcar functions assumes constant neuronal activation across the duration of each task component. Reviewers gpt4-o1, claude-3.7-sonnet, gemini-2.5-pro, and llama-4-maverick noted this as a potential oversimplification that might not capture the true temporal dynamics of neural responses.
Clarity and Accessibility: Some reviewers (gpt4-o1, gpt4-o3-mini, deepseek-r1) found the paper highly technical, suggesting that simplifying the presentation of key concepts could improve its accessibility to a broader audience.
Specific Concerns and Major Concerns:
Beyond the common themes, reviewers raised specific points, some of which constitute major concerns for the manuscript.
Major Concerns:
Omission of HRF Derivatives in the Saturated Model (gemini-2.5-pro): Reviewer gemini-2.5-pro raised a significant concern that the Saturated model, by omitting temporal derivatives to reduce VIFs, might introduce a different form of mismodeling bias if the canonical HRF is not a perfect fit. This trade-off between reducing omitted variable bias and potential HRF mismodeling bias needs explicit discussion.
Generalizability of Findings (gpt4-o1, gpt4-o3-mini, deepseek-r1): As mentioned above, the limited scope (MID task, ABCD data, Siemens scanners for real data as noted by deepseek-r1) is a major concern regarding the broader applicability of the proposed model and conclusions.
Realism of Simulation Noise Model (gpt4-o1, claude-3.7-sonnet, deepseek-r1): The use of independent Gaussian noise in simulations is a major limitation that could impact the validity of the demonstrated bias magnitudes and model performance.
Saturated Model Complexity, Power, and Efficiency (gpt4-o1, gemini-2.5-pro, deepseek-r1): The potential downsides of the Saturated model's complexity, including overfitting and reduced statistical power for certain contrasts, require a more comprehensive analysis and discussion.
Assumption of Constant Neuronal Activation / HRF Shape (gpt4-o1, claude-3.7-sonnet, gemini-2.5-pro, llama-4-maverick): The reliance on boxcar regressors assuming constant activation is a key simplification. The paper should more thoroughly acknowledge this limitation and its potential impact on accurately modeling the BOLD response.
Interpretation of Baseline in Saturated Model (gpt4-o1, gemini-2.5-pro): The definition and interpretation of "baseline" activity in the Saturated model, especially in tasks without explicit ITIs, needs clearer articulation and guidance.
Practical Significance of Identified Biases (deepseek-r1): Reviewer deepseek-r1 questioned whether the magnitude of the biases demonstrated by the traditional ABCD model is practically or clinically significant enough to alter substantive conclusions, a point that warrants discussion.
Validation and Implementation Details of cVIF (gpt4-o1, gpt4-o3-mini, deepseek-r1): While novel, the cVIF metric requires more thorough validation, clearer implementation details (gpt4-o1, gpt4-o3-mini), and potentially more examples of its utility (deepseek-r1).
Scanner-Specific Real Data Analysis (deepseek-r1): The real data analysis being limited to Siemens scanners (due to GE timing issues) was highlighted by deepseek-r1 as a factor limiting the generalizability of the real-data findings.
Other Specific Concerns:
Clarity of Hypothesis and Objectives (gpt4-o3-mini): Reviewer gpt4-o3-mini suggested that a more focused statement of the research questions would enhance clarity.
Methodological Rigor in Simulations (gpt4-o3-mini): This reviewer requested more detailed descriptions and justifications for simulation parameters.
Critical Interpretation of Results (gpt4-o3-mini): This reviewer felt the discussion could be more critical in interpreting the implications and potential limitations of the Saturated model.
Omission of Relevant Literature (gpt4-o3-mini): This reviewer suggested strengthening the literature review with more recent fMRI modeling advancements.
Visual Representation of Data (gpt4-o3-mini): Improvements to figure clarity were recommended by gpt4-o3-mini.
Handling of "Miss" Trials (gemini-2.5-pro): Delta noted that collapsing different types of "Miss" trials (e.g., "Too Soon" vs. "Too Slow") in the Saturated model is a simplification that could be acknowledged.
Clarity on ABCD Timing File Corrections (gemini-2.5-pro): Delta requested clarification on the relevance of GE scanner timing corrections to the specific sample used in the paper.
cVIF Calculation with Derivatives (gemini-2.5-pro): Clarification was sought by gemini-2.5-pro on how derivative terms are handled in cVIF calculations for models that include them.
Conclusion:
The reviewers collectively agree that the paper addresses a critical issue in fMRI analysis and proposes a valuable advancement with the Saturated model and cVIF. However, to strengthen the manuscript and ensure the robustness and broad applicability of its findings, the authors should address the major concerns outlined above, particularly regarding the generalizability of the Saturated model, the realism of simulations, the trade-offs associated with model complexity (including the omission of derivatives), the assumptions about HRF shape, and the practical impact of the identified biases. Further validation and clarification of the cVIF metric would also be beneficial.
Comparison with human reviewers
As it turns out, we have a set of reviews from human experts on this paper. I submitted these to Gemini 2.5 Pro and asked it to summarize them and compare them to the AI reviews, using the following prompt:
the following are comments on the same paper from three human peer reviewers. Please summarize these comments and outline specific major concerns raised by any of the reviewers. Then, generate a summary that compares these three reviews with the previous set of reviews. Focusing on major concerns, describe which concerns were common between the two sets of reviews and which were only raised by one of the two sets of reviews.
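For anyone who wants to script this step rather than paste into a chat window, a minimal sketch using the google-generativeai Python client might look like the following. The file names are placeholders, and passing the earlier meta-review alongside the human reviews in a single call is just one way to give the model the "previous set of reviews"; I am not claiming this is how the comparison above was actually run.

```python
import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")  # assumes an API key is available
model = genai.GenerativeModel("gemini-2.5-pro")  # model id may differ

# Hypothetical file containing the three human reviews pasted together.
human_reviews = Path("human_reviews.txt").read_text()
ai_meta_review = Path("papers/2025.01.14.633053v2.full/meta_review.md").read_text()

prompt = (
    "the following are comments on the same paper from three human peer reviewers. "
    "Please summarize these comments and outline specific major concerns raised by any of "
    "the reviewers. Then, generate a summary that compares these three reviews with the "
    "previous set of reviews. Focusing on major concerns, describe which concerns were "
    "common between the two sets of reviews and which were only raised by one of the two "
    "sets of reviews."
)

response = model.generate_content([prompt, human_reviews, ai_meta_review])
print(response.text)
```

Gemini's summary and comparison follow.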
Okay, here's a summary of the three human peer reviews, an outline of their major concerns, and then a comparison with the previous set of (AI-generated) reviews.
Summary of Human Peer Reviews & Major Concerns
The three human peer reviewers offer a mix of appreciation for tackling an important issue and significant critiques regarding the paper's methodology, novelty, and implications.
Overall Sentiment:
Reviewer 1: Expresses ambivalence. Acknowledges the importance of correct trial modeling and suspects widespread misinterpretation in published fMRI. However, questions the methodological novelty beyond a VIF generalization, suggesting the core message ("model all plausible causes") is obvious and perhaps better suited for a less formal outlet (blog/notebook).
Reviewer 2: Finds the paper well-written but believes it could be significantly improved by better connecting simulation and experimental results, restructuring for focus, and addressing the profound implications of its findings for prior ABCD research.
Reviewer 3: Recognizes the manuscript effectively highlights prevalent issues in MID task modeling but raises concerns about internal consistency in the authors' arguments (e.g., handling of derivatives vs. other components), practical feasibility vs. theoretical rigor, potential overstatement of claims, and the validity/interpretation of simulations.
Specific Major Concerns Raised by Human Reviewers:
Fundamental Modeling of Neural Activity vs. Stimulus/Response Durations (Reviewer 1, Reviewer 3):
R1: Strongly argues that the paper conflates stimulus/RT duration with the duration of neural activity. Neural responses are more complex (e.g., transient onsets, sustained components) and may vary by brain region. Suggests using more flexible neural basis sets (e.g., delta functions for onsets + boxcars for sustained activity) rather than assuming a single boxcar matching stimulus duration is sufficient or always appropriate.
R3: Questions if using RT as the duration for the response task makes a separate "probe" regressor redundant. Also echoes the point that if cue/fixation neural activity isn't constant, the same could apply to probe/feedback.
Handling and Interpretation of Temporal Derivatives (Reviewer 1, Reviewer 3):
R1: Disagrees with the paper's assertion that derivatives "serve no purpose" if not used for non-linear amplitude correction. Argues they are still useful for capturing HRF variability via linear F-contrasts.
R3: Points out an apparent inconsistency: the authors critique omitting task components but themselves advocate omitting derivatives due to collinearity. Questions if using only the canonical HRF (without derivatives) might introduce its own biases, potentially more than alternative methods.
Novelty and Scope of Contribution (Reviewer 1):
Questions whether the paper presents significant methodological advancements, as the effect of correlated regressors is well-known. Suggests the take-home message ("model every theoretically-plausible cause") is obvious.
Connection Between Simulations and Experimental Results (Reviewer 2, Reviewer 3):
R2: Finds a disconnect. Cohen's d values differ markedly between simulation (small) and real data (large). Simulations should more clearly explain specific experimental findings. Different contrasts were used, making direct comparison difficult.
R3: Questions the validity and necessity of simulations, especially if the authors' own model was used to generate the ground truth, which could make comparisons with alternative models circular. Requests clarification on simulation parameters.
Implications for Prior ABCD Research and Strength of Recommendations (Reviewer 2):
Given the reported major differences and the influence of ABCD, the paper needs to address whether prior ABCD MID task publications reported incorrect conclusions and if corrections/retractions are warranted. Finds the paper's recommendations too weak and suggests a clearly stated official ABCD recommendation is needed.
Internal Consistency and Overstated Claims (Reviewer 3):
Highlights apparent inconsistencies in the authors' arguments (e.g., stance on omitting regressors vs. derivatives). Believes claims like the model "corrects all biases" or "accounts for all task components" are overly broad.
Practical Feasibility vs. Theoretical Rigor (Reviewer 3):
Questions if some modeling suggestions are practical or if they introduce new issues (e.g., single regressors for probe/RT might miss subtle cue-type differences). The trade-off in modeling fixation (which constrains cue modeling to only canonical HRF) needs more examination.
Paper Structure and Clarity (Reviewer 2):
Suggests the paper is too long and would benefit from substantial restructuring to improve focus and readability.
Comparison with Previous (AI-Generated) Reviews
Here's a comparison focusing on major concerns:
Major Concerns Common to Both Sets of Reviews (Human & AI):
Simplistic Assumptions about Neural Activity Duration / HRF Modeling:
AI Reviews: Frequently noted that the Saturated model assumes constant neural activation across stimulus duration, which might be an oversimplification. Also raised concerns about omitting temporal derivatives, potentially leading to HRF mismodeling.
Human Reviews (R1, R3): Dive deeper into this. R1 strongly argues for the dissociation between stimulus/RT duration and *actual neural activity patterns* (transient, sustained), advocating for more flexible basis sets. R3 questions the consistency of omitting derivatives while critiquing other omissions and the potential bias of relying solely on the canonical HRF.
Comparison: Both sets flag this as a core issue with the proposed model's assumptions. Humans provide more nuanced neurophysiological arguments and point out inconsistencies.
Validity/Interpretation of Simulations:
AI Reviews: A very common major concern was the use of independent Gaussian noise, not reflecting fMRI's temporal autocorrelation.
Human Reviews (R2, R3): R2 points to a disconnect between simulation effect sizes/contrasts and real data. R3 questions the fundamental validity if the authors' model generated the ground truth (circularity).
Comparison: Both sets critically examine the simulations. AI focuses on the noise model, while humans focus on the explanatory power, link to real data, and potential circularity of the simulation design.
Major Concerns Primarily or More Strongly Emphasized by Human Reviews:
Novelty and Fundamental Contribution (R1): The concern that the paper might not offer significant new methodological insights beyond well-established GLM principles, and that the core message is "obvious," was a unique and fundamental critique from a human reviewer.
Implications for Prior Published Research & Strength of Recommendations (R2): The call for the paper to address potential errors in existing ABCD literature and to provide stronger, official recommendations is a high-impact concern specific to the human reviews.
Internal Consistency of Arguments (R3): The detailed critique of apparent contradictions within the authors' reasoning (e.g., criteria for omitting regressors) was more pronounced in human reviews.
Deeper Neurophysiological Modeling Considerations (R1, R3): Human reviewers (especially R1) provided more sophisticated arguments about modeling different types of neural responses (e.g., onset vs. sustained) and the potential redundancy of certain regressors (probe/RT) from a cognitive neuroscience perspective.
Paper Structure and Length (R2): This direct critique of the manuscript's organization was specific to human reviews.
Overstated Claims (R3): The concern that the authors overstate the capabilities of their model (e.g., "corrects all biases") was a human-specific critique.
Major Concerns Primarily or More Strongly Emphasized by AI Reviews (or where emphasis differed):
Generalizability (Beyond MID Task, Across Scanners): While human R1 touched on the "obviousness" for any design, the AI reviews more directly and repeatedly flagged the need to demonstrate the Saturated model's utility beyond the MID task and across different scanner types as a specific limitation requiring empirical testing.
cVIF Validation: AI reviews consistently asked for more thorough validation of the cVIF's statistical properties and implementation details. Human R2 had minor points on cVIF naming/definition but didn't flag validation as a major concern with the same intensity.
Model Complexity vs. Statistical Power Trade-off: AI reviews more explicitly worried about potential overfitting from increased model complexity and the impact on statistical power, even if efficiency differences were modest. Human R3 touched on this by questioning if single regressors for probe/RT could capture subtle differences.
Definition of "Baseline": The AI reviews specifically asked for clarity on the definition of baseline in the Saturated model, which was a less prominent point in the human reviews.
In Summary:
Both sets of reviews identified core issues related to the modeling of neural activity/HRF and the simulations. Human reviewers, however, brought in broader concerns about the paper's fundamental novelty, its responsibilities regarding existing literature (especially for a large study like ABCD), the internal consistency of its arguments, and provided deeper neurophysiological critiques of the modeling choices. The AI reviews were more focused on generalizability to other tasks/scanners and the statistical validation of new metrics like cVIF, as well as the implications of model complexity. The human reviews, particularly R1 and R2, also questioned the paper's overall framing and impact more directly.
Which reviews are more useful for the author?
Because we are in the middle of revising this paper, the human reviews were fresh in our minds, so I asked Jeanette Mumford (the first author) to comment on the AI reviews in relation to the human reviews. Here are her comments (thanks to Jeanette for allowing me to post these comments!):
To put it in context: I genuinely value good constructive criticism, not because I doubt my abilities, but because thoughtful feedback always makes the work better. We all operate in our own little research bubbles, and outside perspective helps. Responding to reviewer comments often feels like moving through the stages of grief, with a surprise ending of reluctant gratitude and a stronger paper. If you’re lucky, your reviewer is just outside your bubble but close enough to really get it. I rarely enjoy the process as I’m moving through it, but it feels great when I can read my final responses without cringing.
In short, the AI review did not provide this experience, while human reviewers did. I grouped the AI comments as follows:
Didn’t make much sense
Parroting our own words, often out of context
Occasionally reasonable but minor and not raised by humans
Somewhat aligned with human comments but lacked depth to guide revisions
Summary of concerns:
Overall (5): 3 nonsensical, 2 parroting
Major (9): 4 nonsensical, 3 parroting, 1 reasonable but new, 1 somewhat aligned
Specific (8): 5 nonsensical, 1 parroting, 2 somewhat aligned
When AI comments overlapped with human ones, they mostly echoed things I had already said and were not useful threads to follow. In contrast, some human comments challenged me deeply, like the critique about omitting derivatives from the model. At first, I thought I had handled that well in the original draft, but the reviewers’ probing pushed me to new insights and improvements.
Overall, AI comments might offer some help but miss the deeper, thought-provoking points human reviewers bring. Plus, you lose the fun of guessing who your reviewer might be.
Thus, we have seen a range of performance so far from AI reviewers. In the example I showed in the first installment of this series, the AI meta-review did a reasonable job of picking out critical issues with the manuscript, while also flagging less useful or more obvious points. In this case, human examination showed that several of the points raised by the AI reviewers were nonsensical or simply parroted things we had said in the paper. This suggests that a lot more work needs to go into assessing the performance of AI models before they can be relied upon for actual peer review. Nonetheless, the experience so far suggests to me that AI meta-review can be helpful when preparing a manuscript for publication, and my lab is working these reviews into our pre-submission routine.
I hope to write more in the coming weeks about how we might go about doing a deeper comparison of AI vs. human peer reviews.
That line about a response you can read without cringing hit me. A good review does feel like a tough conversation that leaves something meaningful behind. AI seems to have learned the form but not the resonance.