ChatGPT is transforming peer review – how can we use it responsibly?

Since the artificial-intelligence (AI) chatbot ChatGPT was released in late 2022, computer scientists have noticed a disturbing trend: chatbots are increasingly being used to review research papers that end up in major conference proceedings.

There are several characteristic signs. Reviews written with AI tools stand out for their formal tone and verbosity, traits typically associated with the writing style of large language models (LLMs). For example, words such as “commendable” and “meticulous” are now ten times more common in peer reviews than they were before 2022. AI-generated reviews also tend to be superficial and generic, often failing to refer to specific sections of the submitted manuscript and lacking citations.

That’s what my colleagues and I at Stanford University in California found when we analyzed nearly 50,000 reviews of computer-science papers published in conference proceedings in 2023 and 2024. We estimate that 7–17% of review sentences were written by LLMs, on the basis of writing style and the frequency of certain words (W. Liang et al. Proc. 41st Int. Conf. Mach. Learn. 235, 29575–29620; 2024).
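
For illustration only, the sketch below compares how often a few LLM-associated marker words appear in two collections of review text. It is not the estimation method used in our analysis, which models word distributions across whole vocabularies; the folder names and the marker-word list are assumptions made for the example.

    # Illustration only: compare the rate of LLM-associated marker words in two
    # sets of plain-text reviews. Folder names and word list are assumptions.
    import re
    from collections import Counter
    from pathlib import Path

    MARKER_WORDS = {"commendable", "meticulous", "intricate"}  # illustrative list

    def marker_rate(review_dir: str) -> float:
        """Marker-word occurrences per 1,000 words across all .txt reviews in a folder."""
        total, hits = 0, 0
        for path in Path(review_dir).glob("*.txt"):
            words = re.findall(r"[a-z']+", path.read_text(encoding="utf-8").lower())
            total += len(words)
            counts = Counter(words)
            hits += sum(counts[w] for w in MARKER_WORDS)
        return 1000 * hits / total if total else 0.0

    # Hypothetical folders of reviews collected before and after ChatGPT's release.
    print(f"pre-2022: {marker_rate('reviews_pre_2022'):.2f} per 1,000 words")
    print(f"2023-24:  {marker_rate('reviews_2023_2024'):.2f} per 1,000 words")

A shift in such rates is only a signal, not proof that any individual review was machine-written.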

Lack of time might be one reason for using LLMs to write peer reviews. We found that the proportion of LLM-generated text was higher in reviews submitted close to the deadline. This trend will only intensify: editors are already struggling to secure timely reviews, and reviewers are inundated with requests.

Fortunately, AI systems can help to solve the problem they created. To achieve this, the use of LLMs should be limited to specific tasks, such as correcting language and grammar, answering simple questions about a manuscript and identifying relevant information. Used irresponsibly, however, LLMs risk undermining the integrity of the scientific process. It is therefore crucial and urgent that the scientific community establishes norms for the responsible use of these models in academic peer review.

First, it is important to recognize that the current generation of LLMs cannot replace expert reviewers. Despite their capabilities, LLMs cannot yet provide deep scientific reasoning. They also sometimes generate nonsensical responses, known as hallucinations. A common complaint from researchers whose manuscripts were reviewed by LLMs was that the reviews lacked technical depth, particularly in terms of methodological critique (W. Liang et al. NEJM AI 1, AIoa2400196; 2024). An LLM might also easily overlook errors in a research paper.

With these caveats in mind, thoughtful design and guardrails are needed when deploying LLMs. For reviewers, an LLM chatbot assistant could provide feedback on how to make vague suggestions more actionable for authors before a review is submitted. It could also highlight sections of the paper, potentially missed by the reviewer, that already address questions raised in the review.

To assist editors, LLMs can retrieve and summarize related papers to help them place the work in context, and can verify adherence to submission checklists (for example, to ensure that statistics are reported correctly). These are relatively low-risk LLM applications that could save reviewers and editors time if implemented well.
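
As a sketch of what such a low-risk aid might look like, the example below asks a chat model to check a methods section against a short statistics checklist and return flags for a human editor to verify. The checklist items, prompt wording and model name are assumptions for illustration, and the output is a starting point rather than a verdict.

    # Sketch of a low-risk editorial aid: flag possible gaps in statistical
    # reporting for a human editor to verify. Model name, checklist and prompt
    # are assumptions; the output is a starting point, not a final answer.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    CHECKLIST = [
        "Are sample sizes reported for every experiment?",
        "Are the statistical tests named and suited to the data described?",
        "Are effect sizes or confidence intervals given alongside p-values?",
    ]

    def checklist_report(methods_text: str) -> str:
        prompt = (
            "You are assisting a journal editor. For each checklist item, answer "
            "'yes', 'no' or 'unclear', quoting the relevant sentence if possible.\n\n"
            "Checklist:\n" + "\n".join(f"- {item}" for item in CHECKLIST) +
            "\n\nMethods section:\n" + methods_text
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model; any capable chat model would do
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

The editor still reads the report and checks every flag against the manuscript before acting on it.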

However, LLMs can make mistakes even in low-risk information-retrieval and summarization tasks. Their outputs should therefore be treated as a starting point, not a final answer, and users should always double-check what the model produces.

Journals and conferences might be tempted to use AI algorithms to detect LLM use in reviews and papers, but their effectiveness is limited. Although such detectors can identify obvious cases of AI-generated text, they are prone to false positives, for example, flagging text written by scientists whose first language is not English as AI-generated. Users can also evade detection by prompting LLMs strategically. And detectors often struggle to distinguish reasonable uses of LLMs, such as polishing raw review text, from inappropriate ones, such as using a chatbot to write an entire report.

Ultimately, the best way to prevent AI from dominating peer review is to foster more human interaction during the process. Platforms such as OpenReview encourage reviewers and authors to interact anonymously and to resolve questions through multiple rounds of discussion. OpenReview is now used by several major computer-science conferences and journals.

The wave of LLM use in scientific writing and peer review cannot be stopped. To navigate this transformation, journals and conference venues must establish clear guidelines and implement systems to enforce them. At a minimum, journals should ask reviewers to disclose transparently whether and how they use LLMs during the review process. We also need innovative, interactive peer-review platforms, adapted to the AI era, that can automatically constrain the use of LLMs to a limited set of tasks. In parallel, much more research is needed on how AI can responsibly assist with specific peer-review tasks. Establishing community norms and resources will help to ensure that LLMs benefit reviewers, editors and authors without compromising the integrity of the scientific process.

Competing interests

The author declares no competing interests.