From Code to Courtroom: LLMs as the New Software Judges

Speaker(s):
HE Junda, PhD Candidate, School of Computing and Information Systems, Singapore Management University
Date: 10 June 2025, Tuesday
Time: 4:00pm – 4:30pm
Venue: Meeting Room 4.4, Level 4, School of Computing and Information Systems 1, Singapore Management University, 80 Stamford Road, Singapore 178902

We look forward to seeing you at this research seminar. Please register by 8 June 2025.
About the Talk

Recently, Large Language Models (LLMs) have been increasingly used to automate software engineering (SE) tasks such as code generation and summarization. However, evaluating the quality of LLM-generated software artifacts remains challenging. Human evaluation, while effective, is costly and time-consuming. Traditional automated metrics such as BLEU rely on high-quality references and struggle to capture nuanced aspects of software quality, such as readability and usefulness. In response, the LLM-as-a-Judge paradigm, which employs LLMs for automated evaluation, has emerged.
Given that LLMs are typically trained to align with human judgment and possess strong coding abilities and reasoning skills, they hold promise as cost-effective and scalable surrogates for human evaluators. Nevertheless, LLM-as-a-Judge research in the SE community is still in its early stages, with many breakthroughs needed.
This forward-looking SE 2030 paper aims to steer the research community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts, while also sharing potential research paths to achieve this goal. We provide a literature review of existing SE studies on LLM-as-a-Judge and envision these frameworks as reliable, robust, and scalable human surrogates capable of evaluating software artifacts with consistent, multi-faceted assessments by 2030 and beyond. To validate this vision, we analyze the limitations of current studies, identify key research gaps, and outline a detailed roadmap to guide the future development of LLM-as-a-Judge in software engineering. While not intended to be a definitive guide, our work aims to foster further research on and adoption of LLM-as-a-Judge frameworks within the SE community, ultimately improving the effectiveness and scalability of software artifact evaluation methods.
This is a pre-conference talk for the ACM International Conference on the Foundations of Software Engineering (FSE 2025).

About the Speaker

Junda He is a Ph.D. candidate in Computer Science at the School of Computing and Information Systems, Singapore Management University, under the supervision of Professor David Lo. His research focuses on Large Language Models for Software Engineering (LLM4SE) and Trustworthy AI, with outcomes published in premier venues such as ICSE, TOSEM, TSE, and ASE. He also serves as a reviewer for leading journals and conferences, including CACM, TSE, TOSEM, ACL, and UIST. In recognition of his academic achievements, he has been awarded the SMU Presidential Doctoral Fellowship. Outside of his academic pursuits, he enjoys outdoor sports.