
PhD Dissertation Proposal by ZHOU Xin | Understanding and Enhancing Large Language Models of Code for Software Engineering Tasks

Understanding and Enhancing Large Language Models of Code for Software Engineering Tasks

ZHOU Xin

PhD Candidate
School of Computing and Information Systems
Singapore Management University
 

Date

18 December 2023 (Monday)

Time

10:00am - 11:00am

Venue

Meeting room 5.1, Level 5
School of Computing and Information Systems 1,
Singapore Management University,
80 Stamford Road
Singapore 178902

Please register by 17 December 2023.

We look forward to seeing you at this research seminar.

About The Talk

Software engineering involves many tasks across different phases, such as design, coding, testing, and deployment. In recent years, to boost developer productivity, numerous research efforts in software engineering have sought to automate manual tasks by applying machine learning techniques.

Since 2020, the emergence of powerful Large Language Models of Code (CodeLLMs) has positioned them as foundational models for a wide range of software engineering tasks. It is crucial to note, however, that CodeLLMs are not a silver bullet for every software engineering task: in certain scenarios, they may not deliver satisfactory performance. Therefore, understanding the limitations of current CodeLLMs and enhancing them are essential areas of study. In brief, the primary aim of this dissertation is to identify the limitations of, or areas requiring improvement in, CodeLLMs. The subsequent goal is to propose enhancements to existing CodeLLMs that effectively address the identified limitations.

In the first study of this dissertation, we identify a specific limitation of existing CodeLLMs that are pre-trained on code snippets and documentation: they struggle to generalize effectively to code changes. To address this issue, in the second study, we propose a new CodeLLM named CCBERT (Code Change BERT) built on the properties of code changes. In the third study, we emphasize the effectiveness of enhancing CodeLLMs by incorporating relevant expert knowledge; to illustrate this idea, we apply it to the task of repairing software vulnerabilities and propose VulMaster. In the fourth study, we analyze how the long-tailed distribution impacts the performance of popular CodeLLMs. As this impact is not yet fully understood, the study deepens our comprehension of CodeLLM limitations from a data-centric perspective, specifically focusing on the long-tailed distributions prevalent in real-world datasets. Lastly, we describe ongoing and planned work to alleviate the negative impact of the long-tailed distribution on CodeLLMs: we plan to propose a solution that accentuates infrequent data in CodeLLMs while preserving their ability to handle more common data.
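The long-tailed distribution mentioned above can be sketched in a few lines of Python. The label names and counts below are purely illustrative (hypothetical vulnerability-type labels, not data from the dissertation); the point is that a small "head" of frequent labels covers most examples while many "tail" labels are rare:

```python
from collections import Counter

# Hypothetical label frequencies illustrating a long-tailed distribution:
# a few "head" labels dominate, while many "tail" labels are rarely seen.
label_counts = Counter({
    "CWE-79": 5000, "CWE-89": 3000,   # head: frequent labels
    "CWE-22": 400, "CWE-434": 90,
    "CWE-611": 25, "CWE-915": 5,      # tail: rare labels
})

def head_coverage(counts, head_fraction=0.2):
    """Share of all examples covered by the most frequent labels."""
    ordered = sorted(counts.values(), reverse=True)
    head_n = max(1, int(len(ordered) * head_fraction))
    return sum(ordered[:head_n]) / sum(ordered)

# With this toy data, the top 20% of labels already cover well over
# half of all examples, leaving little signal for the tail.
print(f"{head_coverage(label_counts):.2f}")
```

A model trained on such data sees the tail labels so rarely that it tends to underperform on them, which is the data-centric limitation the fourth study examines.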

Speaker Biography

ZHOU Xin is a Ph.D. candidate in the School of Computing and Information Systems (SCIS), under the supervision of Prof. David LO. Xin's research focuses on pre-trained code representations and automation for software maintenance and development.