Talk #1: Concretely Mapped Symbolic Memory Locations for Memory Error Detection by TU Haoxin, PhD Candidate
Memory allocation is fundamental for managing memory objects in many programming languages. Misusing allocated memory objects (e.g., buffer overflow and use-after-free) can have catastrophic consequences. Symbolic execution-based approaches have shown great potential but still suffer from fundamental limitations in modeling dynamic memory layouts: they either represent the locations of memory objects as concrete addresses, or represent the locations as simple symbolic variables without sufficient constraints. Such limitations hinder existing symbolic execution engines from effectively detecting certain memory errors. In this study, we propose SymLoc, a symbolic execution-based approach that uses concretely mapped symbolic memory locations to alleviate these limitations. Specifically, SymLoc integrates three techniques: (1) the symbolization of addresses and the encoding of symbolic addresses into path constraints, (2) symbolic memory read/write operations, and (3) automatic tracking of the uses of symbolic memory locations. Our evaluation results show that SymLoc detects 23 more unique spatial memory errors on real-world programs and 8%-64% more temporal memory errors on the Juliet Test Suite than various existing state-of-the-art memory error detectors.
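To give a flavor of the idea, here is a deliberately simplified sketch (not SymLoc's actual engine, and with no real constraint solving): each allocation is represented by a symbolic location that is concretely mapped to an address, and uses of freed symbolic locations are tracked so a temporal error like use-after-free can be flagged.

```python
# Toy model of "concretely mapped symbolic memory locations".
# All names here are illustrative, not taken from SymLoc itself.

class SymbolicHeap:
    def __init__(self):
        self.next_addr = 0x1000   # concrete base address for the mapping
        self.live = {}            # symbolic location -> concrete address
        self.freed = set()        # symbolic locations whose objects were freed

    def malloc(self, size, label):
        """Allocate an object: return a symbolic location backed by a
        concrete address, so the engine can reason about both views."""
        loc = f"sym_{label}"
        self.live[loc] = self.next_addr
        self.next_addr += size
        return loc

    def free(self, loc):
        """Free the object but remember the symbolic location."""
        self.freed.add(loc)
        self.live.pop(loc, None)

    def read(self, loc):
        """A read through a freed symbolic location is a temporal error."""
        if loc in self.freed:
            return "use-after-free"
        return self.live[loc]

heap = SymbolicHeap()
p = heap.malloc(16, "p")
heap.free(p)
print(heap.read(p))  # the tracker reports the temporal memory error
```

In the real system, the symbolic locations would additionally be encoded into path constraints so that a solver can explore aliasing and out-of-bounds cases; this sketch only shows the bookkeeping side.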
Talk #2: Promise and Peril of Collaborative Code Generation Models: Balancing Effectiveness and Memorization by CHEN Zhi, PhD Candidate
In the evolving field of machine learning, training models with datasets from various locations and organizations presents significant challenges due to privacy and legal concerns. Exploring effective collaborative training settings that leverage knowledge from distributed, isolated datasets is crucial. This study investigates key factors affecting the effectiveness of collaborative training methods in code next-token prediction, as well as the correctness and utility of the generated code. We evaluate the memorization of participant data across centralized, federated, and incremental training, highlighting the risks of data leakage. Our findings reveal that dataset size and diversity are pivotal to the success of collaboratively trained code models. Federated learning achieves competitive performance compared to centralized training while offering better data protection, as evidenced by lower memorization ratios. However, federated learning may still produce verbatim code snippets from hidden training data, raising privacy or copyright concerns. We further explore patterns of effectiveness and memorization in incremental learning, emphasizing the sequence of dataset introduction. Additionally, we identify the memorization of cross-organizational clones as a prevalent challenge in centralized and federated learning. Our results underscore the persistent risk of data leakage during inference, even with unseen training data. We conclude with recommendations for optimizing the use of multisource datasets to enhance cross-organizational collaboration.
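As a rough illustration of what "memorization of verbatim code snippets" means in practice, the following sketch flags a generated snippet as memorized when it shares a long verbatim token run with any training sample. This is a hypothetical n-gram check written for this announcement, not the metric used in the paper.

```python
# Toy verbatim-memorization check: does a generated snippet reproduce
# a long token run from the (hidden) training corpus?

def ngrams(tokens, n):
    """All contiguous n-token windows of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_memorized(generated, training_corpus, n=6):
    """True if any n-token window of the generated code appears
    verbatim in some training sample."""
    gen_windows = ngrams(generated.split(), n)
    for sample in training_corpus:
        if gen_windows & ngrams(sample.split(), n):
            return True
    return False

training = ["def add(a, b): return a + b  # internal helper, org A"]
print(is_memorized("def add(a, b): return a + b", training))  # verbatim run
print(is_memorized("print('hello world')", training))         # no overlap
```

A check along these lines makes the talk's point concrete: even when a federated model never ships raw data, its outputs can still be tested for verbatim overlap with each participant's corpus.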
About the Speakers
Haoxin TU is a dual-degree Ph.D. candidate at SMU and DUT (Dalian University of Technology), and he earned his first Ph.D. degree from DUT in December 2023. At SMU, he is supervised by Prof. Lingxiao JIANG and Prof. Xuhua DING. His research focuses on developing practical techniques and tools that help improve the reliability and security of software systems (mainly system software such as compilers and Linux kernels). More information is available at https://haoxintu.github.io/.
CHEN Zhi is a second-year Ph.D. student in Computer Science at Singapore Management University (SMU), under the supervision of Prof. JIANG Lingxiao. His research focuses on evaluating and exploring strategies to enhance code generation models.