Talk #1: Concretely Mapped Symbolic Memory Locations for Memory Error Detection by TU Haoxin, PhD Candidate
Memory allocation is fundamental for managing memory objects in many programming languages. Misusing allocated memory objects (e.g., buffer overflow and use-after-free) can have catastrophic consequences. Symbolic execution-based approaches have shown great potential but still suffer from fundamental limitations in modeling dynamic memory layouts: they either represent the locations of memory objects as concrete addresses, or represent the locations as simple symbolic variables without sufficient constraints. Such limitations hinder existing symbolic execution engines from effectively detecting certain memory errors. In this study, we propose SymLoc, a symbolic execution-based approach that uses concretely mapped symbolic memory locations to alleviate these limitations. Specifically, SymLoc integrates three techniques: (1) the symbolization of addresses and the encoding of symbolic addresses into path constraints, (2) symbolic memory read/write operations, and (3) automatic tracking of the uses of symbolic memory locations. Our evaluation results show that SymLoc detects 23 more unique spatial memory errors on real-world programs and 8%-64% more temporal memory errors on the Juliet Test Suite than various existing state-of-the-art memory error detectors.
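To give a flavor of the idea, here is a deliberately simplified sketch (not SymLoc's actual engine, and with no real constraint solving): each allocation is represented by a symbolic location that is concretely mapped to an address, and uses of freed symbolic locations are tracked so a temporal error like use-after-free can be flagged.

```python
# Toy model of "concretely mapped symbolic memory locations".
# All names here are illustrative, not taken from SymLoc itself.

class SymbolicHeap:
    def __init__(self):
        self.next_addr = 0x1000   # concrete base address for the mapping
        self.live = {}            # symbolic location -> concrete address
        self.freed = set()        # symbolic locations whose objects were freed

    def malloc(self, size, label):
        """Allocate an object: return a symbolic location backed by a
        concrete address, so the engine can reason about both views."""
        loc = f"sym_{label}"
        self.live[loc] = self.next_addr
        self.next_addr += size
        return loc

    def free(self, loc):
        """Free the object but remember the symbolic location."""
        self.freed.add(loc)
        self.live.pop(loc, None)

    def read(self, loc):
        """A read through a freed symbolic location is a temporal error."""
        if loc in self.freed:
            return "use-after-free"
        return self.live[loc]

heap = SymbolicHeap()
p = heap.malloc(16, "p")
heap.free(p)
print(heap.read(p))  # the tracker reports the temporal memory error
```

In the real system, the symbolic locations would additionally be encoded into path constraints so that a solver can explore aliasing and out-of-bounds cases; this sketch only shows the bookkeeping side.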
Talk #2: Promise and Peril of Collaborative Code Generation Models: Balancing Effectiveness and Memorization by CHEN Zhi, PhD Candidate
In the evolving field of machine learning, training models with datasets from various locations and organizations presents significant challenges due to privacy and legal concerns. Exploring effective collaborative training settings that leverage knowledge from distributed, isolated datasets is crucial. This study investigates key factors affecting the effectiveness of collaborative training methods in code next-token prediction, as well as the correctness and utility of the generated code. We evaluate the memorization of participant data across centralized, federated, and incremental training, highlighting the risks of data leakage. Our findings reveal that dataset size and diversity are pivotal to the success of collaboratively trained code models. Federated learning achieves competitive performance compared to centralized training while offering better data protection, as evidenced by lower memorization ratios. However, federated learning may still produce verbatim code snippets from hidden training data, raising privacy or copyright concerns. We further explore patterns of effectiveness and memorization in incremental learning, emphasizing the sequence of dataset introduction. Additionally, we identify the memorization of cross-organizational clones as a prevalent challenge in centralized and federated learning. Our results underscore the persistent risk of data leakage during inference, even with unseen training data. We conclude with recommendations for optimizing the use of multisource datasets to enhance cross-organizational collaboration.
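As a rough illustration of what "memorization of verbatim code snippets" means in practice, the following sketch flags a generated snippet as memorized when it shares a long verbatim token run with any training sample. This is a hypothetical n-gram check written for this announcement, not the metric used in the paper.

```python
# Toy verbatim-memorization check: does a generated snippet reproduce
# a long token run from the (hidden) training corpus?

def ngrams(tokens, n):
    """All contiguous n-token windows of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_memorized(generated, training_corpus, n=6):
    """True if any n-token window of the generated code appears
    verbatim in some training sample."""
    gen_windows = ngrams(generated.split(), n)
    for sample in training_corpus:
        if gen_windows & ngrams(sample.split(), n):
            return True
    return False

training = ["def add(a, b): return a + b  # internal helper, org A"]
print(is_memorized("def add(a, b): return a + b", training))  # verbatim run
print(is_memorized("print('hello world')", training))         # no overlap
```

A check along these lines makes the talk's point concrete: even when a federated model never ships raw data, its outputs can still be tested for verbatim overlap with each participant's corpus.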
About the Speakers
Haoxin TU is a dual-degree Ph.D. candidate at SMU and DUT (Dalian University of Technology), and he earned his first Ph.D. degree from DUT in December 2023. At SMU, he is supervised by Prof. Lingxiao JIANG and Prof. Xuhua DING. His research focuses on developing practical techniques and tools that help improve the reliability and security of software systems (mainly system software such as compilers and Linux kernels). More information is available at https://haoxintu.github.io/.
CHEN Zhi is a second-year Ph.D. student in Computer Science at Singapore Management University (SMU), under the supervision of Prof. JIANG Lingxiao. His research focuses on evaluating and exploring strategies to enhance code generation models.