Evaluating and Enhancing Safety Alignment of Large Language Models

ZHAO Wei
PhD Candidate, School of Computing and Information Systems, Singapore Management University
Date: 6 November 2024 (Wednesday)
Time: 1:00pm – 2:00pm
Venue: Meeting Room 5.1, Level 5, School of Computing and Information Systems 1, Singapore Management University, 80 Stamford Road, Singapore 178902

Please register by 5 November 2024. We look forward to seeing you at this research seminar.
ABOUT THE TALK

Large Language Models (LLMs) have transformed the field of natural language processing, but concerns about their security and reliability persist. This dissertation investigates advanced techniques for assessing and improving LLM security. First, we present CASPER, a lightweight causality-analysis framework for evaluating LLM behavior at both the layer and neuron levels. Building on these findings, we introduce Layer-specific Editing (LED), a knowledge-editing-based method that enhances LLM alignment against adversarial attacks. Furthermore, our detailed examination of adversarial suffixes reveals that they act as significant features within LLMs, and that fine-tuning on benign data can degrade safety alignment. This research deepens the understanding of LLM security and offers practical tools for improving model safety alignment.

ABOUT THE SPEAKER

ZHAO Wei is a PhD candidate in Computer Science at the SMU School of Computing and Information Systems, supervised by Prof. SUN Jun. His research focuses on LLM safety.