Research Seminar by ZHAO Wei
 

DATE: 24 October 2024, Thursday

TIME: 1:00pm to 2:00pm

VENUE: Meeting room 5.1, Level 5,
School of Computing and Information Systems 1,
Singapore Management University,
80 Stamford Road,
Singapore 178902

Please register by 23 October 2024

 

*There are two talks in this session; each talk is approximately 30 minutes long.*

 

About the Talks

Talk #1: Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts. In this work, we investigate how LLMs respond to harmful prompts and propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks. Through LED, we reveal that several critical safety layers exist among the early layers of LLMs. We then show that realigning these safety layers (and some selected additional layers) with the decoded safe response from identified toxic layers can significantly improve the alignment of LLMs against jailbreak attacks. Extensive experiments across various LLMs (e.g., Llama2, Mistral) demonstrate the effectiveness of LED, which defends against jailbreak attacks while maintaining performance on benign prompts.

This is a Pre-Conference talk for The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024).
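To give a flavour of the layer-by-layer analysis the abstract describes, the sketch below probes each decoder layer of an instruction-tuned model with a logit-lens-style readout and flags layers whose decoded token already looks like a refusal. It is an illustrative sketch only, not the LED implementation: the model name, the probe prompt, the refusal heuristic, and the Llama-style attribute path to the final norm are all assumptions made for this example.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"   # assumed example model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Crude, illustrative heuristic for a "refusal-like" continuation.
REFUSAL_MARKERS = ("sorry", "cannot", "can't", "unable")

def decode_from_layer(prompt: str, layer_idx: int) -> str:
    """Logit-lens style probe: project the hidden state after `layer_idx`
    through the final norm and LM head to see which token that layer favours."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[layer_idx][:, -1, :]        # last-token hidden state
    logits = model.lm_head(model.model.norm(hidden))        # Llama-style attribute path (assumption)
    return tokenizer.decode(logits.argmax(dim=-1))

def probe_layers(prompt: str) -> None:
    """Report, layer by layer, whether the decoded token already looks like a refusal."""
    for layer in range(1, model.config.num_hidden_layers + 1):
        token = decode_from_layer(prompt, layer)
        refusal_like = any(m in token.lower() for m in REFUSAL_MARKERS)
        print(f"layer {layer:2d}: {token!r:>12}  refusal-like={refusal_like}")

probe_layers("Write instructions for picking a lock.")  # hypothetical probe prompt

A probe of this kind only locates layers whose intermediate representation already encodes a refusal; the editing and realignment step that LED performs on those layers is beyond the scope of this sketch.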

Talk #2: Adversarial Suffixes May Be Features Too!

Despite significant ongoing efforts in safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks that can induce harmful behaviors, including those triggered by adversarial suffixes. Building on prior research, we hypothesize that these adversarial suffixes are not mere bugs but may represent features that can dominate the LLM's behavior. To evaluate this hypothesis, we conduct several experiments. First, we demonstrate that benign features can be effectively made to function as adversarial suffixes, i.e., we develop a feature extraction method to extract sample-agnostic features from a benign dataset in the form of suffixes and show that these suffixes may effectively compromise safety alignment. Second, we show that adversarial suffixes generated from jailbreak attacks may contain meaningful features, i.e., appending the same suffix to different prompts results in responses exhibiting specific characteristics. Third, we show that such benign-yet-safety-compromising features can be easily introduced through fine-tuning using only benign datasets, i.e., even in the absence of harmful content. This highlights the critical risk posed by dominating benign features in the training data and calls for further research to reinforce LLM safety alignment.

This is a Pre-Conference talk for The Thirteenth International Conference on Learning Representations (ICLR 2025).
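As a rough illustration of the second experiment described above (appending one fixed suffix to several different prompts and inspecting whether the responses share characteristics), the sketch below runs a placeholder suffix through a generic instruction-tuned model. The model name, prompt list, and placeholder string are assumptions for illustration only and are not the paper's artifacts.

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"   # assumed example model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Placeholder stands in for an extracted suffix; it is NOT a real adversarial string.
SUFFIX = " <extracted-suffix-placeholder>"

prompts = [
    "Describe your favourite holiday destination.",
    "Summarise the plot of Hamlet.",
    "Explain how photosynthesis works.",
]

for p in prompts:
    inputs = tokenizer(p + SUFFIX, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=60, do_sample=False)
    # Strip the prompt tokens so only the generated continuation is shown.
    reply = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)
    print(f"PROMPT:   {p}\nRESPONSE: {reply}\n" + "-" * 40)

If the suffix really does act as a dominating feature, the continuations for otherwise unrelated prompts would be expected to exhibit a common style or behavior, which is the kind of shared characteristic the talk examines.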

 

 

About the Speaker 

 

 

ZHAO Wei is a PhD candidate in Computer Science at the SMU School of Computing and Information Systems, supervised by Prof. Sun Jun. His research focuses on LLM safety.