
PhD Dissertation Defense by ZHAO Wei | Evaluating and Enhancing Safety Alignment of Large Language Models


 

Evaluating and Enhancing Safety Alignment of Large Language Models

ZHAO Wei

PhD Candidate
School of Computing and Information Systems
Singapore Management University
 


Research Area

Dissertation Committee

Research Advisor
  • Jun SUN, Professor, School of Computing and Information Systems, Singapore Management University

Committee Members

External Member
  • ZHANG Tianwei, Associate Professor, College of Computing and Data Science (CCDS); Deputy Director, Cyber Security Research Centre @ NTU (CYSREN); Associate Director, NTU Centre in Computational Technologies for Finance (CCTF), Nanyang Technological University
 

Date

6 January 2026 (Tuesday)

Time

9:30am - 10:30am

Venue

Meeting Room 5.1, Level 5
School of Computing and Information Systems 1
Singapore Management University
80 Stamford Road
Singapore 178902

Please register by 4 January 2026.

We look forward to seeing you at this research seminar.

 

ABOUT THE TALK

Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities across diverse applications, yet they remain vulnerable to adversarial attacks through carefully crafted prompts and harmful visual inputs that circumvent safety mechanisms. Despite considerable efforts in reinforcement learning from human feedback (RLHF) and supervised fine-tuning, existing safeguards prove inadequate because these models operate as black boxes that offer no explanation for their decisions, making security vulnerabilities difficult to identify and eliminate. Addressing these challenges fundamentally requires understanding the inner safety mechanisms of these models in order to develop targeted mitigation strategies that can effectively defend against attacks.

This dissertation presents four interconnected contributions to improve LLM and MLLM security through mechanistic understanding. We propose CASPER, a causality analysis framework operating at token, layer, and neuron levels that reveals how RLHF creates brittle overfitting to known harmful prompts. Building on these insights, we introduce Layer-specific Editing (LED), which identifies and realigns critical safety layers to defend against jailbreak attacks while maintaining utility. For MLLMs, we develop SafeCLIP, a zero-shot toxic image detection method that leverages inherent multimodal alignment by repurposing the vision encoder's CLS token, achieving high defense rates with minimal overhead. Finally, we present Q-MLLM, a unified architecture employing two-level vector quantization that simultaneously defends against both adversarial perturbations and toxic visual inputs by creating discrete bottlenecks in visual representations, achieving near-perfect defense rates while preserving multimodal reasoning capabilities.
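
The following is a minimal, illustrative sketch of the zero-shot screening idea behind SafeCLIP described above: compare an image's CLIP embedding against textual descriptions of unsafe and benign content, and flag the image when the unsafe descriptions dominate. The checkpoint name, prompt lists, and threshold are assumptions made for illustration only, not the dissertation's actual configuration, which operates directly on the vision encoder's CLS token.

# Illustrative sketch only (not the dissertation's implementation):
# zero-shot toxic-image screening with CLIP, in the spirit of SafeCLIP.
# The checkpoint, prompts, and threshold below are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical textual anchors for unsafe vs. benign content.
unsafe_prompts = ["a violent or gory scene", "explicit sexual content", "a hateful symbol"]
benign_prompts = ["an ordinary everyday photo", "a scenic landscape", "a portrait of a person"]

def is_toxic(image: Image.Image, threshold: float = 0.5) -> bool:
    """Return True if the image aligns more with the unsafe prompts than the benign ones."""
    inputs = processor(text=unsafe_prompts + benign_prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # image-text similarity scores
    probs = logits.softmax(dim=-1)[0]
    unsafe_score = probs[: len(unsafe_prompts)].sum().item()
    return unsafe_score > threshold

# Example: screen an incoming image before it reaches the MLLM.
# if is_toxic(Image.open("input.jpg")): reject the request or answer from text only.

A screening step of this kind adds only a single extra forward pass through the vision encoder, which is why a zero-shot approach can keep defense overhead low compared with training a separate toxicity classifier.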

 

SPEAKER BIOGRAPHY

Wei ZHAO is a PhD Candidate in Computer Science at Singapore Management University, under the supervision of Professor Jun SUN. His research focuses on improving the safety of large models by understanding and enhancing the inner mechanisms of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). His PhD research addresses critical security vulnerabilities in these models, spanning causality analysis for security evaluation, layer-specific defense methods, and cross-modal safety alignment. His work has been published at prestigious venues including the Conference on Empirical Methods in Natural Language Processing (EMNLP) 2024 and 2025, and has been accepted at the Network and Distributed System Security (NDSS) Symposium 2026. He earned his bachelor's degree in Software Engineering from Tianjin University.