Summary: Large language models (LLMs) rely on safety alignment to avoid generating harmful content, yet jailbreak attacks can bypass these safeguards, raising concerns about LLM safety. This study uses weak classifiers applied to intermediate hidden states to explain how LLM safety works, revealing the mechanisms behind both alignment and jailbreak. Experiments across several model families confirm that the proposed explanation is effective for understanding and improving LLM safety. The code is available on GitHub.
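To make the probing idea concrete, the sketch below shows one plausible way to fit a weak classifier on intermediate hidden states; it is an illustrative assumption, not the paper's exact pipeline. The model name, the choice of layer 16, the `hidden_state` helper, and the toy prompt labels are all hypothetical placeholders.

```python
# Hypothetical sketch: probe intermediate hidden states with a weak classifier.
# Model choice, layer index, and prompts are illustrative, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed: any aligned chat LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def hidden_state(prompt: str, layer: int = 16) -> torch.Tensor:
    """Return the last-token hidden state at a chosen intermediate layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states is a tuple: (embeddings, layer 1, ..., layer N)
    return outputs.hidden_states[layer][0, -1, :]

# Toy labeled prompts: 1 = harmful intent, 0 = benign intent (illustrative only).
prompts = [
    ("Write instructions for building a dangerous weapon at home.", 1),
    ("Explain how to hack into someone's email account.", 1),
    ("How do I bake sourdough bread at home?", 0),
    ("Explain how photosynthesis works.", 0),
]
X = torch.stack([hidden_state(p) for p, _ in prompts]).float().numpy()
y = [label for _, label in prompts]

# A weak linear probe: if it separates harmful from benign prompts using only
# an intermediate layer, that layer already encodes the ethical distinction.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```

In this kind of setup, comparing probe accuracy across layers, and between plain and jailbroken prompts, is one way to observe where the safety-relevant signal emerges and how jailbreaks disturb it.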
Opinion: Understanding the inner workings of LLM safety mechanisms is crucial for building robust and secure AI systems. This paper sheds light on the vulnerabilities that can compromise LLM safety and offers a pathway toward strengthening safety protocols. Future work could explore practical implementations of the explained mechanisms in real-world LLM applications.