Research & Resources
Exploring the frontiers of AI safety through rigorous academic research and open-source collaboration.
Featured Publications
GoldCoin: Grounding Large Language Models in Privacy Laws via Contextual Integrity Theory
W Fan, H Li, Z Deng, W Wang, Y Song
A framework that grounds LLMs in privacy laws using Contextual Integrity theory, enabling them to assess whether information flows comply with legal norms.
MCIP: Protecting MCP Safety via Model Contextual Integrity Protocol
H Jing, H Li, W Hu, Q Hu, H Xu, T Chu, P Hu, Y Song
Introduces the Model Contextual Integrity Protocol (MCIP) to safeguard Model Context Protocol (MCP) systems from privacy and safety violations.
Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning
W Hu, H Li, H Jing, Q Hu, Z Zeng, S Han, H Xu, T Chu, P Hu, Y Song
Leverages reinforcement learning to enhance reasoning capabilities for context-aware privacy and safety compliance in AI systems.
All Publications
Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods
Y Chen, H Li, Y Sui, Y Song, B Hooi
TopicAttack: An Indirect Prompt Injection Attack via Topic Transition
Y Chen, H Li, Y Li, Y Liu, Y Song, B Hooi
RewardDS: Privacy-Preserving Fine-Tuning for Large Language Models via Reward Driven Data Synthesis
J Wang, J Yang, H Li, H Zhuang, C Chen, Z Zeng
Can Indirect Prompt Injection Attacks Be Detected and Removed?
Y Chen, H Li, Y Sui, Y He, Y Liu, Y Song, B Hooi
PrivaCI-Bench: Evaluating Privacy with Contextual Integrity and Legal Compliance
H Li, W Hu, H Jing, Y Chen, Q Hu, S Han, T Chu, P Hu, Y Song
PrivacyRestore: Privacy-Preserving Inference in Large Language Models via Privacy Removal and Restoration
Z Zeng, J Wang, J Yang, Z Lu, H Li, H Zhuang, C Chen
Simulate and Eliminate: Revoke Backdoors for Generative Large Language Models
H Li, Y Chen, Z Zheng, Q Hu, C Chan, H Liu, Y Song
Privacy Checklist: Privacy Violation Detection Grounding on Contextual Integrity Theory
H Li, W Fan, Y Chen, J Cheng, T Chu, X Zhou, P Hu, Y Song
Adaptive Differentially Private Structural Entropy Minimization for Unsupervised Social Event Detection
Z Yang, Y Wei, H Li, Q Li, L Jiang, L Sun, X Yu, C Hu, H Peng
Privacy-Preserved Neural Graph Databases
Q Hu, H Li, J Bai, Z Wang, Y Song
PrivLM-Bench: A Multi-level Privacy Evaluation Benchmark for Language Models
H Li, D Guo, D Li, W Fan, Q Hu, X Liu, C Chan, D Yao, Y Yao, Y Song
Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence
H Li, M Xu, Y Song
Multi-step Jailbreaking Privacy Attacks on ChatGPT
H Li, D Guo, W Fan, M Xu, J Huang, F Meng, Y Song
You Don't Know My Favorite Color: Preventing Dialogue Representations from Revealing Speakers' Private Personas
H Li, Y Song, L Fan
Differentially Private Federated Knowledge Graphs Embedding
H Peng, H Li, Y Song, V Zheng, J Li
Method and Server for Defending Service from Personal Privacy Inference Attack
Y Song, H Li
GSPR: Aligning LLM Safeguards as Generalizable Safety Policy Reasoners
H Li, Y Chen, J Zeng, H Peng, H Jing, W Hu, X Yang, Z Zeng, S Han, et al.
Safety Compliance: Rethinking LLM Safety Reasoning through the Lens of Compliance
W Hu, H Jing, H Shi, H Li, Y Song
Privacy in Large Language Models: Attacks, Defenses and Future Directions
H Li, Y Chen, J Luo, J Wang, H Peng, Y Kang, X Zhang, Q Hu, C Chan, et al.
ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior
W Lu, Z Zeng, K Zhang, H Li, H Zhuang, R Wang, C Chen, H Peng
SafeMT: Multi-turn Safety for Multimodal Language Models
H Zhu, J Dai, J Ji, H Li, C Cai, P Wen, CM Chan, B Chen, Y Yang, S Han, et al.
MASLegalBench: Benchmarking Multi-Agent Systems in Deductive Legal Reasoning
H Jing, W Hu, H Luo, J Yang, W Fan, H Li, Y Song
Activation-Guided Local Editing for Jailbreaking Attacks
J Wang, H Li, H Peng, Z Zeng, Z Wang, H Du, Z Yu
Legal Rule Induction: Towards Generalizable Principle Discovery from Analogous Judicial Precedents
W Fan, T Zheng, Y Hu, Z Deng, W Wang, B Xu, C Li, H Li, W Shen, Y Song
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction
Y Chen, H Li, Y Sui, Y Liu, Y He, Y Song, B Hooi
BATHE: Defense Against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger
Y Chen, H Li, Y Zhang, Z Zheng, Y Song, B Hooi