Research Publications

Research & Resources

Exploring the frontiers of AI safety through rigorous academic research and open-source collaboration.

All Publications

Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods

Y Chen, H Li, Y Sui, Y Song, B Hooi

Findings of EMNLP 2025

TopicAttack: An Indirect Prompt Injection Attack via Topic Transition

Y Chen, H Li, Y Li, Y Liu, Y Song, B Hooi

EMNLP 2025

RewardDS: Privacy-Preserving Fine-Tuning for Large Language Models via Reward Driven Data Synthesis

J Wang, J Yang, H Li, H Zhuang, C Chen, Z Zeng

EMNLP 2025

Can Indirect Prompt Injection Attacks Be Detected and Removed?

Y Chen, H Li, Y Sui, Y He, Y Liu, Y Song, B Hooi

ACL 2025

PrivaCI-Bench: Evaluating Privacy with Contextual Integrity and Legal Compliance

H Li, W Hu, H Jing, Y Chen, Q Hu, S Han, T Chu, P Hu, Y Song

ACL 2025

PrivacyRestore: Privacy-Preserving Inference in Large Language Models via Privacy Removal and Restoration

Z Zeng, J Wang, J Yang, Z Lu, H Li, H Zhuang, C Chen

ACL 2025

Simulate and Eliminate: Revoke Backdoors for Generative Large Language Models

H Li, Y Chen, Z Zheng, Q Hu, C Chan, H Liu, Y Song

AAAI 2025

Privacy Checklist: Privacy Violation Detection Grounding on Contextual Integrity Theory

H Li, W Fan, Y Chen, J Cheng, T Chu, X Zhou, P Hu, Y Song

NAACL 2025

Adaptive Differentially Private Structural Entropy Minimization for Unsupervised Social Event Detection

Z Yang, Y Wei, H Li, Q Li, L Jiang, L Sun, X Yu, C Hu, H Peng

CIKM 2024

Privacy-Preserved Neural Graph Databases

Q Hu, H Li, J Bai, Z Wang, Y Song

KDD 2024

PrivLM-Bench: A Multi-level Privacy Evaluation Benchmark for Language Models

H Li, D Guo, D Li, W Fan, Q Hu, X Liu, C Chan, D Yao, Y Yao, Y Song

ACL 2024 (Oral)

Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence

H Li, M Xu, Y Song

Findings of ACL 2023

Multi-step Jailbreaking Privacy Attacks on ChatGPT

H Li, D Guo, W Fan, M Xu, J Huang, F Meng, Y Song

Findings of EMNLP 2023

You Don't Know My Favorite Color: Preventing Dialogue Representations from Revealing Speakers' Private Personas

H Li, Y Song, L Fan

NAACL 2022

Differentially Private Federated Knowledge Graphs Embedding

H Peng, H Li, Y Song, V Zheng, J Li

CIKM 2021

Method and Server for Defending Service from Personal Privacy Inference Attack

Y Song, H Li

US Patent

GSPR: Aligning LLM Safeguards as Generalizable Safety Policy Reasoners

H Li, Y Chen, J Zeng, H Peng, H Jing, W Hu, X Yang, Z Zeng, S Han, et al.

arXiv 2025

Safety Compliance: Rethinking LLM Safety Reasoning through the Lens of Compliance

W Hu, H Jing, H Shi, H Li, Y Song

arXiv 2025

Privacy in Large Language Models: Attacks, Defenses and Future Directions

H Li, Y Chen, J Luo, J Wang, H Peng, Y Kang, X Zhang, Q Hu, C Chan, et al.

arXiv 2023

ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

W Lu, Z Zeng, K Zhang, H Li, H Zhuang, R Wang, C Chen, H Peng

arXiv 2025

SafeMT: Multi-turn Safety for Multimodal Language Models

H Zhu, J Dai, J Ji, H Li, C Cai, P Wen, CM Chan, B Chen, Y Yang, S Han, et al.

arXiv 2025

MASLegalBench: Benchmarking Multi-Agent Systems in Deductive Legal Reasoning

H Jing, W Hu, H Luo, J Yang, W Fan, H Li, Y Song

arXiv 2025

Activation-Guided Local Editing for Jailbreaking Attacks

J Wang, H Li, H Peng, Z Zeng, Z Wang, H Du, Z Yu

arXiv 2025

Legal Rule Induction: Towards Generalizable Principle Discovery from Analogous Judicial Precedents

W Fan, T Zheng, Y Hu, Z Deng, W Wang, B Xu, C Li, H Li, W Shen, Y Song

arXiv 2025

Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction

Y Chen, H Li, Y Sui, Y Liu, Y He, Y Song, B Hooi

arXiv 2025

BATHE: Defense Against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger

Y Chen, H Li, Y Zhang, Z Zheng, Y Song, B Hooi

arXiv 2024