Journal of East China Normal University (Natural Science) ›› 2025, Vol. 2025 ›› Issue (1): 138-150. doi: 10.3969/j.issn.1000-5641.2025.01.011

• Computer Science •

  • Supported by:
    Open Project of the Key Laboratory of Advanced Theory and Application in Statistics and Data Science, Ministry of Education; Science and Technology Commission of Shanghai Municipality Project (21511100101)

Purging diffusion models through CLIP based fine-tuning

Ping WU, Xin LIN*

  1. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
  • Received:2024-01-11 Online:2025-01-25 Published:2025-01-20
  • Contact: Xin LIN E-mail:xlin@cs.ecnu.edu.cn


Abstract:

Diffusion models have revolutionized text-to-image synthesis, enabling users to generate high-quality and imaginative artworks from simple natural-language prompts. Unfortunately, because the training datasets are large and unfiltered, such models can also generate inappropriate content, including nudity and violence. To deploy these models more safely, we propose a novel method, directional contrastive language-image pre-training (CLIP) loss-based fine-tuning, dubbed CLIF, which uses a directional CLIP loss to suppress the model's ability to generate inappropriate content. CLIF is computationally lightweight and resistant to circumvention. To demonstrate its effectiveness, we also propose categorized toxic prompts (CTP), a benchmark for evaluating the inappropriate-content generation ability of text-to-image diffusion models. Experiments on CTP and the common objects in context (COCO) dataset show that CLIF significantly suppresses inappropriate generation while preserving the model's ability to produce general content.
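At the core of CLIF is a directional CLIP loss, which matches the direction an image embedding moves against the direction the corresponding text embedding moves. The abstract does not include code, so the following is a minimal illustrative sketch with toy vectors standing in for CLIP features; the function name `directional_clip_loss` and the exact formulation shown (one minus the cosine similarity of the two edit directions, as used in prior directional-CLIP work) are assumptions, not the paper's published implementation.

```python
import math

def directional_clip_loss(img_src, img_tgt, txt_src, txt_tgt, eps=1e-8):
    """One common form of directional CLIP loss (an assumption here):
    1 - cos(delta_image, delta_text), where each delta is the shift of
    an embedding between a source condition and a target condition."""
    d_img = [a - b for a, b in zip(img_tgt, img_src)]  # image-space edit direction
    d_txt = [a - b for a, b in zip(txt_tgt, txt_src)]  # text-space edit direction
    dot = sum(a * b for a, b in zip(d_img, d_txt))
    norm = (math.sqrt(sum(a * a for a in d_img))
            * math.sqrt(sum(a * a for a in d_txt)))
    return 1.0 - dot / (norm + eps)

# Toy 4-d vectors standing in for CLIP image/text features.
img_src = [1.0, 0.0, 0.0, 0.0]
img_tgt = [1.0, 1.0, 0.0, 0.0]   # image embedding moved along axis 1
txt_src = [0.0, 1.0, 0.0, 0.0]
txt_tgt = [0.0, 2.0, 0.0, 0.0]   # text embedding moved along the same axis

loss = directional_clip_loss(img_src, img_tgt, txt_src, txt_tgt)
# Aligned edit directions give a loss near 0; opposed directions give ~2.
```

Under this formulation, fine-tuning would minimize the loss so that the change in the generated image (e.g., from unsafe to safe content) tracks the change between the corresponding text prompts; the prompt pairing and the toy vectors above are purely illustrative.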

Key words: text-to-image generative models, security, datasets, diffusion models
