: Holistic Safety Benchmarking and Modeling for Vision-Language Model

Youngwan Lee¹², Kangsan Kim², Kwanyong Park³, Ilchae Jung¹, Sujin Jang¹,

Seanie Lee², Yong-Ju Lee¹, Sung Ju Hwang²⁴

¹ETRI, ²KAIST, ³University of Seoul ⁴DeepAuto.ai

🤗

Dataset Code (Coming Soon) Leaderboard Examples

⚠️ Warning: this project page contains harmful content. ⚠️

An example of the HoliSafe, a comprehensive dataset that covers all combinations of image and text safeness (safe/unsafe image with safe/unsafe text), and a corresponding evaluation benchmark, HoliSafe-Bench, which poses novel challenges to modern VLMs. Unlike other safety-tuned VLMs (VLGuard and SPA-VL) susceptible to jailbreaks and unsafe responses, SafeLLaVA-7B robustly defends against such attacks.

Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation (HoliSafe-Bench). We further propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images for VLMs. This module endows VLMs with a dual functionality: they not only learn to generate safer responses but can also provide an interpretable harmfulness classification to justify their refusal decisions. A significant advantage of this approach is its modularity; the VGM is designed as a plug-in component, allowing for seamless integration with diverse pre-trained VLMs across various scales. Experiments show that Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe-Bench itself reveals critical vulnerabilities in existing VLM models. We hope that HoliSafe and VGM will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.

HoliSafe: Safety-tuning Dataset & Benchmark

Overview

Unlike prior works that cover only a subset of image-text safeness combinations (e.g., unsafe image with safe text), we introduce a new holistic safety-tuning dataset and benchmark, called HoliSafe, that systematically covers all five image-text safeness combinations: (1) unsafe image + unsafe text (U_iU_t), (2) unsafe image + safe text (U_iS_t), (3) safe image + unsafe text (S_iU_t), (4) safe image + safe text yielding unsafe content (S_iS_t → U), and (5) safe image + safe text yielding safe content (S_iS_t → S).

Statistics

HoliSafe defines a safety taxonomy with 7 main categories with 18 subcategories which are commonly encountered in real-world scenarios. We collect a total of 6,782 images and 15,114 instruction-response pairs. We split the dataset into a training set, 4,983 (73.15%) images, for safetytuning and a test set, 1,796 (26.85%) for Holisafe-Bench. Training and Test splits have 10,215 and 4,031 instruction-response pairs, respectively.

Safety-Tuned VLMs with Visual Guard Module

SafeLLaVA

We propose a novel modular framework centered on a Visual Guard Module (VGM), a lightweight classifier designed to detect visually harmful content. The VGM operates on visual tokens processed and contextualized by the vision encoder and LLM. It pools these final-layer visual tokens into a single global vector, which then serves as the input for classifying the image's harmfulness. This approach yields significant advantages: it enables a dual functionality where the model can simultaneously generate safe responses and perform harmfulness classification; it enhances interpretability by enabling the model to explicitly justify its safety refusals; and its modularity allows for seamless integration into diverse VLMs, as demonstrated by our Safe-VLM series.

Leaderboard on HoliSafe-Bench

Inspired by the LLM-as-judge paradigm, we employ three closed-weight, proprietary AI models, e.g., Claude-3.5-Sonnet, GPT-4o and Gemini-2.0-Flash, as automated safety judges. Since our HoliSafe dataset's instruction–response pairs were generated using GPT-4o, employing diverse judges rather than relying solely on GPT-4o was crucial to mitigate evaluation bias and ensure robust, impartial assessment. In addition, we employ automatic string matching evaluation widely used in LLM benchmarks, by classifying any response to a harmful request into safe if the response contains one of the predefined refusal responses, e.g., "I can't assist". We quantify performance using Attack Success Rate (ASR)—the fraction of unsafe inputs for which a VLM erroneously complies—and Refusal Rate (RR), defined as one minus the proportion of safe inputs for which the VLM correctly provides a helpful answer. mASR denotes mean Attack Success Rate across all unsafe types.

We extensively benchmark 17 VLM models, including both open-source and proprietary models, on our HoliSafe-Bench using three proprietary AI models as judges. Purple denotes open-weight VLMs, green means closed-weight VLMs, and red means safety-tuned VLMs. The best-performing model in each category is in-bold, and the second best is underlined.

💡 Key Empirical Insights

1. Unsafe images pose greater risks than unsafe text: Analysis shows that U_IS_T scenarios consistently yield higher ASRs compared to U_IU_T and S_IU_T conditions across all models and judges, indicating VLMs' heightened vulnerability to unsafe visual inputs.

2. Open-weight VLMs show highest vulnerability: These models exhibit the highest ASRs (52-79%) with refusal rates of 0.3-1.6% on safe inputs, demonstrating significant safety challenges.

3. Closed-weight VLMs achieve moderate safety: While showing improved safety (e.g., Claude-3.5-Sonnet), these models still face challenges with ASRs up to 67% under certain judges, though maintaining low refusal rates (0-1.2%).

4. Safety-tuned VLMs achieve the lowest ASRs overall, albeit with modestly higher refusal rates.: Safety-tuned methods VLGuard and SPA-VL exhibit lower mASR compared to the open-weight model, but show varying ASR against the closed-weight model and do not consistently achieve the lowest rate. However, our Safe-VLM models, trained on HoliSafe, showcase lower ASRs below 10% under Claude and below 16% under GPT/Gemini; in particular, Safe-LLaVA-7B achieves lower mASR with similar RR than counterparts, VLGuard-7B and SPA-VL-7B, by large margins. Furthermore, Safe-Qwen2.5-VL-32B achieves the lowest ASRs under all judges. However, all safety-tuned models show slightly increased refusal rates compared to open and closed weight models.

5. Judge consistency in model ranking: While absolute metrics vary by judge, the relative safety ranking (open-weight ≫ closed-weight ≫ safety-tuned) remains consistent across all evaluation methods.

6. Strong correlation with string matching: Automatic string matching shows high correlation with AI judges (ρ=0.99 with GPT-4o/Gemini), suggesting its viability as a cost-effective safety evaluation method.

Other Results

Our SafeLLaVA outperforms other safety-tuned VLMs such as VLGuard and SPA-VL based on the same LLaVA-v1.5 backbone on different VLM safety benchmarks.

We evaluate our framework's effectiveness by comparing our Safe-VLM series against their baselines on the safety-utility trade-off. For this analysis, the safety rate is computed as 1 minus the mean attack success rate. The results demonstrate a dramatic improvement in safety across all models and scales; our Safe-VLM series consistently achieves a safety rate exceeding 91%, a substantial leap from the baselines' 21-48% range. Critically, this significant safety enhancement is achieved with a minimal impact on utility, as Helpfulness scores decrease by a negligible 0-1.2 percentage points. This outcome validates that our modular approach effectively enhances VLM safety without sacrificing core instruction-following capabilities, thus achieving a highly favorable safety-utility balance. Furthermore, compared to Guard models such as LLaMA-Guard4-12B, LLaMA-Guard4-12B, LLaMA-Guard3-11B-Vision, LLaVAGuard-7B and ShieldGemma2-4B-IT our Safe-VLM with VGM excels in guard-style classification accuracy as well as critically maintains its robust instruction-following VLM capabilities. This unique duality allows it to both generate safe responses and provide explicit input safety classifications, offering vital interpretability and effectively bridging the gap between pure safety classifiers and safe vision-language instruction models.

Further Analysis

Qualitative Comparisons

BibTeX


      @article{lee2025holisafe,
        title={HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model},
        author={Lee, Youngwan and Kim, Kangsan and Park, Kwanyong and Jung, Ilcahe and Jang, Soojin and Lee, Seanie and Lee, Yong-Ju and Hwang, Sung Ju},
        journal={arXiv preprint arXiv:2506.04704},
        year={2025},
        url={https://arxiv.org/abs/2506.04704},
        archivePrefix={arXiv},
        eprint={2506.04704},
        primaryClass={cs.AI},
      }