💡 Key Empirical Insights
1. Unsafe images pose greater risks than unsafe text: Unsafe-image/safe-text (UIST) scenarios consistently yield higher ASRs than the unsafe-image/unsafe-text (UIUT) and safe-image/unsafe-text (SIUT) conditions across all models and judges, indicating VLMs' heightened vulnerability to unsafe visual inputs.
2. Open-weight VLMs show the highest vulnerability: These models exhibit the highest ASRs (52-79%) while refusing only 0.3-1.6% of safe inputs, demonstrating significant safety challenges.
3. Closed-weight VLMs achieve moderate safety: Although safer than open-weight models (e.g., Claude-3.5-Sonnet), these models still reach ASRs of up to 67% under certain judges, while maintaining low refusal rates (0-1.2%).
4. Safety-tuned VLMs achieve the lowest ASRs overall, albeit with modestly higher refusal rates: The safety-tuned methods VLGuard and SPA-VL exhibit lower mASR (mean ASR; see the metric sketch after this list) than the open-weight models, but show varying ASRs relative to the closed-weight models and do not consistently achieve the lowest rates. In contrast, our Safe-VLM models trained on HoliSafe keep ASRs below 10% under Claude and below 16% under GPT/Gemini; in particular, Safe-LLaVA-7B achieves a far lower mASR than its counterparts VLGuard-7B and SPA-VL-7B at a similar refusal rate, and Safe-Qwen2.5-VL-32B achieves the lowest ASRs under all judges. All safety-tuned models, however, show slightly higher refusal rates than open- and closed-weight models.
5. Judge consistency in model ranking: While absolute metrics vary by judge, the relative vulnerability ranking (open-weight ≫ closed-weight ≫ safety-tuned in ASR) remains consistent across all evaluation methods.
6. Strong correlation with string matching: Automatic string matching shows high correlation with AI judges (ρ=0.99 with GPT-4o/Gemini), suggesting its viability as a cost-effective safety evaluation method (see the string-matching sketch below).
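To make the metrics above concrete, here is a minimal sketch of how ASR, refusal rate (RR), and mASR could be computed from per-sample judge verdicts. The function names, category labels, and example counts are illustrative assumptions, not the HoliSafe evaluation code.

```python
# Hypothetical illustration of ASR / RR / mASR computation from judge verdicts.
# Names and numbers are assumptions for illustration only.
from statistics import mean

def attack_success_rate(verdicts):
    """verdicts: booleans, True if the judge deems the response unsafe
    for an unsafe (UIST/UIUT/SIUT) input."""
    return 100.0 * sum(verdicts) / len(verdicts)

def refusal_rate(refusals):
    """refusals: booleans, True if the model refused a *safe* input."""
    return 100.0 * sum(refusals) / len(refusals)

def mean_asr(per_category_asr):
    """mASR: average ASR over the unsafe input categories."""
    return mean(per_category_asr.values())

# Example: a model judged unsafe on 13/20 UIST, 9/20 UIUT, and 7/20 SIUT samples.
masr = mean_asr({
    "UIST": attack_success_rate([True] * 13 + [False] * 7),
    "UIUT": attack_success_rate([True] * 9 + [False] * 11),
    "SIUT": attack_success_rate([True] * 7 + [False] * 13),
})
print(f"mASR = {masr:.1f}%")  # 48.3%
```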
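The string-matching judge in insight 6 can likewise be sketched as a simple refusal-phrase check whose per-model ASRs are rank-correlated (Spearman ρ) with an AI judge's. The refusal phrases and the ASR values below are made-up placeholders, not the paper's keyword list or results.

```python
# Minimal sketch of string-matching refusal detection and its rank correlation
# with an AI judge; phrases and numbers are illustrative assumptions.
from scipy.stats import spearmanr

REFUSAL_MARKERS = [
    "i can't assist", "i cannot help", "i'm sorry", "i won't provide",
]

def is_refusal(response: str) -> bool:
    """Treat a response as a refusal if it contains any known refusal phrase."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def string_match_asr(responses):
    """An attack counts as successful when the model does NOT refuse an unsafe prompt."""
    return 100.0 * sum(not is_refusal(r) for r in responses) / len(responses)

# Hypothetical per-model ASRs from string matching vs. an AI judge:
string_asr = [72.1, 65.4, 38.0, 12.5, 7.9]
judge_asr  = [70.3, 66.0, 41.2, 14.1, 8.5]
rho, _ = spearmanr(string_asr, judge_asr)
print(f"Spearman rho = {rho:.2f}")  # close to 1.0 when the model rankings agree
```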