VMDT: Decoding the Trustworthiness of Video Foundation Models
Abstract
As foundation models become more sophisticated, ensuring their trustworthiness becomes increasingly critical; yet, unlike text and image, the video modality still lacks comprehensive trustworthiness benchmarks. We introduce VMDT (Video-Modal DecodingTrust), the first unified platform for evaluating text-to-video (T2V) and video-to-text (V2T) models across five key trustworthiness dimensions: safety, hallucination, fairness, privacy, and adversarial robustness. Through our extensive evaluation of 7 T2V models and 19 V2T models using VMDT, we uncover several significant insights. For instance, all open-source T2V models evaluated fail to recognize harmful queries and often generate harmful videos, while exhibiting higher levels of unfairness compared to image modality models. In V2T models, unfairness and privacy risks rise with scale, whereas hallucination and adversarial robustness improve—though overall performance remains low. Uniquely, safety shows no correlation with model size, implying that factors other than scale govern current safety levels. Our findings highlight the urgent need for developing more robust and trustworthy video foundation models, and VMDT provides a systematic framework for measuring and tracking progress toward this goal.
Examples
Examples of untrustworthy model responses for each perspective
Each Aspect and Findings
Safety
Average harmful content generation rate (HGR) for evaluating the safety of V2T models. Different model families are represented by distinct colors.
We construct a comprehensive safety evaluation dataset comprising 780 prompts for T2V models and 990 prompts for V2T models, spanning 13 and 27 risk categories, respectively. Our risk taxonomy is grounded in established industry policies and benchmarks, while also addressing the unique characteristics of the video modality such as temporal and physical harm risks that cannot be detected in static frames. We design novel scenarios to test the performance of models under diverse conditions, including transformed instructions, synthetic content, and real-world content. Our evaluation and analysis reveal several critical findings: (1) Open-source T2V models universally lack safety refusal mechanisms, while even closed-source models struggle with video-specific risks like temporal and physical harm. (2) T2V models generate less harmful content in transformed scenarios, likely reflecting capability limitations rather than improved safety. (3) A substantial safety gap exists between open and closed-source V2T models, with all open-source variants demonstrating significantly higher harmful content generation rates. (4) Closed-source V2T models like Claude and GPT exhibit better safety overall but remain particularly vulnerable to fraud and deception risks, highlighting critical alignment gaps across all VFMs.
Hallucination
We construct a diverse dataset comprising 1,650 prompts for T2V models and 1,218 prompts for V2T models with the aim of measuring hallucination under different scenarios. Our hallucination dataset incorporates various scenarios including naturally difficult prompts, distraction, misleading, counterfactual reasoning, temporal activities, and OCR. We evaluate these scenarios across a set of tasks focusing on objects, attributes, actions, counting, and spatial understanding, as well as a scene understanding task for V2T. Our analysis reveals the following: 1) For T2V, all evaluated open-source models hallucinate significantly more than closed-source models, Luma and Pika, in nearly all scenarios. 2) Object recognition is the easiest task for T2V models, while OCR presents one of the most challenging scenarios. This aligns with the hallucination results observed in text-to-image (T2I) models, suggesting that T2V and T2I models share common challenges. 3) Within the same model class, an increase in V2T model size is associated with a decrease in hallucination. 4) For V2T models, we find the best-performing model on average is InternVL2.5-78B, an open-source model, which is the opposite of what is seen in T2V models.
Fairness
We construct an extensive dataset comprising 1,086 prompts for T2V models and 5,008 prompts for V2T models. This fairness dataset aims to assess model fairness across various contexts, including social stereotypes (e.g., occupation) and decision-making scenarios (e.g., hiring). We also examine "overkill fairness," where models sacrifice factual/historical accuracy in pursuit of diversity (e.g., generating videos of Black female Founding Fathers). Our evaluation reveals several significant findings: 1) T2V models exhibit substantial overrepresentation towards males, White individuals, and younger people, while demonstrating some degree of overkill fairness. 2) This overrepresentation surpasses that of T2I models, yet shows lower levels of overkill fairness, suggesting a trade-off between these two dimensions. 3) V2T model fairness demonstrates a significant negative correlation with model size, with larger models exhibiting increased unfairness. 4) All V2T models show significant overkill fairness, generating historically inaccurate outputs to promote diversity.
Privacy
We construct a balanced dataset of 1,000 text prompts for T2V models and 200 video samples for V2T models to evaluate the privacy memorization and extractive capabilities, respectively. Our T2V dataset comprises text prompts sampled from a pretraining corpus (i.e., caption-video pairs) used for most contemporary T2V models. Our V2T dataset comprises driving scene videos along with their location information (e.g., zip code) to evaluate inference capabilities to predict sensitive location data. Our evaluation results reveal the following: 1) T2V models generally exhibit weak data memorization. 2) However, we observe that the T2V VideoCrafter2 model sometimes includes watermarks from copyrighted training data in its generated videos, indicating some level of data memorization does occur. 3) Larger V2T models tend to demonstrate stronger location inference, suggesting that privacy risks increase as model size increases.
Adversarial Robustness
We construct a challenging dataset to assess the robustness of T2V and V2T models to adversarial inputs. Our dataset contains 329 prompts for T2V models and 1,523 prompts for V2T models across five tasks: action recognition, attribute recognition, counting, object recognition, and spatial understanding. By attacking selected T2V and V2T surrogate models, we adversarially optimize the inputs. Our findings reveal several important insights.
1) Both T2V and V2T models are vulnerable to adversarial inputs.
2) Among our five tasks, counting and spatial understanding pose the greatest challenge for both T2V and V2T models.
3) The performance gap between open and closed-source T2V models is larger than that of V2T models.
4) Within the same V2T model class, larger models generally demonstrate greater robustness to adversarial inputs than their smaller counterparts.
Leaderboard
| Model Type | Model Family | Model Name | Safety | Hallucination | Fairness | Privacy | Adv Robustness | Average |
|---|---|---|---|---|---|---|---|---|
| T2V | VideoCrafter | VideoCrafter2 | 76.2 | 35.7 | 60.5 | 65.7 | 50.9 | 57.8 |
| CogVideoX | CogVideoX-5B | 64.1 | 37.8 | 59.5 | 66.2 | 50.9 | 55.7 | |
| OpenSora | OpenSora 1.2 | 66.8 | 37.1 | 54.6 | 65.8 | 59.3 | 56.72 | |
| Vchitect | Vchitect-2.0 | 67.4 | 49.0 | 62.2 | 65.8 | 64.0 | 61.68 | |
| Luma | Luma | 83.3 | 67.6 | 57.3 | 65.1 | 77.2 | 70.1 | |
| Pika | Pika | 59.9 | 63.0 | 51.84 | 64.4 | 71.0 | 62.028 | |
| Nova | Nova Reel | 90.4 | 45.9 | 62.06 | 66.4 | 75.4 | 68.032 | |
| V2T | InternVL2.5 | InternVL2.5-1B | 54.1 | 38.2 | 82.0 | 89.1 | 72.1 | 67.1 |
| InternVL2.5-2B | 50.6 | 47.3 | 76.7 | 82.6 | 77.3 | 66.9 | ||
| InternVL2.5-4B | 46.1 | 52.3 | 76.0 | 82.4 | 80.5 | 67.46 | ||
| InternVL2.5-8B | 47.5 | 51.8 | 82.1 | 77.9 | 84.4 | 68.74 | ||
| InternVL2.5-26B | 50.6 | 60.4 | 78.6 | 75.5 | 86.2 | 70.26 | ||
| InternVL2.5-38B | 47.9 | 64.4 | 79.2 | 73.6 | 89.6 | 70.94 | ||
| InternVL2.5-78B | 52.7 | 66.0 | 84.5 | 69.4 | 91.1 | 72.74 | ||
| Qwen2.5-VL | Qwen2.5-VL-3B-Instruct | 52.0 | 47.0 | 78.0 | 75.0 | 74.7 | 65.34 | |
| Qwen2.5-VL-7B-Instruct | 64.0 | 48.5 | 81.4 | 71.4 | 74.4 | 67.94 | ||
| Qwen2.5-VL-72B-Instruct | 53.2 | 57.2 | 85.5 | 57.0 | 79.1 | 66.4 | ||
| VideoLLaMA2 | VideoLLaMA2.1-7B | 52.6 | 38.3 | 80.6 | 85.0 | 71.7 | 65.64 | |
| VideoLLaMA2-72B | 51.8 | 46.4 | 79.4 | 100.0 | 77.4 | 71.0 | ||
| LLaVA-Video | LLaVA-Video-7B-Qwen2 | 49.1 | 43.9 | 82.3 | 90.7 | 67.7 | 66.74 | |
| LLaVA-Video-72B-Qwen2 | 48.9 | 51.1 | 86.6 | 79.5 | 76.5 | 68.52 | ||
| GPT | GPT-4o-mini | 80.9 | 50.4 | 79.4 | 61.6 | 75.4 | 69.54 | |
| GPT-4o | 86.5 | 57.7 | 86.5 | 39.4 | 80.6 | 70.14 | ||
| Claude | Claude-3.5-Sonnet | 98.6 | 53.3 | 83.6 | 42.6 | 71.7 | 69.96 | |
| Nova | Nova Lite | 76.5 | 41.2 | 77.7 | 63.6 | 68.5 | 65.5 | |
| Nova Pro | 78.7 | 43.6 | 78.5 | 85.0 | 71.0 | 71.36 |
BibTeX
@article{potter2025vmdt,
title={VMDT: Decoding the Trustworthiness of Video Foundation Models},
author = {Potter, Yujin and Wang, Zhun and Crispino, Nicholas and Montgomery, Kyle and Xiong, Alexander and Chang, Ethan Y. and Pinto, Francesco and Chen, Yuqi and Gupta, Rahul and Ziyadi, Morteza and Christodoulopoulos, Christos and Li, Bo and Wang, Chenguang and Song, Dawn},
journal={Advances in Neural Information Processing Systems},
year={2025}
}