VMDT: Decoding the Trustworthiness of Video Foundation Models

Yujin Potter1*, Zhun Wang1*, Nicholas Crispino2*, Kyle Montgomery2*, Alexander Xiong1*, Ethan Y. Chang3, Francesco Pinto4, Yuqi Chen2, Rahul Gupta5, Morteza Ziyadi5, Christos Christodoulopoulos6, Bo Li4, Chenguang Wang2, Dawn Song1
1 University of California, Berkeley
2 University of California, Santa Cruz
3 University of Illinois at Urbana-Champaign
4 University of Chicago
5 Amazon
6 Information Commissioner's Office

NeurIPS 2025 Datasets & Benchmarks

*Indicates Lead Authors

Abstract

As foundation models become more sophisticated, ensuring their trustworthiness becomes increasingly critical; yet, unlike the text and image modalities, video still lacks comprehensive trustworthiness benchmarks. We introduce VMDT (Video-Modal DecodingTrust), the first unified platform for evaluating text-to-video (T2V) and video-to-text (V2T) models across five key trustworthiness dimensions: safety, hallucination, fairness, privacy, and adversarial robustness. Through our extensive evaluation of 7 T2V models and 19 V2T models using VMDT, we uncover several significant insights. For instance, all evaluated open-source T2V models fail to recognize harmful queries and often generate harmful videos, while also exhibiting higher levels of unfairness than image-modality models. In V2T models, unfairness and privacy risks rise with scale, whereas hallucination and adversarial robustness improve, though overall performance remains low. Uniquely among the five dimensions, safety shows no correlation with model size, implying that factors other than scale govern current safety levels. Our findings highlight the urgent need for more robust and trustworthy video foundation models, and VMDT provides a systematic framework for measuring and tracking progress toward this goal.

Examples

Examples of untrustworthy model responses for each perspective

Findings for Each Aspect

Leaderboard

Trustworthiness profiles for each model across the five perspectives. The Average column is the unweighted mean of the five perspective scores.
| Model Type | Model Family | Model Name | Safety | Hallucination | Fairness | Privacy | Adv. Robustness | Average |
|---|---|---|---|---|---|---|---|---|
| T2V | VideoCrafter | VideoCrafter2 | 76.2 | 35.7 | 60.5 | 65.7 | 50.9 | 57.8 |
| T2V | CogVideoX | CogVideoX-5B | 64.1 | 37.8 | 59.5 | 66.2 | 50.9 | 55.7 |
| T2V | OpenSora | OpenSora 1.2 | 66.8 | 37.1 | 54.6 | 65.8 | 59.3 | 56.72 |
| T2V | Vchitect | Vchitect-2.0 | 67.4 | 49.0 | 62.2 | 65.8 | 64.0 | 61.68 |
| T2V | Luma | Luma | 83.3 | 67.6 | 57.3 | 65.1 | 77.2 | 70.1 |
| T2V | Pika | Pika | 59.9 | 63.0 | 51.84 | 64.4 | 71.0 | 62.028 |
| T2V | Nova | Nova Reel | 90.4 | 45.9 | 62.06 | 66.4 | 75.4 | 68.032 |
| V2T | InternVL2.5 | InternVL2.5-1B | 54.1 | 38.2 | 82.0 | 89.1 | 72.1 | 67.1 |
| V2T | InternVL2.5 | InternVL2.5-2B | 50.6 | 47.3 | 76.7 | 82.6 | 77.3 | 66.9 |
| V2T | InternVL2.5 | InternVL2.5-4B | 46.1 | 52.3 | 76.0 | 82.4 | 80.5 | 67.46 |
| V2T | InternVL2.5 | InternVL2.5-8B | 47.5 | 51.8 | 82.1 | 77.9 | 84.4 | 68.74 |
| V2T | InternVL2.5 | InternVL2.5-26B | 50.6 | 60.4 | 78.6 | 75.5 | 86.2 | 70.26 |
| V2T | InternVL2.5 | InternVL2.5-38B | 47.9 | 64.4 | 79.2 | 73.6 | 89.6 | 70.94 |
| V2T | InternVL2.5 | InternVL2.5-78B | 52.7 | 66.0 | 84.5 | 69.4 | 91.1 | 72.74 |
| V2T | Qwen2.5-VL | Qwen2.5-VL-3B-Instruct | 52.0 | 47.0 | 78.0 | 75.0 | 74.7 | 65.34 |
| V2T | Qwen2.5-VL | Qwen2.5-VL-7B-Instruct | 64.0 | 48.5 | 81.4 | 71.4 | 74.4 | 67.94 |
| V2T | Qwen2.5-VL | Qwen2.5-VL-72B-Instruct | 53.2 | 57.2 | 85.5 | 57.0 | 79.1 | 66.4 |
| V2T | VideoLLaMA2 | VideoLLaMA2.1-7B | 52.6 | 38.3 | 80.6 | 85.0 | 71.7 | 65.64 |
| V2T | VideoLLaMA2 | VideoLLaMA2-72B | 51.8 | 46.4 | 79.4 | 100.0 | 77.4 | 71.0 |
| V2T | LLaVA-Video | LLaVA-Video-7B-Qwen2 | 49.1 | 43.9 | 82.3 | 90.7 | 67.7 | 66.74 |
| V2T | LLaVA-Video | LLaVA-Video-72B-Qwen2 | 48.9 | 51.1 | 86.6 | 79.5 | 76.5 | 68.52 |
| V2T | GPT | GPT-4o-mini | 80.9 | 50.4 | 79.4 | 61.6 | 75.4 | 69.54 |
| V2T | GPT | GPT-4o | 86.5 | 57.7 | 86.5 | 39.4 | 80.6 | 70.14 |
| V2T | Claude | Claude-3.5-Sonnet | 98.6 | 53.3 | 83.6 | 42.6 | 71.7 | 69.96 |
| V2T | Nova | Nova Lite | 76.5 | 41.2 | 77.7 | 63.6 | 68.5 | 65.5 |
| V2T | Nova | Nova Pro | 78.7 | 43.6 | 78.5 | 85.0 | 71.0 | 71.36 |
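
As a quick sanity check, the Average column can be reproduced as the unweighted mean of the five perspective scores. The following is a minimal Python sketch using two rows copied from the table above; the dictionary layout and variable names are illustrative only and are not part of the VMDT codebase.

PERSPECTIVES = ["safety", "hallucination", "fairness", "privacy", "adv_robustness"]

# Perspective scores copied from the leaderboard (illustrative subset).
scores = {
    "Luma": [83.3, 67.6, 57.3, 65.1, 77.2],              # T2V example row
    "InternVL2.5-78B": [52.7, 66.0, 84.5, 69.4, 91.1],   # V2T example row
}

for model, vals in scores.items():
    # Unweighted mean over the five perspectives.
    average = sum(vals) / len(vals)
    print(f"{model}: average = {average:.3f}")
    # Prints 70.100 for Luma and 72.740 for InternVL2.5-78B, matching the table.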

BibTeX

@article{potter2025vmdt,
  title={VMDT: Decoding the Trustworthiness of Video Foundation Models},
  author={Potter, Yujin and Wang, Zhun and Crispino, Nicholas and Montgomery, Kyle and Xiong, Alexander and Chang, Ethan Y. and Pinto, Francesco and Chen, Yuqi and Gupta, Rahul and Ziyadi, Morteza and Christodoulopoulos, Christos and Li, Bo and Wang, Chenguang and Song, Dawn},
  journal={Advances in Neural Information Processing Systems},
  year={2025}
}