Video-MME

The First-Ever Comprehensive Evaluation Benchmark of
Multi-modal LLMs in Video Analysis

Introduction

In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work is distinguished from existing benchmarks through four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in temporal dimension, encompassing short-, medium-, and long-term videos, ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides video frames, including subtitles and audio, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. In total, 900 videos spanning 254 hours are manually selected and annotated by repeatedly viewing the full video content, resulting in 2,700 question-answer pairs. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including the GPT-4 series and Gemini 1.5 Pro, as well as open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, reaching an average accuracy of 75% and significantly outperforming the open-source models; GPT-4o follows at 71.9%. The results also demonstrate that Video-MME is a universal benchmark that applies to both image and video MLLMs. Further analysis indicates that subtitle and audio information can significantly enhance video understanding. Moreover, all models exhibit a decline in performance as video duration increases. Our dataset, along with these findings, underscores the need for further improvements in handling longer sequences and multi-modal data, shedding light on future MLLM development.
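To make the setup concrete, the sketch below shows what one of the 2,700 multiple-choice question-answer pairs might look like. The field names (video_id, duration_category, domain, sub_category, options, answer) are illustrative assumptions for this example, not the schema of the official data release.

    # Hypothetical illustration of a single Video-MME-style entry; all field
    # names are assumptions for this sketch, not the official release format.
    example_entry = {
        "video_id": "example_001",            # assumed identifier
        "duration_category": "long",          # one of: short / medium / long
        "domain": "Knowledge",                # one of the 6 primary visual domains
        "sub_category": "Humanity & History", # one of the 30 subfields
        "question": "What event does the narrator describe at the end of the video?",
        "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
        "answer": "C",                        # single correct choice, manually annotated
    }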

Leaderboard

Accuracy scores on Video-MME are reported for short, medium, and long videos, both with and without the corresponding subtitles as input.

Short Video: < 2min          Medium Video: 4min ~ 15min          Long Video: 30min ~ 60min

By default, the leaderboard is sorted by the overall results with subtitles.
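The leaderboard numbers are plain multiple-choice accuracies, aggregated per duration category and per subtitle condition. A minimal sketch of that aggregation, assuming each prediction record carries a duration category, a subtitle flag, and a correctness bit (all hypothetical field names), might look like:

    from collections import defaultdict

    def aggregate_accuracy(records):
        """records: iterable of dicts with assumed keys 'duration'
        ('short'/'medium'/'long'), 'with_subs' (bool), 'correct' (bool)."""
        hits, totals = defaultdict(int), defaultdict(int)
        for r in records:
            key = (r["duration"], r["with_subs"])
            totals[key] += 1
            hits[key] += int(r["correct"])
        # Per-category accuracy plus the overall score for each subtitle condition.
        acc = {k: 100.0 * hits[k] / totals[k] for k in totals}
        for subs in (False, True):
            keys = [k for k in totals if k[1] == subs]
            acc[("overall", subs)] = 100.0 * sum(hits[k] for k in keys) / sum(totals[k] for k in keys)
        return acc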

| # | Model | Organization | LLM Params | Frames | Date | Overall % (w/o subs / w subs) | Short % (w/o / w) | Medium % (w/o / w) | Long % (w/o / w) |
|---|-------|--------------|------------|--------|------|-------------------------------|-------------------|--------------------|-------------------|
| 1 | Gemini 1.5 Pro | Google | - | 1/0.5 fps 1* | 2024-06-15 | 75.0 / 81.3 | 81.7 / 84.5 | 74.3 / 81.0 | 67.4 / 77.4 |
| 2 | Qwen2-VL | Alibaba | 72B | 768 3* | 2024-08-19 | 71.2 / 77.8 | 80.1 / 82.2 | 71.3 / 76.8 | 62.2 / 74.3 |
| 3 | GPT-4o | OpenAI | - | 384 2* | 2024-06-15 | 71.9 / 77.2 | 80.0 / 82.8 | 70.3 / 76.6 | 65.3 / 72.1 |
| 4 | LLaVA-NeXT-Video | Bytedance & NTU S-Lab | 72B | 64 | 2024-08-28 | 70.6 / 76.9 | 81.4 / 82.8 | 68.9 / 75.6 | 61.5 / 72.5 |
| 5 | Gemini 1.5 Flash | Google | - | 1/0.5 fps 1* | 2024-06-15 | 70.3 / 75.0 | 78.8 / 79.8 | 68.8 / 74.7 | 61.1 / 68.8 |
| 6 | LLaVA-OneVision | Bytedance & NTU S-Lab | 72B | 32 | 2024-08-08 | 66.3 / 69.6 | 76.7 / 79.3 | 62.2 / 66.9 | 60.0 / 62.4 |
| 7 | GPT-4o mini | OpenAI | - | 250 | 2024-07-21 | 64.8 / 68.9 | 72.5 / 74.9 | 63.1 / 68.3 | 58.6 / 63.4 |
| 8 | Oryx | THU & Tencent & NTU | 34B | 64 | 2024-09-24 | 63.2 / 67.6 | 72.9 / 76.1 | 62.8 / 68.5 | 54.0 / 58.2 |
| 9 | VideoLLaMA 2 | Alibaba | 72B | 32 | 2024-08-29 | 62.4 / 64.7 | 69.8 / 72.0 | 59.9 / 63.0 | 57.6 / 59.0 |
| 10 | VILA-1.5 | NVIDIA & MIT | 34B | 14 | 2024-07-21 | 62.3 / 64.1 | 72.0 / 74.0 | 61.2 / 62.6 | 53.8 / 55.7 |
| 11 | MiniCPM-V 2.6 | OpenBMB | 8B | 64 | 2024-08-12 | 60.9 / 63.7 | 71.3 / 73.5 | 59.4 / 61.1 | 51.8 / 56.3 |
| 12 | GPT-4V | OpenAI | - | 10 | 2024-06-15 | 59.9 / 63.3 | 70.5 / 73.2 | 55.8 / 59.7 | 53.5 / 56.9 |
| 13 | Claude 3.5 Sonnet | Anthropic | - | 20 | 2024-07-30 | 60.0 / 62.9 | 71.0 / 73.5 | 57.4 / 60.1 | 51.2 / 54.7 |
| 14 | InternVL2 | Shanghai AI Lab | 34B | 16 | 2024-07-18 | 61.2 / 62.4 | 72.0 / 72.8 | 59.1 / 61.3 | 52.6 / 53.0 |
| 15 | VITA | Tencent Youtu Lab & NJU | 8×7B | 32 | 2024-09-08 | 55.8 / 59.2 | 65.9 / 70.4 | 52.9 / 56.2 | 48.6 / 50.9 |
| 16 | Kangaroo | Meituan & UCAS | 8B | 64 | 2024-07-23 | 56.0 / 57.6 | 66.1 / 68.0 | 55.3 / 55.4 | 46.6 / 49.3 |
| 17 | Video-CCAM | QQMM | 14B | 96 | 2024-07-16 | 53.2 / 57.4 | 62.2 / 66.0 | 50.6 / 56.3 | 46.7 / 49.9 |
| 18 | Long-LLaVA | Amazon | 7B | 64 | 2024-09-09 | 52.9 / 57.1 | 61.9 / 66.2 | 51.4 / 54.7 | 45.4 / 50.3 |
| 19 | LongVA | NTU S-Lab | 7B | 128 | 2024-06-25 | 52.6 / 54.3 | 61.1 / 61.6 | 50.4 / 53.6 | 46.2 / 47.6 |
| 20 | InternVL-Chat-V1.5 | Shanghai AI Lab | 20B | 10 | 2024-06-15 | 50.7 / 52.4 | 60.2 / 61.7 | 46.4 / 49.1 | 45.6 / 46.6 |
| 21 | Qwen-VL-Max | Alibaba | - | 4 | 2024-06-15 | 51.3 / 51.2 | 55.8 / 57.6 | 49.2 / 48.9 | 48.9 / 47.0 |
| 22 | ShareGemini | XMU | 7B | 64 | 2024-06-20 | 43.2 / 47.9 | 49.1 / 52.8 | 41.3 / 47.3 | 39.1 / 43.4 |
| 23 | SliME | CASIA | 8B | 8 | 2024-07-16 | 45.3 / 47.2 | 53.3 / 55.4 | 42.7 / 44.4 | 39.8 / 41.7 |
| 24 | Chat-UniVi-v1.5 | PKU | 7B | 64 | 2024-06-15 | 40.6 / 45.9 | 45.7 / 51.2 | 40.3 / 44.6 | 35.8 / 41.8 |
| 25 | VideoChat2-Mistral | Shanghai AI Lab | 7B | 16 | 2024-06-15 | 39.5 / 43.8 | 48.3 / 52.8 | 37.0 / 39.4 | 33.2 / 39.2 |
| 26 | ShareGPT4Video | Shanghai AI Lab | 8B | 16 | 2024-06-17 | 39.9 / 43.6 | 48.3 / 53.6 | 36.3 / 39.3 | 35.0 / 37.9 |
| 27 | ST-LLM | PKU | 7B | 64 | 2024-06-15 | 37.9 / 42.3 | 45.7 / 48.4 | 36.8 / 41.4 | 31.3 / 36.9 |
| 28 | Qwen-VL-Chat | Alibaba | 7B | 4 | 2024-06-15 | 41.1 / 41.9 | 46.9 / 47.3 | 38.7 / 40.4 | 37.8 / 37.9 |
| 29 | Video-LLaVA | PKU | 7B | 8 | 2024-06-15 | 39.9 / 41.6 | 45.3 / 46.1 | 38.0 / 40.7 | 36.2 / 38.1 |

"-" indicates a closed-source model; a green date on the project page marks newly added or updated models.

1* The short and medium videos are sampled at 1 fps, while the long videos are sampled at 0.5 fps to ensure the stability of the API.
2* Videos shorter than 384 seconds are sampled at 1 fps; for videos longer than 384 seconds, 384 frames are extracted uniformly. All frames are resized to 512x512 resolution to fit within GPT-4o’s max context length.
3* Videos are sampled at 2 fps, with an upper limit of 768 frames.
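The frame-selection rules in footnotes 2* and 3* reduce to a simple policy: sample at a fixed fps until a frame budget is hit, then fall back to uniform extraction. Below is a minimal sketch of that policy; the function and parameter names are our own, not taken from the official evaluation code.

    def select_frame_times(duration_s, fps=1.0, max_frames=384):
        """Return timestamps (in seconds) at which to grab frames.

        Sample at `fps` while that stays within `max_frames`; otherwise fall
        back to extracting `max_frames` frames spaced uniformly over the video.
        Footnote 2* corresponds to fps=1, max_frames=384; footnote 3* to
        fps=2, max_frames=768.
        """
        n_at_fps = int(duration_s * fps)
        if n_at_fps <= max_frames:
            step = 1.0 / fps
            return [i * step for i in range(max(n_at_fps, 1))]
        step = duration_s / max_frames
        return [i * step for i in range(max_frames)]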

Benchmark

Data Examples

All data are newly collected and annotated by humans, not from any existing video dataset.

Benchmark Statistics


(Left) Video Category Hierarchy: Video-MME consists of 6 key domains and 30 subcategories of video types.
(Right) Video Duration and Task Type Distributions: Video-MME spans a full spectrum of video lengths and assesses various core abilities of MLLMs.

Benchmark Comparison


Analysis of certificate length (in seconds). Avg. V.L.: average video length; Med. C.L.: median certificate length; Avg. C.L.: average certificate length.
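Certificate length, in its usual usage, measures the span of video a human needs to watch to verify an answer. Given per-question annotated time spans, the reported statistics could be computed with a sketch like the one below; the span format and field names are assumptions for illustration only.

    import statistics

    def certificate_length_stats(spans):
        """spans: list of per-question certificate intervals, each a list of
        (start_s, end_s) segments that must be watched to verify the answer
        (format assumed for this illustration)."""
        lengths = [sum(end - start for start, end in segs) for segs in spans]
        return {
            "median_certificate_length_s": statistics.median(lengths),
            "average_certificate_length_s": statistics.mean(lengths),
        }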


The comparison of various benchmarks covers several key aspects: the total number of videos, the number of clips, the average video duration, the method of video annotation (manual, denoted M, or automated, denoted A), the average number of tokens per QA pair, the average number of subtitle tokens, whether the videos cover multiple duration levels, whether the videos are sourced from a broad range of open domains, and whether subtitles and audio are provided. Note that if a dataset includes multiple task formats, the comparison focuses solely on its multiple-choice portion.

Experiment Results

Different Question Types


Evaluation results of four representative MLLMs.

Different Video Duration Types

Evaluation results of Gemini 1.5 Pro.

Citation


    @article{fu2024video,
      title={Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis},
      author={Fu, Chaoyou and Dai, Yuhan and Luo, Yondong and Li, Lei and Ren, Shuhuai and Zhang, Renrui and Wang, Zihan and Zhou, Chenyu and Shen, Yunhang and Zhang, Mengdan and others},
      journal={arXiv preprint arXiv:2405.21075},
      year={2024}
    }