MEGA-Bench Leaderboard

🚀 Introduction

MEGA-Bench is a comprehensive benchmark scaling multimodal evaluation to 500+ real-world tasks!

We aim to provide cost-effective and accurate evaluation for multimodal models, covering a wide range of real-world tasks. Instead of running models on dozens of benchmarks, MEGA-Bench delivers a comprehensive performance report in a single benchmark.

🧐 Highlights of MEGA-Bench

  • 505 diverse tasks evaluating multimodal models across 8 grand application types, 7 input visual formats, 6 output formats, and 10 general multimodal skills, covering single-image, multi-image, and video tasks
  • Moves beyond multiple-choice questions, offering diverse output formats such as numbers, code, LaTeX, phrases, free-form responses, and more. We developed 45 customized metrics to accurately evaluate these diverse outputs
  • Focuses on task diversity rather than repetitive examples, ensuring cost-efficient evaluation
  • Provides fine-grained capability reports across application type, input/output formats, and required skills

🔨 Systematic Annotation Process

  • Guided by an initial application-driven taxonomy tree
  • 16 expert annotators contributing to a 2-round process to develop 505 tasks
  • Utilizes advanced tools for task design, review, and quality control
  • Ensures high-quality data through continuous refinement and balanced task distribution

📊🔍 Results & Takeaways from Evaluating Top Models

๏ธโ€๐Ÿ”ฅ๐Ÿ“ 2025.01

  • Gemini 2.0 Experimental (1206) and Gemini 2.0 Flash Experimental outperform GPT-4o and Claude 3.5 Sonnet.
  • We added Grok-2-vision-1212 to the single-image leaderboard. The model appears to consume many tokens per image, so it cannot run many of our multi-image and video tasks.
  • We will evaluate the o1-series models once budget allows.

๐Ÿ“ 2024.11

  • GPT-4o (0513) and Claude 3.5 Sonnet (1022) lead the benchmark. Claude 3.5 Sonnet (1022) shows clear gains over Claude 3.5 Sonnet (0620), particularly in planning tasks (application dimension) and UI/Infographics inputs (input format dimension).
  • Qwen2-VL stands out among open-source models, and its flagship model approaches the performance of some proprietary flagship models
  • Chain-of-Thought (CoT) prompting improves proprietary models but has limited impact on open-source models
  • Gemini 1.5 Flash performs best among the evaluated efficiency models, but struggles with UI and document tasks
  • Many open-source models face challenges in adhering to output format instructions

🎯 Interactive Visualization

Visit our project page to explore the interactive task taxonomy and radar maps, which offer deep insights into model capabilities across multiple dimensions and a comprehensive breakdown far beyond single-score evaluations.

📚 More Information

Table 1: MEGA-Bench full results. The number in parentheses is the number of tasks for each keyword.
The Core set contains $N_{\text{core}} = 440$ tasks evaluated by rule-based metrics, and the Open-ended set contains $N_{\text{open}} = 65$ tasks evaluated by a VLM judge (we use GPT-4o-0806).
Unlike the results in our paper, we use only the Core results with CoT prompting here, for clarity and compatibility with the released data.
$\text{Overall} = \frac{\text{Core} \cdot N_{\text{core}} + \text{Open-ended} \cdot N_{\text{open}}}{N_{\text{core}} + N_{\text{open}}}$
* indicates self-reported results from the model authors.
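For reference, here is a minimal Python sketch (our own illustration, not part of the official evaluation code) of how the Overall column follows from the formula above. The scores passed in are hypothetical, not actual leaderboard values; we assume scores are on a 0-100 scale as shown in the table.

```python
# Minimal sketch of the Overall score: a task-count-weighted average
# of the Core and Open-ended subset scores.

N_CORE = 440  # tasks evaluated with rule-based metrics
N_OPEN = 65   # tasks evaluated by the VLM judge (GPT-4o-0806)

def overall_score(core: float, open_ended: float) -> float:
    """Task-count-weighted average of the Core and Open-ended scores."""
    return (core * N_CORE + open_ended * N_OPEN) / (N_CORE + N_OPEN)

# Hypothetical example scores (not actual leaderboard values):
print(round(overall_score(60.0, 50.0), 2))  # -> 58.71
```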

Select a dimension to display breakdown results. We use different column colors to distinguish the overall benchmark scores and breakdown results.