FlagEval-EmbodiedVerse

Welcome to FlagEval-EmbodiedVerse! FlagEval-EmbodiedVerse tracks, ranks, and evaluates embodied models through the FlagEval embodied toolchain. FlagEvalMM provides the multimodal evaluation framework, EmbodiedVerse builds a capability system on top of high-quality evaluation datasets for embodied intelligence, and the Leaderboard tracks and presents the overall capabilities of different embodied models in real time.


gemini-2.5-pro-preview-05-06	61.18	42.38	68.21	84.59	59.87	78.74	51.78	47.81	79.33	42.85	56.25
o4-mini-2025-04-16	59.18	26.59	77.31	85.21	51.25	78.29	60.18	41.96	82	41.11	47.87
InternVL3-78B	58.63	19.71	77.75	85.69	51.49	81.21	55.63	47.07	66	57.76	44
gemini-2.5-flash-preview-04-17	58.17	28.6	71.68	84.03	54.1	74.75	56.66	48.83	74	37.09	52
Qwen2.5-VL-32B-Instruct	57.05	22.52	71.39	81.59	52.16	74.45	53.61	36.07	80	56.25	42.5
gemini-2.0-pro-exp-02-05	56.36	37.51	60.55	82.85	58.68	75.38	59.43	46.96	57.33	43.3	41.6
Qwen2.5-VL-72B-Instruct	55.73	39.92	72.11	82.68	48.33	73.3	54.64	35.51	58.67	53.75	38.35
claude-sonnet-4-20250514-official	55.32	25.63	72.98	78.43	51.26	64.26	51.31	47.02	75.33	41.26	45.75
gpt-4o-2024-11-20	54.2	20.41	73.55	78.63	44.42	71.92	54.27	43.6	66.67	41.79	46.75
InternVL2_5-38B	53.22	14.96	70.66	82.07	42.35	78.96	53.28	41.45	68	40.2	40.25
InternVL3-8B	51.79	9.19	66.91	83.06	44.36	74.89	49.77	40.99	63.33	42.92	42.5
Mistral-Small-3.1-24B-Instruct-2503	48.82	16.28	67.49	76.85	47.49	67.47	48.26	38.95	56	29.67	39.75
MiniCPM-o-2_6	48.08	10.38	61.99	74.5	50.59	65.91	49.72	34.34	64.67	29.75	39
VeBrain	46.91	11.34	65.03	78.57	42.48	70.52	47.37	26.3	58	31.79	37.75
Qwen2.5-VL-7B-Instruct	46.82	7.61	66.76	79.53	49.33	70.25	48.12	16.02	52	41.86	36.75
Qwen2-VL-7B-Instruct	46.14	2.52	60.55	77.44	39.97	64.53	43.67	28.21	68	40.8	35.75
Cosmos-Reason1-7B	43.34	5.51	52.75	74.71	38.81	65.22	48.26	25.64	60.67	26.87	35
Qwen2.5-VL-3B-Instruct	42.53	11	57.8	71.02	40.1	61.92	43.15	23.67	46.67	33.61	36.34
Qwen2-VL-2B-Instruct	40.2	6.33	45.52	66.82	37.86	53.27	44.81	20	66	35.81	25.56
Magma-8B	39.3	10.89	51.73	60.98	33.71	64.59	41.42	12.65	71.33	18.47	27.25
Molmo-7B-D-0924	39.08	9.64	49.71	63.02	31.44	62.01	46.44	29.41	32	30.43	36.75
Phi-4-multimodal-instruct	32.32	3.04	49.42	56.26	12.16	45.47	38.41	19.16	45.33	23.09	30.83
gemini-2.5-flash_0918	0	0	0	0	0	0	0	0	0	0	0
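
The column headers for the rows above are not preserved here. As a consistency check, the first score after each model name appears to be the unweighted mean of the ten scores that follow it. The sketch below checks this for two rows. The reading of the columns is an assumption inferred from the numbers, not something the leaderboard states, and the helper names are illustrative.

```python
# Scores copied from two leaderboard rows: the first value, then the ten that follow.
rows = {
    "gemini-2.5-pro-preview-05-06": (
        61.18,
        [42.38, 68.21, 84.59, 59.87, 78.74, 51.78, 47.81, 79.33, 42.85, 56.25],
    ),
    "o4-mini-2025-04-16": (
        59.18,
        [26.59, 77.31, 85.21, 51.25, 78.29, 60.18, 41.96, 82, 41.11, 47.87],
    ),
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


def matches_reported(reported, values, tol=0.01):
    """True if the mean of `values` agrees with `reported` up to rounding."""
    return abs(mean(values) - reported) <= tol


for name, (reported, scores) in rows.items():
    # Assumption: the first column is the plain average of the remaining ten.
    print(name, round(mean(scores), 2), matches_reported(reported, scores))
```

Both rows agree with the reported first column to within rounding, which supports reading it as an overall average.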



The Goal of FlagEval - EmbodiedVerse

Thank you for actively participating in the evaluation. We will continue to improve FlagEval-EmbodiedVerse and keep its ecosystem open. We welcome developers to join the discussion of evaluation methodology, tools, and datasets so that together we can build a more scientific and open embodied evaluation toolchain.

Context

FlagEval-EmbodiedVerse is a scientific and comprehensive embodied evaluation toolchain. It consists of the FlagEvalMM multimodal evaluation framework, the EmbodiedVerse high-quality embodied intelligence evaluation dataset, and the Leaderboard for visualizing the capabilities of embodied models.

We hope to promote a more open ecosystem in which embodied model developers can participate and contribute to the advancement of embodied models. To keep comparisons fair, all models are evaluated under the FlagEvalMM framework using standardized GPUs and a unified environment.

EmbodiedVerse Open

EmbodiedVerse-Open is a meta-dataset composed of 10 datasets for comprehensively evaluating models in embodied intelligence scenarios.

Datasets:

Where2Place: A collection of 100 real-world images from diverse cluttered environments, each annotated with a sentence describing a desired free space and a corresponding mask. It is designed to evaluate free-space referencing using spatial relations.

Blink: A set of visual problems that humans can solve easily. EmbodiedVerse samples the categories related to spatial understanding (Counting, Relative_Depth, Spatial_Relation, Multi-view_Reasoning, Visual_Correspondence).

CVBench: A vision-centric benchmark containing 2,638 manually inspected examples.

RoboSpatial-Home: A new spatial reasoning benchmark designed to evaluate vision-language models (VLMs) in real-world indoor environments for robotics.

EmbspatialBench: A benchmark for evaluating the embodied spatial understanding of large vision-language models (LVLMs). It is automatically derived from embodied scenes and covers 6 spatial relationships from an egocentric perspective.

All-Angles Bench: A multi-view understanding benchmark with over 2,100 human-annotated multi-view QA pairs across 90 real-world scenes.

VSI-Bench: A video-based benchmark that constructs questions from egocentric videos of real indoor scenes to evaluate the visual-spatial intelligence of multimodal large models. EmbodiedVerse uses a tiny subset containing 400 questions.

SAT: A challenging real-image dynamic spatial benchmark.

EgoPlan-Bench2: A benchmark of everyday tasks spanning 4 major domains and 24 detailed scenarios, closely aligned with human daily life.

ERQA: An evaluation benchmark covering a variety of topics related to spatial reasoning and world knowledge, focused on real-world scenarios, particularly in the context of robotics.

The abbreviations for the evaluation metrics are listed below:

Perception

Perception_Visual Grounding(P_VG)
Perception_Counting(P_C)
Perception_State & Activity Understanding

SpatialReasoning

SpatialReasoning_Dynamic(SR_D)
SpatialReasoning_Relative direction(SR_Rd)
SpatialReasoning_Multi-view matching(SR_Mm)
SpatialReasoning_Relative distance(SR_Rd)
SpatialReasoning_Depth estimation(SR_De)
SpatialReasoning_Relative shape(SR_Rs)
SpatialReasoning_Size estimation(SR_Se)

Prediction

Prediction_Trajectory(P_T)
Prediction_Future prediction(P_Fd)

Planning

Planning_Goal Decomposition(P_GD)
Planning_Navigation(P_N)

EmbodiedVerse Open Sample

We categorized the data from the 10 datasets above by capability and identified four major capability dimensions required in embodied intelligence scenarios: spatial reasoning, perception, prediction, and planning. Following these dimensions, we sampled a high-quality subset of 2,042 examples. The capability dimensions and the data volume of each are as follows:

Capability Dimension	Sub-capability Dimension	Data Volume	Percentage (within dimension)
Spatial Reasoning	Dynamic	200	18.43%
Spatial Reasoning	Relative direction	200	18.43%
Spatial Reasoning	Multi-view matching	200	18.43%
Spatial Reasoning	Relative distance	200	18.43%
Spatial Reasoning	Depth estimation	107	9.86%
Spatial Reasoning	Relative shape	82	7.56%
Spatial Reasoning	Size estimation	96	8.85%
Perception	Visual Grounding	200	44.64%
Perception	Counting	200	44.64%
Perception	State & Activity Understanding	48	10.71%
Prediction	Trajectory	188	76.73%
Prediction	Future prediction	57	23.27%
Planning	Goal Decomposition	200	75.76%
Planning	Navigation	64	24.24%
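
The percentages in the table are shares within each capability dimension rather than shares of the full 2,042-sample subset. A minimal sketch that recomputes them from the data volumes follows; the dictionary layout and the function name are illustrative, not part of the toolchain.

```python
# Data volumes copied from the table above, grouped by capability dimension.
counts = {
    "Spatial Reasoning": {
        "Dynamic": 200,
        "Relative direction": 200,
        "Multi-view matching": 200,
        "Relative distance": 200,
        "Depth estimation": 107,
        "Relative shape": 82,
        "Size estimation": 96,
    },
    "Perception": {
        "Visual Grounding": 200,
        "Counting": 200,
        "State & Activity Understanding": 48,
    },
    "Prediction": {"Trajectory": 188, "Future prediction": 57},
    "Planning": {"Goal Decomposition": 200, "Navigation": 64},
}


def within_dimension_share(dimension, sub_capability):
    """Percentage of a dimension's samples that belong to one sub-capability."""
    total = sum(counts[dimension].values())
    return round(100 * counts[dimension][sub_capability] / total, 2)


# The four dimensions together account for the whole sampled subset.
total_samples = sum(sum(subs.values()) for subs in counts.values())
print(total_samples)
print(within_dimension_share("Spatial Reasoning", "Depth estimation"))
print(within_dimension_share("Prediction", "Trajectory"))
```

The dimension totals are 1,085, 448, 245, and 264 samples, which sum to the 2,042 samples stated above.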

EmbodiedVerse Tool - FlagEvalMM

FlagEvalMM is an open-source evaluation framework designed to comprehensively assess multimodal models. It provides a standardized way to evaluate models that work with multiple modalities (text, images, video) across various tasks and metrics.

Flexible Architecture: Supports multiple multimodal models and evaluation tasks, including VQA, image retrieval, and text-to-image.
Comprehensive Benchmarks and Metrics: Supports both new and commonly used benchmarks and metrics.
Extensive Model Support: The model_zoo provides inference support for a wide range of popular multimodal models, including Qwen-VL and LLaVA. It also offers seamless integration with API-based models such as GPT, Claude, and Hunyuan.
Extensible Design: Easily extended to incorporate new models, benchmarks, and evaluation metrics.

Details and logs

You can find:

detailed numerical results in the results Hugging Face dataset: https://huggingface.co/datasets/open-cn-llm-leaderboard/EmbodiedVerse_results
community queries and running status in the requests Hugging Face dataset: https://huggingface.co/datasets/open-cn-llm-leaderboard/EmbodiedVerse_request

Useful links

FlagEvalMM: https://github.com/flageval-baai/FlagEvalMM

FlagEval: https://flageval.baai.ac.cn/#/home

VLM Leaderboard: https://huggingface.co/spaces/BAAI/open_flageval_vlm_leaderboard

Evaluation Queue for the FlagEval VLM Leaderboard

Models added here will be automatically evaluated on the FlagEval cluster.

We currently offer two methods for model evaluation: API calls and private deployments.

If you evaluate via API call, provide the model's API URL, the model name, and the corresponding API key.
If you evaluate an open-source model directly through Hugging Face, leave the Model online api url and Model online api key fields empty.

Open API model Integration Documentation

For models accessed via API calls (such as OpenAI API, Anthropic API, etc.), the integration process is straightforward and only requires providing necessary configuration information.

model_name: Name of the model to use
api_key: API access key
api_base: Base URL for the API service
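
As an illustration, the three fields could be collected and checked before submission as in the sketch below. The dictionary layout, the `validate_api_config` helper, and the placeholder values are all hypothetical; the platform's actual submission form may differ.

```python
REQUIRED_FIELDS = ("model_name", "api_key", "api_base")


def validate_api_config(config):
    """Return the required fields that are missing or blank in `config`."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = config.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Placeholder values only; substitute your own model name, key, and endpoint.
example_config = {
    "model_name": "my-model",
    "api_key": "YOUR_API_KEY",
    "api_base": "https://api.example.com/v1",
}

missing = validate_api_config(example_config)
print("missing fields:", missing)
```

Checking for blank values as well as absent keys catches the common case of a form field that was left empty.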

Adding a Custom Model to the Platform

This guide explains how to integrate your custom model into the platform by implementing a model adapter and a run.sh script. We'll use the Qwen-VL implementation as a reference example.

Overview

To add your custom model, you need to:

Create a custom dataset class
Implement a model adapter class
Set up the initialization and inference pipeline

Step-by-Step Implementation

Here is an example: Qwen-VL model_adapter.py

1. Create a Custom Preprocessing Dataset Class

Inherit from ServerDataset to handle data loading:

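# model_adapter.py
# Import paths are assumed from the FlagEvalMM repository layout; verify them against your checkout.
from flagevalmm.server import ServerDataset
from flagevalmm.server.utils import process_images_symbol


class CustomDataset(ServerDataset):
    def __getitem__(self, index):
        # Fetch one sample (see the structure returned by get_data below)
        data = self.get_data(index)
        question_id = data["question_id"]
        img_path = data["img_path"]
        qs = data["question"]
        # Process the image placeholders; idx holds the referenced image indices
        qs, idx = process_images_symbol(qs)
        img_path_idx = []
        # Deduplicate the indices and iterate in ascending order for deterministic output
        for i in sorted(set(idx)):
            if i < len(img_path):
                img_path_idx.append(img_path[i])
            else:
                print("[warning] image index out of range")
        return question_id, img_path_idx, qs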

The function get_data returns a structure like this:

{
    "img_path": "A list where each element is an absolute path to an image that can be read directly using PIL, cv2, etc.",
    "question": "A string containing the question, where image positions are marked with <image1> <image2>",
    "question_id": "question_id",
    "type": "A string indicating the type of question"
}
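
To make the placeholder convention concrete, the sketch below extracts zero-based image indices from a question string and selects the matching paths. It is a hypothetical stand-in written for illustration, not the `process_images_symbol` helper used above, whose exact behavior is not described here. The function name, the sample values, and the assumption that `<image1>` refers to the first entry of `img_path` are all illustrative.

```python
import re

# Matches placeholders such as <image1>, <image2>, ...
IMAGE_TAG = re.compile(r"<image(\d+)>")


def find_image_indices(question):
    """Return zero-based image indices in order of first appearance."""
    indices = []
    for match in IMAGE_TAG.finditer(question):
        index = int(match.group(1)) - 1  # assumption: <image1> maps to img_path[0]
        if index not in indices:
            indices.append(index)
    return indices


# A made-up sample shaped like the structure returned by get_data.
sample = {
    "img_path": ["/data/a.jpg", "/data/b.jpg"],
    "question": "<image1> <image2> Which image shows the cup closer to the camera?",
    "question_id": "q-001",
    "type": "multiple-choice",
}

# Keep only indices that point at an existing image path.
selected = [
    sample["img_path"][i]
    for i in find_image_indices(sample["question"])
    if 0 <= i < len(sample["img_path"])
]
print(selected)
```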

2. Implement Model Adapter

Inherit from BaseModelAdapter and implement the required methods:

model_init: initializes the model and serves as the entry point for model loading and setup.
run_one_task: implements the inference pipeline, handling data processing and result generation for a single evaluation task.

# model_adapter.py
from typing import Any, Dict

# Import path is assumed from the FlagEvalMM repository layout; verify it against your checkout.
from flagevalmm.models.base_model_adapter import BaseModelAdapter


class ModelAdapter(BaseModelAdapter):
    def model_init(self, task_info: Dict):
        ckpt_path = task_info["model_path"]
        # Initialize the model and processor here.
        # Load your pre-trained model and any required processing tools
        # from the checkpoint path above.

    def run_one_task(self, task_name: str, meta_info: Dict[str, Any]):
        results = []

        data_loader = self.create_data_loader(
            CustomDataset, task_name, batch_size=1, num_workers=0
        )

        for question_id, img_path, qs in data_loader:
            # Perform model inference here: use the model to generate `answer`
            # for the given inputs (question_id, image paths, question).
            answer = ""  # placeholder; replace with your model's output

            results.append(
                {"question_id": question_id, "answer": answer}
            )

        # Save the inference results. meta_info and rank control how results are stored.
        rank = 0  # placeholder assuming a single process; use the real process index when running distributed
        self.save_result(results, meta_info, rank=rank)

Note:

results is a list of dictionaries.
Each dictionary must contain two keys:

{
    "question_id": "identifies the specific question",
    "answer": "contains the model's prediction/output"
}

Once all results are collected, save them with save_result().
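
A small check like the following can catch malformed entries before they are saved. The `check_results` helper and the sample entries are illustrative and not part of FlagEvalMM.

```python
REQUIRED_KEYS = {"question_id", "answer"}


def check_results(results):
    """Raise ValueError if any entry is not a dict with the two required keys."""
    for position, entry in enumerate(results):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {position} is not a dictionary")
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {position} is missing keys: {sorted(missing)}")
    return True


# Made-up results in the expected shape.
results = [
    {"question_id": "q-001", "answer": "A"},
    {"question_id": "q-002", "answer": "The cup is left of the plate."},
]
print(check_results(results))
```

Calling such a check just before save_result() reports the position of the first bad entry instead of failing later during scoring.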

3. Launch Script (`run.sh`)

run.sh is the entry script for launching model evaluation, used to set environment variables and start the evaluation program.

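#!/bin/bash
# Resolve the directory containing this script so model_adapter.py can be found
current_file="$0"
current_dir="$(dirname "$current_file")"
# The evaluation server's address is passed as the first two arguments
SERVER_IP=$1
SERVER_PORT=$2
# Forward any remaining arguments to the adapter unchanged
PYTHONPATH="$current_dir:$PYTHONPATH" python "$current_dir/model_adapter.py" \
    --server_ip "$SERVER_IP" \
    --server_port "$SERVER_PORT" \
    "${@:3}"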


gemini-2.5-flash_0918	main	false	float16	Original	FINISHED



✨ Submit your model info here!

Model Name
Revision commit

📧 Submit your API info here! (API only)

Model online api url
Model online api key
Online api model name

📧 Submit your inference files here! (inference only)

run.sh file
model_adapter.py file


gemini-2.5-pro-preview-05-06	46.15	40.18	32.5	47	43.75	62.86	50.5	73.5	53	60.5	60.75	76.83	82.29	44.08	39.36	59.65	37.5	35	45.31
o4-mini-2025-04-16	44.73	29.91	16	41	41.67	63.04	51.5	67.5	65	65	57.01	62.2	77.08	52.24	53.19	49.12	33.71	35	29.69
InternVL3-78B	44.74	34.6	15	51	47.92	58.8	42.5	70	52.5	61	55.14	59.76	81.25	36.33	32.45	49.12	49.24	55.5	29.69
gemini-2.5-flash-preview-04-17	46.72	38.17	25	49	47.92	59.35	52	67	53	58	52.34	69.51	73.96	52.24	47.87	66.67	37.12	32	53.12
Qwen2.5-VL-32B-Instruct	43.18	28.12	12	41	41.67	55.02	46.5	63	49.5	47	56.07	69.51	70.83	39.59	34.04	57.89	50	52.5	42.19
gemini-2.0-pro-exp-02-05	41.45	35.94	32	38	43.75	51.43	39	63	42	54	44.86	80.49	50	42.45	38.83	54.39	35.98	37.5	31.25
Qwen2.5-VL-72B-Instruct	42.25	34.15	24	41.5	45.83	50.05	39	62.5	46	51.5	46.73	52.44	54.17	41.63	35.64	61.4	43.18	46.5	32.81
claude-sonnet-4-20250514-official	41.45	35.94	32	38	43.75	51.43	39	63	42	54	44.86	80.49	50	42.45	38.83	54.39	35.98	37.5	31.25
gpt-4o-2024-11-20	42.16	30.36	14	42.5	47.92	54.56	43.5	57	53.5	57	53.27	58.54	67.71	47.35	43.62	59.65	36.36	36.5	35.94
InternVL2_5-38B	39.24	29.91	11	45.5	43.75	53.73	41.5	66	45.5	54.5	49.53	46.34	80.21	39.59	41.49	33.33	33.71	31	42.19
InternVL3-8B	35.56	25	9	38.5	35.42	49.49	37	66.5	35	48.5	51.4	45.12	73.96	31.02	30.85	31.58	36.74	33	48.44
Mistral-Small-3.1-24B-Instruct-2503	34.06	21.88	9.5	32	31.25	50.51	37.5	59	48.5	43	47.66	60.98	73.96	40	32.45	64.91	23.86	24	23.44
MiniCPM-o-2_6	30.63	19.64	9	27.5	31.25	41.57	34.5	41	31	37.5	41.12	65.85	67.71	36.33	35.64	38.6	25	22	34.38
VeBrain	25.19	22.32	8	34	33.33	0	0	0	0	0	0	0	0	29.39	32.45	19.3	23.86	22	29.69
Qwen2.5-VL-7B-Instruct	29.63	20.31	7	30	35.42	38.16	29	54.5	43	35.5	29.91	62.2	7.29	34.29	31.38	43.86	25.76	28.5	17.19
Qwen2-VL-7B-Instruct	28.53	17.63	4	28	31.25	39.35	42.5	41.5	26.5	38.5	42.99	34.15	57.29	25.31	26.06	22.81	31.82	32.5	29.69
Cosmos-Reason1-7B	22.56	21.43	7.5	32.5	33.33	0	0	0	0	0	0	0	0	26.94	29.79	17.54	19.32	17.5	25
Qwen2.5-VL-3B-Instruct	26.44	17.86	6	28.5	22.92	34.01	36	31.5	26	32	39.25	41.46	43.75	27.76	29.26	22.81	26.14	24	32.81
Qwen2-VL-2B-Instruct	28.01	11.61	5	17	16.67	37.42	43.5	40	25.5	37	55.14	39.02	23.96	38.78	45.21	17.54	24.24	22	31.25
Magma-8B	27.48	21.21	11	32	18.75	34.29	44	45.5	31.5	37.5	29.91	21.95	5.21	30.2	36.7	8.77	24.24	21.5	32.81
Molmo-7B-D-0924	32.69	21.43	5	34.5	35.42	42.21	41	43.5	33.5	42.5	51.4	41.46	50	43.27	46.28	33.33	23.86	20.5	34.38
Phi-4-multimodal-instruct	21.68	18.3	6	28.5	27.08	25.35	29.5	22	25.5	30	14.02	24.39	27.08	24.9	25.53	22.81	18.18	14	31.25
gemini-2.5-flash_0918	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
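
The rows above appear to follow the order of the abbreviation list: an overall average, then each capability dimension's score followed by its sub-capability scores. Under that reading, each dimension score matches the sample-weighted mean of its sub-capability scores (using the data volumes from the EmbodiedVerse Open Sample table), and the overall average matches the unweighted mean of the four dimension scores. The sketch below checks this for the gemini-2.5-pro-preview-05-06 row. The column interpretation is an inference from the numbers, not something the leaderboard states.

```python
# (sub-capability score, sample count) pairs for gemini-2.5-pro-preview-05-06.
dimensions = {
    "Perception": [(32.5, 200), (47, 200), (43.75, 48)],
    "Spatial Reasoning": [
        (50.5, 200), (73.5, 200), (53, 200), (60.5, 200),
        (60.75, 107), (76.83, 82), (82.29, 96),
    ],
    "Prediction": [(39.36, 188), (59.65, 57)],
    "Planning": [(35, 200), (45.31, 64)],
}

# Values reported in the same row for the dimensions and the overall average.
reported = {
    "Perception": 40.18,
    "Spatial Reasoning": 62.86,
    "Prediction": 44.08,
    "Planning": 37.5,
    "Average": 46.15,
}


def weighted_mean(pairs):
    """Mean of the scores, weighted by each sub-capability's sample count."""
    total = sum(count for _, count in pairs)
    return sum(score * count for score, count in pairs) / total


dimension_scores = {name: weighted_mean(pairs) for name, pairs in dimensions.items()}
# Assumption: the overall average weights the four dimensions equally.
overall = sum(dimension_scores.values()) / len(dimension_scores)

for name, score in dimension_scores.items():
    print(name, round(score, 2), reported[name])
print("Average", round(overall, 2), reported["Average"])
```

Because the four dimensions are weighted equally in this reading, a dimension with few samples, such as Prediction with 245, influences the overall average as much as Spatial Reasoning with 1,085.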


