FlagEval-Embodied Verse
Welcome to FlagEval-Embodied Verse! FlagEval-Embodied Verse tracks, ranks, and evaluates embodied models through the FlagEval embodied toolchain. FlagEvalMM provides the multimodal evaluation framework, Embodied Verse builds a capability system on top of high-quality evaluation datasets for embodied intelligence, and the Leaderboard tracks and presents the overall capabilities of different embodied models in real time.
61.18 | 42.38 | 68.21 | 84.59 | 59.87 | 78.74 | 51.78 | 47.81 | 79.33 | 42.85 | 56.25 | float16 | 0 | gemini-2.5-pro-preview-05-06
59.18 | 26.59 | 77.31 | 85.21 | 51.25 | 78.29 | 60.18 | 41.96 | 82 | 41.11 | 47.87 | float16 | 0 | o4-mini-2025-04-16
58.63 | 19.71 | 77.75 | 85.69 | 51.49 | 81.21 | 55.63 | 47.07 | 66 | 57.76 | 44 | float16 | 0 | InternVL3-78B
58.17 | 28.6 | 71.68 | 84.03 | 54.1 | 74.75 | 56.66 | 48.83 | 74 | 37.09 | 52 | float16 | 0 | gemini-2.5-flash-preview-04-17
57.05 | 22.52 | 71.39 | 81.59 | 52.16 | 74.45 | 53.61 | 36.07 | 80 | 56.25 | 42.5 | float16 | 0 | Qwen2.5-VL-32B-Instruct
56.36 | 37.51 | 60.55 | 82.85 | 58.68 | 75.38 | 59.43 | 46.96 | 57.33 | 43.3 | 41.6 | float16 | 0 | gemini-2.0-pro-exp-02-05
55.73 | 39.92 | 72.11 | 82.68 | 48.33 | 73.3 | 54.64 | 35.51 | 58.67 | 53.75 | 38.35 | float16 | 0 | Qwen2.5-VL-72B-Instruct
55.32 | 25.63 | 72.98 | 78.43 | 51.26 | 64.26 | 51.31 | 47.02 | 75.33 | 41.26 | 45.75 | float16 | 0 | claude-sonnet-4-20250514-official
54.2 | 20.41 | 73.55 | 78.63 | 44.42 | 71.92 | 54.27 | 43.6 | 66.67 | 41.79 | 46.75 | float16 | 0 | gpt-4o-2024-11-20
53.22 | 14.96 | 70.66 | 82.07 | 42.35 | 78.96 | 53.28 | 41.45 | 68 | 40.2 | 40.25 | float16 | 0 | InternVL2_5-38B
51.79 | 9.19 | 66.91 | 83.06 | 44.36 | 74.89 | 49.77 | 40.99 | 63.33 | 42.92 | 42.5 | float16 | 0 | InternVL3-8B
48.82 | 16.28 | 67.49 | 76.85 | 47.49 | 67.47 | 48.26 | 38.95 | 56 | 29.67 | 39.75 | float16 | 0 | Mistral-Small-3.1-24B-Instruct-2503
48.08 | 10.38 | 61.99 | 74.5 | 50.59 | 65.91 | 49.72 | 34.34 | 64.67 | 29.75 | 39 | float16 | 0 | MiniCPM-o-2_6
46.82 | 7.61 | 66.76 | 79.53 | 49.33 | 70.25 | 48.12 | 16.02 | 52 | 41.86 | 36.75 | float16 | 0 | Qwen2.5-VL-7B-Instruct
46.14 | 2.52 | 60.55 | 77.44 | 39.97 | 64.53 | 43.67 | 28.21 | 68 | 40.8 | 35.75 | float16 | 0 | Qwen2-VL-2B-Instruct
42.53 | 11 | 57.8 | 71.02 | 40.1 | 61.92 | 43.15 | 23.67 | 46.67 | 33.61 | 36.34 | float16 | 0 | Qwen2.5-VL-3B-Instruct
39.3 | 10.89 | 51.73 | 60.98 | 33.71 | 64.59 | 41.42 | 12.65 | 71.33 | 18.47 | 27.25 | float16 | 0 | Magma-8B
39.08 | 9.64 | 49.71 | 63.02 | 31.44 | 62.01 | 46.44 | 29.41 | 32 | 30.43 | 36.75 | float16 | 0 | Molmo-7B-D-0924
32.32 | 3.04 | 49.42 | 56.26 | 12.16 | 45.47 | 38.41 | 19.16 | 45.33 | 23.09 | 30.83 | float16 | 0 | Phi-4-multimodal-instruct
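The rows above carry no column headers. One reading that the numbers support, offered here as an observation rather than something the page states, is that the first score in each row is the unweighted mean of the ten scores that follow it. The Python sketch below checks this for three rows copied from the table; the helper name `leading_scores` is hypothetical and not part of any FlagEval tool.

```python
def leading_scores(row: str) -> list[float]:
    """Return the first consecutive run of numeric cells in a pipe-separated row."""
    scores: list[float] = []
    for cell in (c.strip() for c in row.split("|")):
        try:
            scores.append(float(cell))
        except ValueError:
            if scores:  # the first non-numeric cell after the run ends it
                break
    return scores


# Three rows copied from the table above.
rows = [
    "61.18 | 42.38 | 68.21 | 84.59 | 59.87 | 78.74 | 51.78 | 47.81 | 79.33 | 42.85 | 56.25 | float16 | 0 | gemini-2.5-pro-preview-05-06",
    "59.18 | 26.59 | 77.31 | 85.21 | 51.25 | 78.29 | 60.18 | 41.96 | 82 | 41.11 | 47.87 | float16 | 0 | o4-mini-2025-04-16",
    "58.63 | 19.71 | 77.75 | 85.69 | 51.49 | 81.21 | 55.63 | 47.07 | 66 | 57.76 | 44 | float16 | 0 | InternVL3-78B",
]

# Pairs of (listed first score, mean recomputed from the remaining ten).
checks = []
for row in rows:
    listed, *rest = leading_scores(row)
    recomputed = sum(rest) / len(rest)
    checks.append((listed, recomputed))
    print(f"listed {listed:.2f}  recomputed {recomputed:.2f}")
```

The helper skips any non-numeric leading cell and stops at the first non-numeric cell after the scores, so the precision and model-name columns are ignored.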
46.15 | 40.18 | 32.5 | 47 | 43.75 | 62.86 | 50.5 | 73.5 | 53 | 60.5 | 60.75 | 76.83 | 82.29 | 44.08 | 39.36 | 59.65 | 37.5 | 35 | 45.31 | float16 | 0 | gemini-2.5-pro-preview-05-06
44.73 | 29.91 | 16 | 41 | 41.67 | 63.04 | 51.5 | 67.5 | 65 | 65 | 57.01 | 62.2 | 77.08 | 52.24 | 53.19 | 49.12 | 33.71 | 35 | 29.69 | float16 | 0 | o4-mini-2025-04-16
44.74 | 34.6 | 15 | 51 | 47.92 | 58.8 | 42.5 | 70 | 52.5 | 61 | 55.14 | 59.76 | 81.25 | 36.33 | 32.45 | 49.12 | 49.24 | 55.5 | 29.69 | float16 | 0 | InternVL3-78B
46.72 | 38.17 | 25 | 49 | 47.92 | 59.35 | 52 | 67 | 53 | 58 | 52.34 | 69.51 | 73.96 | 52.24 | 47.87 | 66.67 | 37.12 | 32 | 53.12 | float16 | 0 | gemini-2.5-flash-preview-04-17
43.18 | 28.12 | 12 | 41 | 41.67 | 55.02 | 46.5 | 63 | 49.5 | 47 | 56.07 | 69.51 | 70.83 | 39.59 | 34.04 | 57.89 | 50 | 52.5 | 42.19 | float16 | 0 | Qwen2.5-VL-32B-Instruct
41.45 | 35.94 | 32 | 38 | 43.75 | 51.43 | 39 | 63 | 42 | 54 | 44.86 | 80.49 | 50 | 42.45 | 38.83 | 54.39 | 35.98 | 37.5 | 31.25 | float16 | 0 | gemini-2.0-pro-exp-02-05
42.25 | 34.15 | 24 | 41.5 | 45.83 | 50.05 | 39 | 62.5 | 46 | 51.5 | 46.73 | 52.44 | 54.17 | 41.63 | 35.64 | 61.4 | 43.18 | 46.5 | 32.81 | float16 | 0 | Qwen2.5-VL-72B-Instruct
41.45 | 35.94 | 32 | 38 | 43.75 | 51.43 | 39 | 63 | 42 | 54 | 44.86 | 80.49 | 50 | 42.45 | 38.83 | 54.39 | 35.98 | 37.5 | 31.25 | float16 | 0 | claude-sonnet-4-20250514-official
42.16 | 30.36 | 14 | 42.5 | 47.92 | 54.56 | 43.5 | 57 | 53.5 | 57 | 53.27 | 58.54 | 67.71 | 47.35 | 43.62 | 59.65 | 36.36 | 36.5 | 35.94 | float16 | 0 | gpt-4o-2024-11-20
39.24 | 29.91 | 11 | 45.5 | 43.75 | 53.73 | 41.5 | 66 | 45.5 | 54.5 | 49.53 | 46.34 | 80.21 | 39.59 | 41.49 | 33.33 | 33.71 | 31 | 42.19 | float16 | 0 | InternVL2_5-38B
35.56 | 25 | 9 | 38.5 | 35.42 | 49.49 | 37 | 66.5 | 35 | 48.5 | 51.4 | 45.12 | 73.96 | 31.02 | 30.85 | 31.58 | 36.74 | 33 | 48.44 | float16 | 0 | InternVL3-8B
34.06 | 21.88 | 9.5 | 32 | 31.25 | 50.51 | 37.5 | 59 | 48.5 | 43 | 47.66 | 60.98 | 73.96 | 40 | 32.45 | 64.91 | 23.86 | 24 | 23.44 | float16 | 0 | Mistral-Small-3.1-24B-Instruct-2503
30.63 | 19.64 | 9 | 27.5 | 31.25 | 41.57 | 34.5 | 41 | 31 | 37.5 | 41.12 | 65.85 | 67.71 | 36.33 | 35.64 | 38.6 | 25 | 22 | 34.38 | float16 | 0 | MiniCPM-o-2_6
29.63 | 20.31 | 7 | 30 | 35.42 | 38.16 | 29 | 54.5 | 43 | 35.5 | 29.91 | 62.2 | 7.29 | 34.29 | 31.38 | 43.86 | 25.76 | 28.5 | 17.19 | float16 | 0 | Qwen2.5-VL-7B-Instruct
28.53 | 17.63 | 4 | 28 | 31.25 | 39.35 | 42.5 | 41.5 | 26.5 | 38.5 | 42.99 | 34.15 | 57.29 | 25.31 | 26.06 | 22.81 | 31.82 | 32.5 | 29.69 | float16 | 0 | Qwen2-VL-2B-Instruct
26.44 | 17.86 | 6 | 28.5 | 22.92 | 34.01 | 36 | 31.5 | 26 | 32 | 39.25 | 41.46 | 43.75 | 27.76 | 29.26 | 22.81 | 26.14 | 24 | 32.81 | float16 | 0 | Qwen2.5-VL-3B-Instruct
27.48 | 21.21 | 11 | 32 | 18.75 | 34.29 | 44 | 45.5 | 31.5 | 37.5 | 29.91 | 21.95 | 5.21 | 30.2 | 36.7 | 8.77 | 24.24 | 21.5 | 32.81 | float16 | 0 | Magma-8B
32.69 | 21.43 | 5 | 34.5 | 35.42 | 42.21 | 41 | 43.5 | 33.5 | 42.5 | 51.4 | 41.46 | 50 | 43.27 | 46.28 | 33.33 | 23.86 | 20.5 | 34.38 | float16 | 0 | Molmo-7B-D-0924
21.68 | 18.3 | 6 | 28.5 | 27.08 | 25.35 | 29.5 | 22 | 25.5 | 30 | 14.02 | 24.39 | 27.08 | 24.9 | 25.53 | 22.81 | 18.18 | 14 | 31.25 | float16 | 0 | Phi-4-multimodal-instruct
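Each row of this table carries 19 scores, which matches one overall score plus 4 category scores plus the 14 sub-metrics named in the abbreviation list. The page does not state the column order, so the layout used below is an assumption: overall first, then each category score (Perception, SpatialReasoning, Prediction, Planning) followed by its sub-metrics. Under that assumption, the Python sketch checks that the overall score is the unweighted mean of the four category scores for three rows copied from the table. The name `CATEGORY_POS` is hypothetical.

```python
# Assumed 0-based positions of the four category scores within the 19 scores.
# This layout is inferred from the numbers, not documented by the page.
CATEGORY_POS = {
    "Perception": 1,         # followed by 3 sub-metric scores
    "SpatialReasoning": 5,   # followed by 7 sub-metric scores
    "Prediction": 13,        # followed by 2 sub-metric scores
    "Planning": 16,          # followed by 2 sub-metric scores
}

# The 19 scores of three rows copied from the table above.
rows = {
    "gemini-2.5-pro-preview-05-06": [
        46.15, 40.18, 32.5, 47, 43.75, 62.86, 50.5, 73.5, 53, 60.5,
        60.75, 76.83, 82.29, 44.08, 39.36, 59.65, 37.5, 35, 45.31,
    ],
    "o4-mini-2025-04-16": [
        44.73, 29.91, 16, 41, 41.67, 63.04, 51.5, 67.5, 65, 65,
        57.01, 62.2, 77.08, 52.24, 53.19, 49.12, 33.71, 35, 29.69,
    ],
    "InternVL3-78B": [
        44.74, 34.6, 15, 51, 47.92, 58.8, 42.5, 70, 52.5, 61,
        55.14, 59.76, 81.25, 36.33, 32.45, 49.12, 49.24, 55.5, 29.69,
    ],
}

# Model name -> (listed overall score, mean of the four assumed category scores).
recomputed = {}
for model, scores in rows.items():
    categories = [scores[i] for i in CATEGORY_POS.values()]
    recomputed[model] = (scores[0], sum(categories) / len(categories))
    print(model, recomputed[model])
```

The category scores themselves do not equal the plain mean of their sub-metric scores in these rows, so they are presumably aggregated some other way, for example weighted by question count; the page does not say.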
The abbreviations of the evaluation metrics are as follows:
Perception
- Perception_Visual Grounding(P_VG)
- Perception_Counting(P_C)
- Perception_State & Activity Understanding
SpatialReasoning
- SpatialReasoning_Dynamic(SR_D)
- SpatialReasoning_Relative direction(SR_Rd)
- SpatialReasoning_Multi-view matching(SR_Mm)
- SpatialReasoning_Relative distance(SR_Rd)
- SpatialReasoning_Depth estimation(SR_De)
- SpatialReasoning_Relative shape(SR_Rs)
- SpatialReasoning_Size estimation(SR_Se)
Prediction
- Prediction_Trajectory(P_T)
- Prediction_Future prediction(P_Fp)
Planning
- Planning_Goal Decomposition(P_GD)
- Planning_Navigation(P_N)
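The naming scheme above is regular enough to hold in a small lookup table. The Python sketch below records each category and its sub-metrics exactly as listed and then checks the abbreviations for collisions; the names `METRICS` and `abbreviation_collisions` are hypothetical. As listed, SR_Rd stands for both Relative direction and Relative distance, and State & Activity Understanding has no abbreviation, so full metric names are the safer lookup keys.

```python
from collections import defaultdict

# Category -> {sub-metric name: abbreviation as listed above, or None if none is given}.
METRICS = {
    "Perception": {
        "Visual Grounding": "P_VG",
        "Counting": "P_C",
        "State & Activity Understanding": None,
    },
    "SpatialReasoning": {
        "Dynamic": "SR_D",
        "Relative direction": "SR_Rd",
        "Multi-view matching": "SR_Mm",
        "Relative distance": "SR_Rd",
        "Depth estimation": "SR_De",
        "Relative shape": "SR_Rs",
        "Size estimation": "SR_Se",
    },
    "Prediction": {"Trajectory": "P_T", "Future prediction": "P_Fp"},
    "Planning": {"Goal Decomposition": "P_GD", "Navigation": "P_N"},
}


def abbreviation_collisions(metrics: dict) -> dict:
    """Map each abbreviation used more than once to the full metric names sharing it."""
    users = defaultdict(list)
    for category, subs in metrics.items():
        for name, abbr in subs.items():
            if abbr is not None:
                users[abbr].append(f"{category}_{name}")
    return {abbr: names for abbr, names in users.items() if len(names) > 1}


collisions = abbreviation_collisions(METRICS)
n_sub_metrics = sum(len(subs) for subs in METRICS.values())
print(n_sub_metrics, collisions)
```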
The Goal of FlagEval - Embodied Verse
Thank you for taking part in the evaluation. We will keep improving FlagEval - Embodied Verse and keep its ecosystem open. Developers are welcome to join the discussion of evaluation methods, tools, and datasets so that together we can build a more scientific and open embodied evaluation toolchain.
Context
FlagEval-Embodied Verse is a scientific and comprehensive embodied evaluation toolchain. It comprises the FlagEvalMM multimodal evaluation framework, the Embodied Verse high-quality evaluation dataset for embodied intelligence, and the Leaderboard, which visualizes the capabilities of embodied models.
We hope to foster a more open ecosystem in which developers of embodied models can take part and contribute to the progress of embodied models. To ensure fairness, all models are evaluated under the FlagEvalMM framework using standardized GPUs and a unified environment.
How it works
Embodied Verse tool - FlagEvalMM
FlagEvalMM is an open-source evaluation framework designed to comprehensively assess multimodal models. It provides a standardized way to evaluate models that work with multiple modalities (text, images, video) across various tasks and metrics.
- Flexible Architecture: Supports multiple multimodal models and evaluation tasks, including VQA, image retrieval, and text-to-image.
- Comprehensive Benchmarks and Metrics: Supports both the latest and the commonly used benchmarks and metrics.
- Extensive Model Support: The model_zoo provides inference support for a wide range of popular multimodal models, including QWenVL and LLaVA. It also integrates seamlessly with API-based models such as GPT, Claude, and HuanYuan.
- Extensible Design: Easy to extend with new models, benchmarks, and evaluation metrics.
Embodied Verse
EmbodiedVerse-Open is a meta-dataset composed of 10 datasets for comprehensively evaluating models in embodied intelligence scenarios. It includes:
- Where2Place: A collection of 100 real-world images from diverse cluttered environments, each annotated with a sentence describing a desired free space and a corresponding mask, designed to evaluate free-space referencing based on spatial relations.
- Blink: A set of visual problems that humans can solve easily. EmbodiedVerse samples the categories related to spatial understanding (Counting, Relative_Depth, Spatial_Relation, Multi-view_Reasoning, Visual_Correspondence).
- CVBench: A vision-centric benchmark containing 2,638 manually inspected examples.
- RoboSpatial-Home: A new spatial reasoning benchmark designed to evaluate vision-language models (VLMs) in real-world indoor environments for robotics.
- EmbspatialBench: A benchmark for evaluating the embodied spatial understanding of large vision-language models (LVLMs). The benchmark is automatically derived from embodied scenes and covers 6 spatial relationships from an egocentric perspective.
- All-Angles Bench: A benchmark for multi-view understanding, with over 2,100 human-annotated multi-view QA pairs across 90 real-world scenes.
- VSI-Bench: A video-based benchmark that builds its questions from egocentric videos of real indoor scenes, aiming to evaluate the visual-spatial intelligence of multimodal large models. EmbodiedVerse uses a tiny subset containing 400 questions.
- SAT: A challenging dynamic spatial benchmark built on real images.
- EgoPlan-Bench2: A benchmark of everyday tasks spanning 4 major domains and 24 detailed scenarios, closely aligned with human daily life.
- ERQA: An evaluation benchmark covering a variety of topics related to spatial reasoning and world knowledge, focused on real-world scenarios, particularly in the context of robotics.
Dataset subset links: coming soon
Details and logs
You can find:
- detailed numerical results in the results Hugging Face dataset: EmbodiedVerse_results
- community queries and running status in the requests Hugging Face dataset: EmbodiedVerse_requests