**Perception Reasoning** (sub-dimensions grouped into robotic-centric, object-centric, scene-centric, and task-centric perception)

Model | Robot-type | Robot-view | Static Attr. | Functional Attr. | Spatial Relation | Temp. Grounding | Causality | Refer. Comprehen. | Avg
---|---|---|---|---|---|---|---|---|---
Basic Reference | |||||||||
Human Evaluation | 80.67 | 79.08 | 43.77 | 83.89 | 70.91 | 51.61 | 91.22 | 93.22 | 74.30 |
GPT-4o-text-only | 20.51 | 13.77 | 5.18 | 35.37 | 25.74 | 18.32 | 25.52 | 22.09 | 20.81 |
Closed-Source MLLMs | |||||||||
GPT-4o-Mini | 38.75 | 18.84 | 26.43 | 53.66 | 30.36 | 22.65 | 34.25 | 39.67 | 33.08 |
GPT-4o | 64.96 | 39.38 | 24.92 | 46.75 | 42.24 | 20.61 | 33.10 | 41.31 | 39.16 |
Claude-3.5-Sonnet | 41.31 | 36.23 | 29.13 | 62.60 | 34.98 | 21.88 | 36.09 | 25.36 | 35.95 |
Claude-3.7-Sonnet | 40.46 | 32.37 | 45.20 | 71.14 | 36.63 | 21.09 | 40.92 | 28.02 | 39.48 |
Gemini-2.0-Flash | 56.69 | 20.77 | 49.08 | 78.46 | 42.57 | 21.37 | 51.72 | 72.40 | 49.13 |
Gemini-2.5-Flash | 62.39 | 39.38 | 55.02 | 77.24 | 57.43 | 33.58 | 70.34 | 74.64 | 58.75 |
Gemini-2.5-Pro | 64.30 | 41.71 | 54.83 | 82.27 | 60.44 | 49.68 | 71.73 | 78.68 | 62.96 |
Qwen-VL-Plus | 28.21 | 21.74 | 34.63 | 58.54 | 27.72 | 21.37 | 31.03 | 34.36 | 32.20 |
Qwen-VL-Max | 47.86 | 43.48 | 39.70 | 75.20 | 50.17 | 27.45 | 37.93 | 41.53 | 45.42 |
Open-Source Multi-Image MLLMs | |||||||||
LLaVA-OneVision-0.5B | 30.34 | 23.68 | 37.08 | 49.66 | 27.27 | 18.42 | 23.65 | 19.21 | 28.66 |
LLaVA-OneVision-7B | 44.83 | 30.26 | 33.43 | 75.84 | 45.45 | 23.68 | 25.68 | 44.63 | 40.48 |
Qwen2.5-VL-7B-Ins | 23.93 | 26.81 | 37.86 | 46.34 | 31.68 | 22.90 | 14.48 | 36.81 | 30.10 |
Qwen2.5-VL-72B-Ins | 47.72 | 42.75 | 41.74 | 72.95 | 48.51 | 27.87 | 40.32 | 42.13 | 45.50 |
Embodied MLLMs | |||||||||
RoboBrain-2.0-7B | 44.97 | 24.84 | 40.43 | 79.19 | 48.18 | 23.48 | 41.22 | 53.67 | 44.50 |
**Instruction Comprehension and Generalized Planning** (planning columns grouped into cross-embodiment: Single-arm, Dual-arm, Mobile-manip., Human; cross-object: Material Afford., Physical Attr., World Knowl.; cross-view: Multi, Single; cross-task: Navigation Plan.)

Model | Explicit | Implicit | Instr. Avg | Single-arm | Dual-arm | Mobile-manip. | Human | Material Afford. | Physical Attr. | World Knowl. | Multi | Single | Navigation Plan. | Plan. Avg
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Basic Reference | ||||||||||||||
Human Evaluation | 59.94 | 61.13 | 60.54 | 72.50 | 41.93 | 41.55 | 62.28 | 56.70 | 58.98 | 49.36 | 52.82 | 51.59 | 45.23 | 54.50 |
GPT-4o-text-only | 38.80 | 11.10 | 24.95 | 26.70 | 33.32 | 43.65 | 37.86 | 36.58 | 22.33 | 37.68 | 44.35 | 38.11 | 36.90 | 33.95 |
Closed-Source MLLMs | ||||||||||||||
GPT-4o-Mini | 41.21 | 14.95 | 28.08 | 27.47 | 25.21 | 37.98 | 31.72 | 33.75 | 38.46 | 42.56 | 39.11 | 33.29 | 34.04 | 33.31 |
GPT-4o | 45.60 | 19.04 | 32.32 | 28.28 | 32.65 | 52.69 | 35.71 | 39.93 | 46.09 | 41.34 | 38.51 | 33.66 | 39.41 | 37.74 |
Claude-3.5-Sonnet | 42.11 | 14.85 | 28.48 | 30.18 | 33.65 | 50.29 | 41.05 | 38.28 | 40.67 | 39.63 | 45.95 | 40.43 | 39.77 | 38.07 |
Claude-3.7-Sonnet | 47.77 | 14.53 | 31.15 | 29.86 | 38.69 | 50.39 | 37.06 | 38.65 | 41.86 | 51.83 | 48.19 | 44.51 | 39.95 | 41.68 |
Gemini-2.0-Flash | 43.49 | 16.38 | 29.93 | 28.67 | 33.66 | 48.27 | 33.95 | 40.76 | 54.27 | 40.12 | 46.13 | 40.73 | 37.02 | 38.62 |
Gemini-2.5-Flash | 42.53 | 17.10 | 29.82 | 27.05 | 40.46 | 49.91 | 34.50 | 39.87 | 53.37 | 46.22 | 39.41 | 43.29 | 38.32 | 39.33 |
Gemini-2.5-Pro | 51.15 | 19.60 | 35.37 | 29.71 | 37.65 | 50.96 | 37.44 | 39.29 | 56.50 | 43.29 | 47.35 | 45.12 | 43.62 | 41.81 |
Qwen-VL-Plus | 37.77 | 10.38 | 24.07 | 24.68 | 21.75 | 32.98 | 33.91 | 28.45 | 33.55 | 33.78 | 30.95 | 28.60 | 4.39 | 26.77 |
Qwen-VL-Max | 46.45 | 16.98 | 31.71 | 28.30 | 35.73 | 47.79 | 32.40 | 40.44 | 44.33 | 42.32 | 41.79 | 37.68 | 38.00 | |
Open-Source Multi-Image MLLMs | ||||||||||||||
LLaVA-OneVision-0.5B | 6.82 | 1.24 | 3.61 | 2.90 | 4.57 | 4.77 | 3.68 | 4.77 | 3.47 | 6.47 | 4.30 | 3.62 | 11.39 | 4.83 |
LLaVA-OneVision-7B | 18.93 | 3.48 | 10.05 | 11.48 | 16.23 | 8.27 | 5.34 | 18.51 | 15.62 | 8.10 | 0.00 | 15.16 | 24.67 | 12.15 |
Qwen2.5-VL-7B-Ins | 26.45 | 4.65 | 15.55 | 19.47 | 12.90 | 28.75 | 28.19 | 22.06 | 21.63 | 25.61 | 11.79 | 20.12 | 2.10 | 18.64 |
Qwen2.5-VL-72B-Ins | 46.81 | 15.15 | 30.98 | 28.20 | 36.92 | 49.14 | 31.31 | 40.51 | 44.94 | 38.90 | 43.16 | 40.24 | 37.47 | 37.73 |
Embodied MLLMs | ||||||||||||||
RoboBrain-2.0-7B | 36.93 | 8.19 | 22.51 | 15.46 | 25.32 | 32.72 | 31.81 | 19.85 | 30.85 | 23.24 | 31.51 | 23.89 | 24.53 | 25.35 |
**Affordance Prediction and Failure Analysis**

Model | Static Afford. | Dynamic Afford. | Navigation Afford. | Afford. Avg | Execution Failure | Planning Failure | Failure Avg
---|---|---|---|---|---|---|---
Basic Reference | |||||||
Human Evaluation | 86.08 | 80.02 | 81.85 | 82.63 | 47.30 | 80.67 | 63.99 |
GPT-4o-text-only | 44.89 | 40.70 | 38.19 | 39.88 | 25.17 | 37.93 | 31.55 |
Closed-Source MLLMs | |||||||
GPT-4o-Mini | 50.64 | 42.88 | 42.30 | 46.39 | 17.66 | 44.60 | 31.13 |
GPT-4o | 55.61 | 49.14 | 49.91 | 51.91 | 22.29 | 57.01 | 39.65 |
Claude-3.5-Sonnet | 56.26 | 54.25 | 53.84 | 54.77 | 16.12 | 47.52 | 31.82 |
Claude-3.7-Sonnet | 60.02 | 52.38 | 50.07 | 54.06 | 18.32 | 54.24 | 36.28 |
Gemini-2.0-Flash | 61.65 | 61.76 | 66.89 | 63.37 | 28.48 | 59.80 | 44.14 |
Gemini-2.5-Flash | 61.20 | 52.04 | 52.01 | 54.29 | 18.54 | 67.65 | 43.10 |
Gemini-2.5-Pro | 70.54 | 62.03 | 63.96 | 65.21 | 15.96 | 74.31 | 45.14 |
Qwen-VL-Plus | 51.74 | 37.42 | 47.97 | 48.18 | 13.91 | 40.00 | 26.96 |
Qwen-VL-Max | 70.01 | 56.26 | 50.85 | 59.43 | 17.22 | 57.93 | 37.58 |
Open-Source Multi-Image MLLMs | |||||||
LLaVA-OneVision-0.5B | 20.56 | 28.56 | 27.69 | 24.76 | 21.19 | 24.67 | 22.93 |
LLaVA-OneVision-7B | 23.83 | 33.61 | 33.43 | 30.29 | 29.14 | 34.00 | 31.56 |
Qwen2.5-VL-7B-Ins | 49.73 | 38.03 | 42.16 | 43.15 | 13.91 | 26.90 | 20.41 |
Qwen2.5-VL-72B-Ins | 71.54 | 51.94 | 47.67 | 56.67 | 12.59 | 50.72 | 31.66 |
Embodied MLLMs | |||||||
RoboBrain-2.0-7B | 51.87 | 54.63 | 41.61 | 49.37 | 7.95 | 42.00 | 41.24 |
Gemini-2.5-Pro achieves the strongest overall performance across all five cognitive dimensions. It scores 62.96 on perception reasoning and 65.21 / 45.14 on affordance prediction and failure analysis, well above other models but still far below the human reference (74.30 / 82.63 / 63.99). This underscores a persistent gap between current MLLMs and robust human-level embodied intelligence.
Closed-source MLLMs outperform open-source ones in four of the five dimensions, often by 10–15 points. Open-source models approach parity only in perception reasoning, where Qwen2.5-VL-72B-Ins (45.50) edges out GPT-4o and both Claude models. Within each family, larger models generally perform better, e.g., GPT-4o > GPT-4o-Mini and Claude-3.7 > Claude-3.5.
The embodied MLLM RoboBrain-2.0-7B surpasses similarly sized general open-source models in perception reasoning, planning, and affordance prediction. This validates the effectiveness of domain-specific embodied datasets for improving multimodal reasoning and planning.
Perception reasoning yields the highest accuracies, while generalized planning remains the most challenging, exposing weaknesses in long-horizon reasoning and structured task decomposition. This contrast highlights where future progress is most needed.
Performance on implicit instructions drops sharply relative to explicit ones, typically by 25–33 points (e.g., GPT-4o falls from 45.60 to 19.04). Models struggle to infer goals from indirect human requests, revealing weak integration of language, perception, and context.
Models misidentify robot types or viewpoints and fail to localize events in time. Temporal grounding and causal reasoning accuracies mostly hover around 20–40%, with only the Gemini series (especially Gemini-2.5) scoring substantially higher. Stronger embodiment-aware perception and spatiotemporal reasoning modules are needed.
Cross-embodiment planning shows poor coordination in dual-arm and mobile-manipulation settings, and cross-object planning struggles with rare or knowledge-dependent objects. In cross-view planning, multi-image inputs improve performance for several models (roughly +3 points for GPT-4o and +4 for Claude-3.7), pointing to the promise of multi-view reasoning.
Diagnosing execution-level errors is far harder than diagnosing planning-level ones (scores of roughly 10–20 vs. 40–60). It requires fine-grained spatial and physical understanding, such as distinguishing location errors from rotation errors. Even humans achieve only 47.30 on these tasks, underscoring their intrinsic complexity.
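To make the score gaps cited in these findings concrete, the short snippet below recomputes a few of them directly from values quoted in the leaderboard tables; it is purely illustrative and not part of the benchmark's tooling.

```python
# Values copied from the leaderboard tables above (illustrative recomputation only).
explicit_vs_implicit = {          # instruction comprehension: (explicit, implicit)
    "GPT-4o": (45.60, 19.04),
    "Claude-3.7-Sonnet": (47.77, 14.53),
    "Gemini-2.5-Pro": (51.15, 19.60),
}
execution_vs_planning = {         # failure analysis: (execution, planning)
    "GPT-4o": (22.29, 57.01),
    "Gemini-2.5-Pro": (15.96, 74.31),
}

for model, (exp, imp) in explicit_vs_implicit.items():
    print(f"{model}: implicit is {exp - imp:.2f} points below explicit")

for model, (exe, plan) in execution_vs_planning.items():
    print(f"{model}: execution-failure diagnosis trails planning by {plan - exe:.2f} points")
```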
**Planning Evaluation Framework.** The planning dimension is evaluated through three question types (Q1–Q3). Each task is decomposed into a sequence of parameterized atomic actions that form a directed acyclic graph (DAG) encoding causal and temporal dependencies. For Q1 (long-horizon planning), an MLLM-based world simulator assesses both NodeCorrectness (action alignment) and TaskCompletion (goal-state achievement) by simulating action rollouts under visual and physical constraints. Q2 (next-step planning) evaluates fine-grained step prediction by comparing skill, object, and parameter accuracy, while Q3 (task-state estimation) measures binary correctness on whether a subtask has been completed. Together, the pipeline provides a unified, interpretable framework for assessing structural correctness and embodied feasibility in planning.
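As a rough illustration of the structural checks described above, the sketch below encodes a plan as parameterized atomic actions in a DAG and scores NodeCorrectness (Q1) plus the per-field skill/object/parameter comparison (Q2). All class and function names here (`AtomicAction`, `TaskDAG`, `score_next_step`, etc.) are hypothetical, chosen for clarity rather than taken from the benchmark's implementation, and the MLLM-based world simulator used for TaskCompletion is not reproduced.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AtomicAction:
    """One parameterized atomic action, i.e., a node in the task DAG."""
    skill: str            # e.g. "pick", "place", "open"
    obj: str              # object the skill acts on
    params: tuple = ()    # extra arguments, e.g. a target location

@dataclass
class TaskDAG:
    """Reference plan: atomic actions plus causal/temporal dependency edges."""
    nodes: list                                  # list[AtomicAction]
    edges: list = field(default_factory=list)    # (i, j) means node i must precede node j

def node_correctness(pred: list, ref: TaskDAG) -> float:
    """Q1-style NodeCorrectness: fraction of reference actions present in the predicted plan."""
    matched = sum(any(p == n for p in pred) for n in ref.nodes)
    return matched / len(ref.nodes)

def respects_order(pred: list, ref: TaskDAG) -> bool:
    """Check that every causal/temporal dependency of the DAG is kept in the predicted plan."""
    pos = {action: k for k, action in enumerate(pred)}
    for i, j in ref.edges:
        a, b = ref.nodes[i], ref.nodes[j]
        if a not in pos or b not in pos or pos[a] >= pos[b]:
            return False
    return True

def score_next_step(pred: AtomicAction, ref: AtomicAction) -> dict:
    """Q2-style scoring: compare skill, object, and parameters separately."""
    return {
        "skill": float(pred.skill == ref.skill),
        "object": float(pred.obj == ref.obj),
        "params": float(pred.params == ref.params),
    }

# Example: a two-step reference plan and a prediction that gets the place target wrong.
ref = TaskDAG(
    nodes=[AtomicAction("pick", "cup"), AtomicAction("place", "cup", ("table",))],
    edges=[(0, 1)],   # the cup must be picked before it is placed
)
pred = [AtomicAction("pick", "cup"), AtomicAction("place", "cup", ("shelf",))]
print(node_correctness(pred, ref))               # 0.5 -- only the "pick" node matches exactly
print(score_next_step(pred[1], ref.nodes[1]))    # {'skill': 1.0, 'object': 1.0, 'params': 0.0}
```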
```bibtex
@misc{luo2025robobenchcomprehensiveevaluationbenchmark,
  title={Robobench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain},
  author={Yulin Luo and Chun-Kai Fan and Menghang Dong and Jiayu Shi and Mengdi Zhao and Bo-Wen Zhang and Cheng Chi and Jiaming Liu and Gaole Dai and Rongyu Zhang and Ruichuan An and Kun Wu and Zhengping Che and Shaoxuan Xie and Guocai Yao and Zhongxia Zhao and Pengwei Wang and Guang Liu and Zhongyuan Wang and Tiejun Huang and Shanghang Zhang},
  year={2025},
  eprint={2510.17801},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2510.17801},
}
```