Do Vision and Text Cues Exhibit Evidential Coupling?
UFO: A Benchmark for Compositional Multimodal Reasoning in Unified Models

Zhongyu Yang, Dannong Xu*, Yonghan Zhang*, Kefan Chen*, Xinyi Wang*, Yang Xu, Wei Pang, Yingfang Yuan†

ICML 2026

¹BCML, Heriot-Watt University ²INSAIT ³Southern University of Science and Technology

*Equal contribution
†Corresponding author


UFO spans three state-transition regimes — State Determination, State Reconstruction, and State Augmentation — across 10 tasks (Hybridisation, Chemical, Multi-table, Multi-view, Inpainting, Exo-to-Ego, Jigsaw, Geometric, Logical, Physics).

Abstract

Unified Foundation Models (UFMs), which support interleaved multimodal generation and understanding, have been proposed as a promising paradigm for reasoning about dynamic world states, yet it remains unclear whether the visual content they generate functions as grounded evidence for subsequent reasoning or merely as auxiliary output. Existing benchmarks largely evaluate generation and understanding as separate capabilities and do not test their functional dependence during reasoning.

We introduce UFO, a benchmark designed to evaluate whether UFMs generate and use image and text cues as evidence for compositional multimodal reasoning. UFO spans three cue types — state determination, state reconstruction, and state augmentation — which correspond to progressively smaller transformations of the underlying world state. Our analysis reveals a significant modality gap: models often achieve high prediction accuracy even when the generated visual cues exert limited influence on their decisions, indicating weakened evidential coupling and a reliance on textual shortcuts rather than robust cross-modal grounding.

How UFO works

Two-step reasoning under four protocols. Given the input images and a question, a model first generates intermediate textual and visual cues describing a future state, then answers the question either alone (direct) or conditioned on those cues (textual, visual, or joint). Genuine cross-modal synergy appears as joint accuracy exceeding unimodal accuracy; otherwise, the model is relying on a single-modality prior.
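The four protocols and the synergy criterion can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's evaluation code: the `Protocol` values mirror the names in the text, while `build_context`, `synergy_gap`, the use of strings to stand in for images, and the reading of "joint > unimodal" as "joint beats the better of the two unimodal protocols" are all hypothetical choices made for this example.

```python
from enum import Enum
from typing import Dict, List


class Protocol(Enum):
    DIRECT = "direct"    # answer from the inputs alone, no generated cues
    TEXTUAL = "textual"  # answer conditioned on the generated textual cue
    VISUAL = "visual"    # answer conditioned on the generated visual cue
    JOINT = "joint"      # answer conditioned on both generated cues


def build_context(
    protocol: Protocol,
    input_images: List[str],
    question: str,
    text_cue: str,
    image_cue: str,
) -> List[str]:
    """Assemble what the model sees in step 2 (answering).

    Hypothetical helper: images are represented by string handles,
    and the cues are assumed to come from step 1 (cue generation).
    """
    context = [*input_images, question]
    if protocol in (Protocol.TEXTUAL, Protocol.JOINT):
        context.append(text_cue)
    if protocol in (Protocol.VISUAL, Protocol.JOINT):
        context.append(image_cue)
    return context


def synergy_gap(accuracy: Dict[Protocol, float]) -> float:
    """Joint accuracy minus the better unimodal accuracy.

    Assumption: a positive gap is read as cross-modal synergy; a gap of
    zero or less suggests reliance on a single-modality prior.
    """
    best_unimodal = max(accuracy[Protocol.TEXTUAL], accuracy[Protocol.VISUAL])
    return accuracy[Protocol.JOINT] - best_unimodal


if __name__ == "__main__":
    # Illustrative numbers only; these are NOT results from the paper.
    example = {
        Protocol.DIRECT: 0.50,
        Protocol.TEXTUAL: 0.62,
        Protocol.VISUAL: 0.55,
        Protocol.JOINT: 0.60,
    }
    gap = synergy_gap(example)
    verdict = "synergy" if gap > 0 else "single-modality prior"
    print(f"gap = {gap:+.2f} ({verdict})")
```

Comparing against the better unimodal protocol, rather than the average of the two, keeps the check conservative: a joint score that merely matches the textual protocol is not counted as evidence that the visual cue contributed.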

Case Studies

For each task, we compare the model's generated textual and visual cues against the ground truth, and trace how they drive the final answer.

State Determination: Hybridisation, Chemical, Multi-table, Multi-view
State Reconstruction: Inpainting, Exo-to-Ego, Jigsaw
State Augmentation: Geometric, Logical, Physics

BibTeX

@inproceedings{ufo2026,
  title     = {Do Vision and Text Cues Exhibit Evidential Coupling?
               UFO: A Benchmark for Compositional Multimodal Reasoning in Unified Models},
  author    = {Yang, Zhongyu and Xu, Dannong and Zhang, Yonghan and Chen, Kefan and
               Wang, Xinyi and Xu, Yang and Pang, Wei and Yuan, Yingfang},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026}
}

