Do Vision and Text Cues Exhibit Evidential Coupling?
UFO: A Benchmark for Compositional Multimodal Reasoning in Unified Models

Zhongyu Yang, Dannong Xu*, Yonghan Zhang*, Kefan Chen*, Xinyi Wang*, Yang Xu, Wei Pang, Yingfang Yuan
ICML 2026
1 BCML, Heriot-Watt University    2 INSAIT    3 Southern University of Science and Technology

*Equal contribution

Corresponding author
[Figure: UFO tasks, 10 tasks across 3 categories]

UFO spans three state-transition regimes — State Determination, State Reconstruction, and State Augmentation — across 10 tasks (Hybridisation, Chemical, Multi-table, Multi-view, Inpainting, Exo-to-Ego, Jigsaw, Geometric, Logical, Physics).

Abstract

Unified Foundation Models (UFMs), which support interleaved multimodal generation and understanding, have been proposed as a promising paradigm for reasoning about dynamic world states. It remains unclear, however, whether the visual content they generate functions as grounded evidence for subsequent reasoning or merely as auxiliary output. Existing benchmarks largely evaluate generation and understanding as separate capabilities and do not test whether one functionally depends on the other during reasoning.

We introduce UFO, a benchmark designed to evaluate whether UFMs generate and use image and text cues as evidence for compositional multimodal reasoning. UFO spans three cue types — state determination, state reconstruction, and state augmentation — which correspond to progressively smaller transformations of the underlying world state. Our analysis reveals a significant modality gap: models often achieve high prediction accuracy even when the generated visual cues exert limited influence on their decisions, indicating weakened evidential coupling and a reliance on textual shortcuts rather than robust cross-modal grounding.

How UFO works

[Figure: UFO reasoning framework]

Two-step reasoning under four protocols. Given the input images and a question, a model first generates intermediate textual and visual cues describing a future state, then answers the question either without those cues (direct) or conditioned on them (textual, visual, or joint). Genuine cross-modal synergy shows up as joint performance exceeding both unimodal protocols (joint > textual and joint > visual); when it does not, the model is likely relying on a single-modality prior.
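The comparison across the four protocols can be illustrated with a minimal Python sketch. It is not the benchmark's evaluation code: the `Record` type, the exact-match scoring, and the helper names `accuracy_by_protocol` and `synergy_gap` are all assumptions made for illustration, and UFO's actual metrics may differ. The sketch computes per-protocol accuracy and the gap between the joint protocol and the better of the two unimodal protocols.

```python
from dataclasses import dataclass
from typing import Dict, List

# Protocol names follow the text above; everything else here is hypothetical.
PROTOCOLS = ("direct", "textual", "visual", "joint")


@dataclass
class Record:
    """One benchmark item: the gold answer and the model's answer per protocol."""
    gold: str
    answers: Dict[str, str]  # protocol name -> predicted answer


def accuracy_by_protocol(records: List[Record]) -> Dict[str, float]:
    """Exact-match accuracy under each protocol (an assumed scoring rule)."""
    if not records:
        raise ValueError("records must be non-empty")
    return {
        p: sum(r.answers[p] == r.gold for r in records) / len(records)
        for p in PROTOCOLS
    }


def synergy_gap(acc: Dict[str, float]) -> float:
    """Joint accuracy minus the best unimodal accuracy.

    A positive value corresponds to joint > unimodal; zero or negative
    suggests the answer is driven by a single modality.
    """
    return acc["joint"] - max(acc["textual"], acc["visual"])
```

Under these assumptions, a positive `synergy_gap` is the pattern the text calls cross-modal synergy. A joint score that merely matches the textual score, with the visual score far behind, would instead point to a textual shortcut.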

Case Studies

For each task, we compare the model's generated textual and visual cues against the ground truth, and trace how they drive the final answer.

BibTeX

@inproceedings{ufo2026,
  title     = {Do Vision and Text Cues Exhibit Evidential Coupling?
               UFO: A Benchmark for Compositional Multimodal Reasoning in Unified Models},
  author    = {Yang, Zhongyu and Xu, Dannong and Zhang, Yonghan and Chen, Kefan and
               Wang, Xinyi and Xu, Yang and Pang, Wei and Yuan, Yingfang},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026}
}