$\text{X}^\text{R}$: Cross-Modal Agents for Composed Image Retrieval

WWW 2026
BCML, Heriot-Watt University


The workflows of existing CIR methods and ours: (a) Joint embedding-based methods encode a multimodal query into a shared space, but they often struggle to capture complex text-specified edits. (b) Caption-to-Image methods first generate a target caption from the multimodal query prior to retrieval, but they often fail to preserve fine-grained details. (c) Caption-to-Caption methods build upon Caption-to-Image but restrict comparison to the text space, thereby discarding visual cues. (d) $\text{X}^\text{R}$ (ours) introduces an agentic AI framework with cross-modal agents and a progressive retrieval process consisting of an imagination stage followed by coarse-to-fine filtering, enabling robust reasoning that better aligns results with user intent.
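For concreteness, the sketch below illustrates workflow (a): a joint embedding baseline fuses the two query modalities into a single vector and ranks the gallery by cosine similarity. This is a minimal sketch, assuming pre-computed embeddings and a simple additive fusion rule; it is not the recipe of any particular published method.

import torch
import torch.nn.functional as F

def joint_embedding_retrieval(ref_image_emb: torch.Tensor,
                              mod_text_emb: torch.Tensor,
                              gallery_embs: torch.Tensor,
                              k: int = 10) -> torch.Tensor:
    """Workflow (a): fuse the reference image and modification text into
    one query vector, then rank the gallery by cosine similarity."""
    # Simple additive fusion; learned methods replace this with a fusion module.
    query = F.normalize(ref_image_emb + mod_text_emb, dim=-1)
    gallery = F.normalize(gallery_embs, dim=-1)   # (N, D) candidate embeddings
    scores = gallery @ query                      # cosine similarity per candidate
    return scores.topk(k).indices                 # indices of the top-k candidates

Because all reasoning is compressed into one vector comparison, such a baseline has no mechanism to verify whether a retrieved image actually satisfies the requested edit, which is the gap the later stages of $\text{X}^\text{R}$ address.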

Abstract

Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift as each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding-based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce $\text{X}^\text{R}$, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialized types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for fine filtering. Through progressive multi-agent coordination, $\text{X}^\text{R}$ iteratively refines retrieval to meet both semantic and visual query constraints, achieving up to a 38% gain over strong training-free and training-based baselines on FashionIQ, CIRR, and CIRCO, while ablations show each agent is essential.

Method


Framework of $\text{X}^\text{R}$. The multi-agent system integrates textual and visual imagination with cross-modal similarity and question-based scoring, followed by re-ranking. This multi-stage reasoning process exploits complementary cues from both modalities, effectively handling fine-grained modifications that single-modality approaches often miss.
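A minimal sketch of this multi-stage process is given below. The agent interfaces (ImaginationAgent, SimilarityAgent, QuestionAgent) and the weighted blending of coarse and fine scores are our illustrative assumptions; the paper's exact prompts, models, and weighting may differ.

from typing import List, Protocol

class ImaginationAgent(Protocol):
    def imagine(self, ref_image, mod_text):
        """Synthesize a target representation (caption and/or image) of the desired result."""

class SimilarityAgent(Protocol):
    def score(self, target_repr, candidates) -> List[float]:
        """Hybrid text-and-visual similarity used for coarse filtering."""

class QuestionAgent(Protocol):
    def verify(self, ref_image, mod_text, candidate) -> float:
        """Ask targeted questions about a candidate and return a factual-consistency score."""

def xr_retrieve(ref_image, mod_text, candidates,
                imagination: ImaginationAgent,
                similarity: SimilarityAgent,
                question: QuestionAgent,
                coarse_k: int = 50, alpha: float = 0.5) -> List[int]:
    # Stage 1 (imagination): synthesize the target the query describes.
    target = imagination.imagine(ref_image, mod_text)

    # Stage 2 (coarse filtering): rank all candidates by hybrid similarity
    # and keep a shortlist for the expensive verification stage.
    sims = similarity.score(target, candidates)
    shortlist = sorted(range(len(candidates)), key=lambda i: -sims[i])[:coarse_k]

    # Stage 3 (fine filtering + re-ranking): blend the coarse score with a
    # question-based consistency score and re-rank the shortlist.
    final = {i: alpha * sims[i]
                + (1 - alpha) * question.verify(ref_image, mod_text, candidates[i])
             for i in shortlist}
    return sorted(shortlist, key=lambda i: -final[i])

Restricting question-based verification to a shortlist keeps the expensive reasoning stage tractable while still letting it correct coarse-ranking mistakes.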


BibTeX

BibTeX Code Here