$\text{X}^\text{R}$: Cross-Modal Agents for Composed Image Retrieval

WWW 2026
BCML, Heriot-Watt University


The workflows of existing CIR methods and ours: (a) Joint embedding-based methods encode a multimodal query into a shared space, but they often struggle to capture complex text-specified edits. (b) Caption-to-Image methods first generate a target caption from the multimodal query prior to retrieval, but they often fail to preserve fine-grained details. (c) Caption-to-Caption methods build upon Caption-to-Image but restrict comparison to the text space, thereby discarding visual cues. (d) $\text{X}^\text{R}$ (ours) introduces an agentic AI framework with cross-modal agents and a progressive retrieval process consisting of an imagination stage followed by coarse-to-fine filtering, enabling robust reasoning that better aligns results with user intent.
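For concreteness, the sketch below illustrates workflow (a): a joint embedding baseline fuses the two query modalities into a single vector and ranks the gallery by cosine similarity. This is a minimal sketch, assuming pre-computed embeddings and a simple additive fusion rule; it is not the recipe of any particular published method.

import torch
import torch.nn.functional as F

def joint_embedding_retrieval(ref_image_emb: torch.Tensor,
                              mod_text_emb: torch.Tensor,
                              gallery_embs: torch.Tensor,
                              k: int = 10) -> torch.Tensor:
    """Workflow (a): fuse the reference image and modification text into
    one query vector, then rank the gallery by cosine similarity."""
    # Simple additive fusion; learned methods replace this with a fusion module.
    query = F.normalize(ref_image_emb + mod_text_emb, dim=-1)
    gallery = F.normalize(gallery_embs, dim=-1)   # (N, D) candidate embeddings
    scores = gallery @ query                      # cosine similarity per candidate
    return scores.topk(k).indices                 # indices of the top-k candidates

Because all reasoning is compressed into one vector comparison, such a baseline has no mechanism to verify whether a retrieved image actually satisfies the requested edit, which is the gap the later stages of $\text{X}^\text{R}$ address.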

Abstract

Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift as each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding-based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce $\text{X}^\text{R}$, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialized types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for fine filtering. Through progressive multi-agent coordination, $\text{X}^\text{R}$ iteratively refines retrieval to meet both semantic and visual query constraints, achieving up to a 38% gain over strong training-free and training-based baselines on FashionIQ, CIRR, and CIRCO, while ablations show each agent is essential.

Method


Framework of $\text{X}^\text{R}$. The multi-agent system integrates textual and visual imagination with cross-modal similarity and question-based scoring, followed by re-ranking. This multi-stage reasoning process exploits complementary cues from both modalities, effectively handling fine-grained modifications that single-modality approaches often miss.
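A minimal sketch of this multi-stage process is given below. The agent interfaces (ImaginationAgent, SimilarityAgent, QuestionAgent) and the weighted blending of coarse and fine scores are our illustrative assumptions; the paper's exact prompts, models, and weighting may differ.

from typing import List, Protocol

class ImaginationAgent(Protocol):
    def imagine(self, ref_image, mod_text):
        """Synthesize a target representation (caption and/or image) of the desired result."""

class SimilarityAgent(Protocol):
    def score(self, target_repr, candidates) -> List[float]:
        """Hybrid text-and-visual similarity used for coarse filtering."""

class QuestionAgent(Protocol):
    def verify(self, ref_image, mod_text, candidate) -> float:
        """Ask targeted questions about a candidate and return a factual-consistency score."""

def xr_retrieve(ref_image, mod_text, candidates,
                imagination: ImaginationAgent,
                similarity: SimilarityAgent,
                question: QuestionAgent,
                coarse_k: int = 50, alpha: float = 0.5) -> List[int]:
    # Stage 1 (imagination): synthesize the target the query describes.
    target = imagination.imagine(ref_image, mod_text)

    # Stage 2 (coarse filtering): rank all candidates by hybrid similarity
    # and keep a shortlist for the expensive verification stage.
    sims = similarity.score(target, candidates)
    shortlist = sorted(range(len(candidates)), key=lambda i: -sims[i])[:coarse_k]

    # Stage 3 (fine filtering + re-ranking): blend the coarse score with a
    # question-based consistency score and re-rank the shortlist.
    final = {i: alpha * sims[i]
                + (1 - alpha) * question.verify(ref_image, mod_text, candidates[i])
             for i in shortlist}
    return sorted(shortlist, key=lambda i: -final[i])

Restricting question-based verification to a shortlist keeps the expensive reasoning stage tractable while still letting it correct coarse-ranking mistakes.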


BibTeX

BibTeX Code Here