Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models

TMLR 2025
¹BCML, Heriot-Watt University
²University of Sydney

*Equal contribution

Corresponding author

Comparison of different token pruning methods. Attention-based and similarity-based methods prune tokens using attention scores and similarity scores, respectively. In contrast, divergence-based methods measure the change in model output and retain the token subset whose pruning causes minimal impact. Script (Graph-Structured and Query-Conditioned Token Pruning) combines graph-structured reduction of visual redundancy with query-conditioned semantic token selection to enable efficient pruning in MLLMs. In this example, Script preserves key visual cues, such as the silver pot on the stove, the pineapple beside the limes, and the flowers on the table, which other methods fail to retain consistently.
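
To make the distinction concrete, the sketch below gives hypothetical, simplified scoring rules for the three pruning families named above; the function names, signatures, and formulas are illustrative assumptions, not the compared methods' exact formulations.

import torch
import torch.nn.functional as F

def attention_score(attn):
    # attn: (heads, M, N) text-to-visual attention; keep high-attention tokens.
    return attn.mean(dim=(0, 1))

def similarity_score(visual):
    # visual: (N, d) token embeddings; keep tokens least similar to the rest.
    v = F.normalize(visual, dim=-1)
    return -(v @ v.T).mean(dim=-1)

def divergence_score(logits_full, logits_drop):
    # KL divergence between the model's output with all tokens and with one
    # token dropped; low divergence marks the dropped token as safe to prune.
    p = F.softmax(logits_full, dim=-1)
    q = F.softmax(logits_drop, dim=-1)
    return (p * (p.log() - q.log())).sum(dim=-1)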

Abstract

The rapid growth of visual tokens in multimodal large language models (MLLMs) leads to excessive memory consumption and inference latency, especially when handling high-resolution images and videos. Token pruning mitigates this issue by removing redundant tokens, but existing methods often ignore relevance to the user query or inherit the limitations of attention mechanisms, reducing their adaptability and effectiveness. To address these challenges, we propose Script, a plug-and-play pruning method that requires no retraining and generalizes across diverse MLLMs. Script comprises two modules: a graph-structured pruning module that removes visually redundant tokens, and a query-conditioned semantic pruning module that preserves query-relevant visual information. Together, they enhance performance on multimodal tasks. Experiments on fourteen benchmarks covering image and video understanding show that Script consistently achieves higher model efficiency and predictive accuracy than existing pruning methods. On LLaVA-NeXT-7B, it delivers prefill speedup and FLOP reduction while retaining 96.88% of the original performance. Code will be made publicly available upon acceptance.
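
As an illustration of the two modules working together, here is a minimal PyTorch sketch. It assumes cosine similarity for both the query-conditioned score and the redundancy test, and a greedy filter as a stand-in for graph-structured reduction; Script's actual scoring, graph construction, and hyperparameters are not specified on this page.

import torch
import torch.nn.functional as F

def prune_visual_tokens(visual, query, keep_ratio=0.25, sim_thresh=0.9):
    # visual: (N, d) visual token embeddings; query: (M, d) query embeddings.
    v = F.normalize(visual, dim=-1)
    q = F.normalize(query, dim=-1)
    # Query-conditioned semantic score: best match against any query token.
    semantic = (v @ q.T).max(dim=-1).values
    # Greedy redundancy filter: skip tokens too similar to an already-kept,
    # higher-scoring token (a simple stand-in for graph-structured reduction).
    kept = []
    for i in semantic.argsort(descending=True).tolist():
        if all(float(v[i] @ v[j]) < sim_thresh for j in kept):
            kept.append(i)
    k = max(1, int(keep_ratio * visual.size(0)))
    idx = torch.tensor(sorted(kept[:k]))  # restore original token order
    return visual[idx], idx

# Example: prune 576 LLaVA-style visual tokens down to 25%.
tokens, idx = prune_visual_tokens(torch.randn(576, 4096), torch.randn(12, 4096))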

Method


Our Script framework, a three-stage pruning pipeline: (a) overall architecture; (b) Query-Conditioned Semantic Pruning (QCSP); (c) Graph-Structured Pruning (GSP). Together, these modules remove semantically irrelevant and visually redundant tokens through a joint selection process.
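
The graph-structured step can be pictured as a reduction over a similarity graph. The sketch below keeps one representative token per neighborhood of mutually similar tokens; the threshold and greedy-cover rule are assumptions, since the page does not give GSP's exact construction.

import torch
import torch.nn.functional as F

def graph_structured_reduce(visual, sim_thresh=0.9):
    # Build a similarity graph over visual tokens and keep one representative
    # per neighborhood of mutually similar tokens (greedy cover).
    v = F.normalize(visual, dim=-1)
    adj = (v @ v.T) > sim_thresh          # (N, N) boolean adjacency
    visited = torch.zeros(visual.size(0), dtype=torch.bool)
    reps = []
    for i in range(visual.size(0)):
        if not visited[i]:
            reps.append(i)                # token i represents its neighbors
            visited |= adj[i]             # its neighbors are now covered
    idx = torch.tensor(reps)
    return visual[idx], idx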

BibTeX

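Pending the official entry, a skeleton assembled from the title and venue shown on this page; the citation key is an assumption and the author list is left unfilled:

@article{script2025,
  title   = {Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models},
  author  = {...},
  journal = {Transactions on Machine Learning Research},
  year    = {2025}
}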