Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models

TMLR 2025
¹BCML, Heriot-Watt University
²University of Sydney

*Equal contribution

Corresponding author

Comparison of different token pruning methods. Attention-based and similarity-based methods prune tokens using attention scores and similarity scores, respectively. In contrast, divergence-based methods measure the change in model output and retain the token subset whose pruning causes minimal impact. Script (Graph-Structured and Query-Conditioned Token Pruning) combines graph-structured reduction of visual redundancy with query-conditioned semantic token selection to enable efficient pruning in MLLMs. In this example, Script preserves key visual cues, such as the silver pot on the stove, the pineapple beside the limes, and the flowers on the table, which other methods fail to retain consistently.
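
To make the distinction concrete, the sketch below gives hypothetical, simplified scoring rules for the three pruning families named above; the function names, signatures, and formulas are illustrative assumptions, not the compared methods' exact formulations.

import torch
import torch.nn.functional as F

def attention_score(attn):
    # attn: (heads, M, N) text-to-visual attention; keep high-attention tokens.
    return attn.mean(dim=(0, 1))

def similarity_score(visual):
    # visual: (N, d) token embeddings; keep tokens least similar to the rest.
    v = F.normalize(visual, dim=-1)
    return -(v @ v.T).mean(dim=-1)

def divergence_score(logits_full, logits_drop):
    # KL divergence between the model's output with all tokens and with one
    # token dropped; low divergence marks the dropped token as safe to prune.
    p = F.softmax(logits_full, dim=-1)
    q = F.softmax(logits_drop, dim=-1)
    return (p * (p.log() - q.log())).sum(dim=-1)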

Abstract

The rapid growth of visual tokens in multimodal large language models (MLLMs) leads to excessive memory consumption and inference latency, especially when handling high-resolution images and videos. Token pruning mitigates this issue by removing redundant tokens, but existing methods often ignore relevance to the user query or inherit the limitations of attention mechanisms, reducing their adaptability and effectiveness. To address these challenges, we propose Script, a plug-and-play pruning method that requires no retraining and generalizes across diverse MLLMs. Script comprises two modules: a graph-structured pruning module that removes visually redundant tokens, and a query-conditioned semantic pruning module that preserves query-relevant visual information. Together, they enhance performance on multimodal tasks. Experiments on fourteen benchmarks covering image and video understanding show that Script consistently achieves higher model efficiency and predictive accuracy than existing pruning methods. On LLaVA-NeXT-7B, it delivers prefill speedup and FLOP reduction while retaining 96.88% of the original performance. Code will be made publicly available upon acceptance.
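
As an illustration of the two modules working together, here is a minimal PyTorch sketch. It assumes cosine similarity for both the query-conditioned score and the redundancy test, and a greedy filter as a stand-in for graph-structured reduction; Script's actual scoring, graph construction, and hyperparameters are not specified on this page.

import torch
import torch.nn.functional as F

def prune_visual_tokens(visual, query, keep_ratio=0.25, sim_thresh=0.9):
    # visual: (N, d) visual token embeddings; query: (M, d) query embeddings.
    v = F.normalize(visual, dim=-1)
    q = F.normalize(query, dim=-1)
    # Query-conditioned semantic score: best match against any query token.
    semantic = (v @ q.T).max(dim=-1).values
    # Greedy redundancy filter: skip tokens too similar to an already-kept,
    # higher-scoring token (a simple stand-in for graph-structured reduction).
    kept = []
    for i in semantic.argsort(descending=True).tolist():
        if all(float(v[i] @ v[j]) < sim_thresh for j in kept):
            kept.append(i)
    k = max(1, int(keep_ratio * visual.size(0)))
    idx = torch.tensor(sorted(kept[:k]))  # restore original token order
    return visual[idx], idx

# Example: prune 576 LLaVA-style visual tokens down to 25%.
tokens, idx = prune_visual_tokens(torch.randn(576, 4096), torch.randn(12, 4096))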

Method


Our Script framework, a three-stage pruning pipeline: (a) overall architecture; (b) Query-Conditioned Semantic Pruning (QCSP); (c) Graph-Structured Pruning (GSP). Together, these modules remove semantically irrelevant and visually redundant tokens through a joint selection process.
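
The graph-structured step can be pictured as a reduction over a similarity graph. The sketch below keeps one representative token per neighborhood of mutually similar tokens; the threshold and greedy-cover rule are assumptions, since the page does not give GSP's exact construction.

import torch
import torch.nn.functional as F

def graph_structured_reduce(visual, sim_thresh=0.9):
    # Build a similarity graph over visual tokens and keep one representative
    # per neighborhood of mutually similar tokens (greedy cover).
    v = F.normalize(visual, dim=-1)
    adj = (v @ v.T) > sim_thresh          # (N, N) boolean adjacency
    visited = torch.zeros(visual.size(0), dtype=torch.bool)
    reps = []
    for i in range(visual.size(0)):
        if not visited[i]:
            reps.append(i)                # token i represents its neighbors
            visited |= adj[i]             # its neighbors are now covered
    idx = torch.tensor(reps)
    return visual[idx], idx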

BibTeX

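Pending the official entry, a skeleton assembled from the title and venue shown on this page; the citation key is an assumption and the author list is left unfilled:

@article{script2025,
  title   = {Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models},
  author  = {...},
  journal = {Transactions on Machine Learning Research},
  year    = {2025}
}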