Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers


Abstract

We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers that requires no retraining or finetuning of the base model. Co-Me employs a lightweight distilled confidence predictor to rank tokens and selectively merge low-confidence ones, reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates the regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to $11.3\times$ and $7.2\times$ speedup, respectively, making visual geometric transformers practical for real-time 3D perception and reconstruction.

Motivation

Result

Real-world Demo

Qualitative Comparison

[Interactive before/after sliders: four examples comparing Original VGGT with its Co-Me accelerated version, and four examples comparing Original MapAnything with its Co-Me accelerated version.]

Methods

Figure 1. Overview of Co-Me. A lightweight module distilled from the frozen ViT backbone predicts per-token confidence from intermediate features. The resulting confidence is converted into a binary mask that guides token merging on the attention and MLP modules.
Figure 2. The proposed mask generation (left), merge (middle), and split (right) operators. Each sample generates an individual merge mask via confidence ranking and bottom-$p$ selection. A shared index map is used by merging and splitting, which aggregate (average or copy) and restore image tokens while preserving special tokens. Our efficient implementation supports varying merging masks across samples in the batch as long as the number of merged tokens remains consistent.
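The mask-generation, merge, and split operators described in Figure 2 can be sketched as follows. This is a minimal PyTorch sketch under stated assumptions, not the paper's implementation: the function names are invented, special tokens are assumed to be separated out before these calls, and low-confidence tokens are merged in consecutive pairs of the confidence ranking (the paper only specifies that merged tokens are aggregated by averaging and restored by copying via a shared index map).

```python
import torch

def build_merge_mask(conf, p):
    """Bottom-p selection: per-sample binary mask over the k lowest-confidence
    image tokens. conf: (B, N) confidence scores for image tokens only
    (special tokens are assumed to be excluded beforehand)."""
    B, N = conf.shape
    k = int(N * p)
    k -= k % 2                                   # merge in pairs, keep k even
    idx = conf.argsort(dim=1)[:, :k]             # bottom-k indices per sample
    mask = torch.zeros(B, N, dtype=torch.bool, device=conf.device)
    mask.scatter_(1, idx, True)
    return mask, idx

def merge_tokens(x, idx):
    """Merge operator: average consecutive pairs of the selected
    low-confidence tokens and keep the rest untouched. x: (B, N, D).
    Returns the shortened sequence and an index map reused by the split."""
    B, N, D = x.shape
    k = idx.shape[1]
    keep = torch.ones(B, N, dtype=torch.bool, device=x.device)
    keep.scatter_(1, idx, False)
    kept = x[keep].view(B, N - k, D)             # high-confidence tokens
    sel = torch.gather(x, 1, idx.unsqueeze(-1).expand(B, k, D))
    merged = sel.view(B, k // 2, 2, D).mean(dim=2)   # aggregate by averaging
    return torch.cat([kept, merged], dim=1), (keep, idx)

def split_tokens(y, index_map, D):
    """Split operator: restore the original sequence length by copying each
    merged token back to both of its source positions."""
    keep, idx = index_map
    B, N = keep.shape
    k = idx.shape[1]
    out = torch.empty(B, N, D, device=y.device, dtype=y.dtype)
    out[keep] = y[:, :N - k].reshape(-1, D)      # kept tokens go back in place
    restored = y[:, N - k:].unsqueeze(2).expand(B, k // 2, 2, D).reshape(B, k, D)
    out.scatter_(1, idx.unsqueeze(-1).expand(B, k, D), restored)
    return out
```

Because the merge mask varies per sample but the number of merged tokens $k$ is the same for every sample in the batch, the merged sequences all share one length and can be processed as a regular dense batch, which is the property the caption's "efficient implementation" relies on.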

Citation

@misc{chen2025comeconfidenceguidedtokenmerging,
      title={Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers}, 
      author={Yutian Chen and Yuheng Qiu and Ruogu Li and Ali Agha and Shayegan Omidshafiei and Jay Patrikar and Sebastian Scherer},
      year={2025},
      eprint={2511.14751},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.14751}, 
}