CVPR 2026

M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA

1 BharatGen     2 IIIT Hyderabad
* Equal contribution
Code · 🤗 HF Space · Dataset (soon) · 🤗 Models (soon) · arXiv (soon)
M3Grounder teaser figure

M3Grounder in action: Each example shows a QA pair. The predicted answer text contains interleaved [GROUND] tokens which map answer spans to corresponding grounding regions. A, B demonstrate precise segmentation without spillover into irrelevant regions. B shows effective grounding for dense, multi-span evidence in complex document layouts. C1, C2, C3 illustrate multi-granular grounding, where the grounding scope expands hierarchically (phrase \( \subset \) line \( \subset \) block).

Abstract

Document QA requires not only accurate answers but also identifying where each answer is grounded on the page. Most approaches treat the task as text-only generation, while existing answer grounding methods generate coarse bounding boxes that fail to capture curved text. We introduce M3Grounder, a hybrid vision-language and segmentation architecture that formulates document grounding as pixel-level segmentation.

M3Grounder produces fine-grained evidence masks refined by a bleed-suppression loss to prevent spillover. It autoregressively generates answer text interleaved with [GROUND] tokens that link individual answer spans to their corresponding evidence regions. M3Grounder also grounds evidence hierarchically across phrase, line, and block levels using an enclosure loss that enforces spatial containment (phrase ⊂ line ⊂ block).

We release GroundingDocQA, a large-scale dataset of 200K documents and 2M multi-span and multi-granular QA pairs with pixel-level grounding masks, built through a data engine that handles complex layouts, curved text, and graphic-rich documents. We also release GroundingDocQA-Bench, a diverse human-verified benchmark. M3Grounder sets a new state of the art in grounded DocVQA, advancing from coarse boxes to hierarchical, fine-grained, and contextually grounded mask evidence.

Method

M3Grounder architecture overview

Overview of M3Grounder. Given a document image and question, a VLM (1) encodes visual features and autoregressively generates an answer sequence (2) interleaved with [GROUND] tokens. The hidden states of [GROUND] tokens (3) are projected by three granularity-specific MLP heads into phrase-, line-, and block-level prompt embeddings (4). A promptable segmentation module (5) extracts dense image embeddings reused across all spans and granularities. The prompt embeddings are decoded by a mask decoder to produce hierarchical grounding masks (6), mapping each answer span to its evidence at all granularities.

🔗 Decoupled Language & Spatial Grounding

Unlike prior methods that interleave bounding-box coordinates inside generated text, M3Grounder decouples language modeling from spatial prediction. The VLM focuses on semantically accurate answers and span boundaries, while a dedicated segmentation head handles fine-grained, geometry-aware localization — yielding cleaner text outputs and more precise masks.

🔢 Autoregressive Multi-Span Grounding

M3Grounder autoregressively generates answer text with a special [GROUND] token emitted immediately after each answer span, maintaining a one-to-one span-to-region link. This naturally handles multi-span answers that draw evidence from multiple, spatially distinct document regions within a single decoding pass.
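The one-to-one link between spans and regions can be illustrated with a small parsing sketch. It assumes the decoded answer is a plain string containing literal `[GROUND]` markers and that the segmentation module returns one mask per marker, in order. The function name `split_grounded_answer` and the rule that the text between two consecutive markers counts as the span are illustrative assumptions, not the released implementation.

```python
GROUND = "[GROUND]"  # marker string as it appears in the decoded answer (assumed)


def split_grounded_answer(text, masks):
    """Pair each answer span with the mask of the [GROUND] token that follows it.

    `masks` is any sequence with one entry per [GROUND] token, in emission order.
    Text after the final marker is treated as ungrounded and dropped.
    """
    parts = text.split(GROUND)
    # parts[i] is the text emitted before the i-th marker; the last part is
    # whatever trails the final marker.
    spans = [part.strip() for part in parts[:-1]]
    if len(spans) != len(masks):
        raise ValueError(
            f"expected {len(spans)} masks (one per marker), got {len(masks)}"
        )
    return list(zip(spans, masks))


# Hypothetical two-span answer; mask objects are stood in for by string ids.
pairs = split_grounded_answer(
    "$1,250 [GROUND] and 5 March 2024 [GROUND]", ["mask_0", "mask_1"]
)
# pairs == [("$1,250", "mask_0"), ("and 5 March 2024", "mask_1")]
```

Because the pairing is purely positional, a mismatch between the number of markers and the number of masks is treated as an error rather than silently truncated.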

Multi-Granular Hierarchical Grounding

Documents follow a natural spatial hierarchy. M3Grounder jointly predicts evidence at three levels, enforced by a hierarchical enclosure loss.

🔤 Phrase Level

Fine-grained spans for extractive answers: names, dates, numeric values

📏 Line Level

Full text lines capturing neighboring context and reading-order cues

🔲 Block Level

Broader regions for abstractive or summary questions requiring wide context

Hierarchical Enclosure Loss enforces strict nesting: phrase ⊂ line ⊂ block
Multi-granular grounding example on a bank statement

Multi-granular grounding on financial documents. M3Grounder grounds answer evidence at three hierarchical levels. C1: phrase-level masks highlight the precise token spans corresponding to extracted values. C2: line-level masks capture the surrounding transaction context. C3: block-level masks cover the broader statement region containing the relevant entries.

Training Objectives

M3Grounder is trained end-to-end with four complementary loss terms:

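\[ \mathcal{L}_{\text{total}} = \lambda_{\text{lm}} \mathcal{L}_{\text{lm}} + \lambda_{\text{seg}} \mathcal{L}_{\text{seg}} + \lambda_{\text{bleed}} \mathcal{L}_{\text{bleed}} + \lambda_{\text{hier}} \mathcal{L}_{\text{hier}} \]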

\( \mathcal{L}_{\text{lm}} \): Language Modeling

Cross-entropy over all output tokens for accurate QA generation and span boundary prediction.

\( \mathcal{L}_{\text{seg}} \): Segmentation (Dice + BCE)

Supervises mask predictions at each granularity level against pixel-level ground truth.
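For readers unfamiliar with the combination, the sketch below shows the standard soft Dice and binary cross-entropy terms on a single mask. Masks are flattened to lists of per-pixel probabilities and 0/1 labels. The equal weighting, the smoothing constant, and the function names are assumptions for illustration; the page does not specify how the two terms are balanced.

```python
import math


def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss: 1 - 2|P∩G| / (|P| + |G|), with additive smoothing."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + smooth) / (sum(pred) + sum(target) + smooth)


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over pixels, with probabilities clamped for stability."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)


def seg_loss(pred, target, w_dice=1.0, w_bce=1.0):
    """Weighted sum of Dice and BCE (weights are illustrative, not from the paper)."""
    return w_dice * dice_loss(pred, target) + w_bce * bce_loss(pred, target)


target = [1, 0, 1, 0]
good = seg_loss([0.9, 0.1, 0.8, 0.2], target)  # close to the ground truth
bad = seg_loss([0.1, 0.9, 0.2, 0.8], target)   # inverted prediction
assert good < bad
```

In a multi-granular setting, the same function would simply be applied once per level (phrase, line, block) and the results summed or averaged.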

\( \mathcal{L}_{\text{bleed}} \): Bleed Suppression

Penalizes mask overlap beyond text pixels, preventing spillover into irrelevant background regions.
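The page does not give the formula for this term, so the following is only one plausible reading: measure the share of predicted mask mass that lands outside the evidence text pixels. The name `bleed_penalty` and the normalisation by total predicted mass are assumptions made for this sketch.

```python
def bleed_penalty(pred, text_pixels, eps=1e-7):
    """Fraction of predicted mask mass that falls on non-text pixels.

    pred        -- flat list of per-pixel mask probabilities in [0, 1]
    text_pixels -- flat list of 0/1 labels marking the evidence text pixels (assumed input)
    """
    spill = sum(p * (1 - t) for p, t in zip(pred, text_pixels))
    return spill / (sum(pred) + eps)


text = [1, 0, 0, 0]
tight = bleed_penalty([1.0, 0.0, 0.0, 0.0], text)  # mask stays on the text pixel
leaky = bleed_penalty([1.0, 1.0, 0.0, 0.0], text)  # half the mask mass spills over
assert tight < leaky
```

Normalising by the predicted mass keeps the penalty in [0, 1] regardless of mask size, so small phrase masks and large block masks are penalised on the same scale.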

\( \mathcal{L}_{\text{hier}} \): Hierarchical Enclosure

Enforces phrase ⊂ line ⊂ block nesting, stabilizing multi-granular grounding spatially.
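A containment constraint of this kind can be written as a hinge on per-pixel probabilities: wherever the inner mask is more confident than the mask that should enclose it, the difference is penalised. This is a minimal sketch under that assumption; the exact form of the enclosure loss is not stated on this page, and the function names are hypothetical. Note that the hinge enforces non-strict containment (inner ⊆ outer).

```python
def enclosure_penalty(inner, outer):
    """Mean hinge penalty for pixels where the inner mask exceeds the outer mask."""
    return sum(max(0.0, i - o) for i, o in zip(inner, outer)) / len(inner)


def hier_loss(phrase, line, block):
    """Penalise violations of phrase-in-line and line-in-block containment."""
    return enclosure_penalty(phrase, line) + enclosure_penalty(line, block)


# Properly nested masks incur no penalty.
nested = hier_loss([0, 1, 0, 0], [0, 1, 1, 0], [1, 1, 1, 0])
# The phrase mask covers pixel 0, which the line mask does not.
violating = hier_loss([1, 1, 0, 0], [0, 1, 1, 0], [1, 1, 1, 0])
assert nested == 0.0 and violating > 0.0
```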

Qualitative Results

M3Grounder produces precise, hierarchical segmentation masks across diverse document types — scanned documents, forms, charts, and curved-text banners.

Precise multi-span grounding in financial reports, linking narrative claims to both textual evidence and tabular values.

Medical report reasoning with grounded evidence, connecting abnormal lab measurements to diagnostic interpretation.

Dense multi-span grounding in legal documents, retrieving multiple clauses that jointly satisfy a query.

Structured evidence grounding in academic records, extracting identifiers and performance indicators across table rows.

GroundingDocQA Dataset & Benchmark

Dataset page coming soon. We are preparing the GroundingDocQA dataset and benchmark for public release. Check back here or watch the GitHub repo for updates.

📦 GroundingDocQA

Large-scale training dataset built with a novel data engine that handles complex layouts, curved-text, and graphic-rich documents — annotated with pixel-level masks at three granularities.

200K Documents
2M QA Pairs
Multi-Span Annotation
3 Mask Levels (phrase, line, block)
Document types: Text-rich docs, Charts, Forms & tables, Curved text, Webpages, Reports

📊 GroundingDocQA-Bench

Human-verified benchmark for systematic evaluation of grounding fidelity under multi-span, multi-granular, and curved-text conditions with mask-level annotations.

2.5K Documents
5K QA Pairs
Human Verified
Curved Text Split (CS)
Metrics: Single-span F₁, Multi-span F₁, Curved F₁ (CS), G-Eval (AQ)
GroundingDocQA data engine pipeline

GroundingDocQA data engine. The pipeline processes layout-aware, curved-text, and chart documents, generating multi-span and multi-granular QA pairs with pixel-level mask annotations at three levels (phrase, line, block). A human-in-the-loop verification stage ensures annotation quality across all document types.

Benchmark Results

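M3Grounder achieves state-of-the-art grounding performance on all four evaluated benchmarks. M3Grounder-Q scores 81.4, 73.3, 68.2, and 79.0 on BD-Test, DOGR-Bench, MMDocBench, and GDQA-Bench respectively, ahead of the strongest baseline on each: Gemini-2.5-Pro at 70.0, 59.3, and 49.4 on the first three, and fine-tuned Qwen3-VL at 60.6 on GDQA-Bench. Gains are most pronounced on curved and skewed text, where segmentation-based localization substantially outperforms bounding-box approaches.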

Model             Params   BD-Test F1g   DOGR-Bench F1g   MMDocBench IoU   GDQA-Bench F1g
GPT-4o            -        2.3           2.9              2.5              3.4
GPT-5             -        5.4           3.6              9.3              4.5
Gemini-2.5-Pro    -        70.0          59.3             49.4             43.4
Gemma-3           12B      0.5           1.77             1.6              1.5
Kimi VL           16B      8.1           7.2              3.5              0.5
mPLUG-DocOwl2     9B       1.2           0.0              0.0              0.0
InternVL3.5       8B       41.5          12.5             15.5             7.6
Qwen3-VL          8B       44.5          27.6             28.7             12.8
Kosmos 2.5 Chat   1.3B     0.7           0.0              0.0              0.0
InternVL3.5 ft.   8B       58.7          27.6             37.6             52.7
Qwen3-VL ft.      8B       62.3          35.8             43.4             60.6
M3Grounder-I      8B       77.2          69.6             65.5             71.3
M3Grounder-Q      8B       81.4          73.3             68.2             79.0
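The page does not define how the grounding F1 (F1g) or IoU columns are computed. As a point of reference only, the sketch below shows the usual pixel-level definitions of IoU and F1 between a predicted and a ground-truth binary mask. Treating the metrics as pixel-level, and the function names themselves, are assumptions; the benchmarks may use region-level matching or thresholds instead.

```python
def mask_iou(pred, gt):
    """Intersection over union of two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0  # two empty masks count as a match


def mask_f1(pred, gt):
    """Pixel-level F1 (equivalently, the Dice coefficient) of two flat binary masks."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)
    fp = sum(1 for p, g in zip(pred, gt) if p and not g)
    fn = sum(1 for p, g in zip(pred, gt) if g and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


pred = [1, 1, 0, 0]
gt = [0, 1, 1, 0]
iou = mask_iou(pred, gt)  # 1 shared pixel out of 3 covered pixels
f1 = mask_f1(pred, gt)    # tp=1, fp=1, fn=1
assert abs(iou - 1 / 3) < 1e-9 and abs(f1 - 0.5) < 1e-9
```

For the same pair of masks F1 is never lower than IoU, which is worth keeping in mind when comparing the IoU column with the F1g columns.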

Results


79.0 GDQA-Bench F₁ (ours)
73.3 DOGR-Bench F₁ (M3Grounder-Q)
87.5 Block-level F₁ (M3Grounder-Q)

Multi-Granularity Results on GroundingDocQA-Bench

Block-level grounding achieves the highest F₁ for both model variants, with scores rising steadily from phrase to line to block, suggesting that wider spatial context is effectively leveraged.

Model                 Phrase F₁   Line F₁   Block F₁
M3Grounder-I          75.2        79.8      82.5
M3Grounder-Q (best)   80.1        84.3      87.5