M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA

CVPR 2026

Venkat Kesav Venna¹,*, Sai Madhusudan Gunda²,*, Jyothi Swaroopa Jinka²,†, Hrithik Sagar Rachakonda²,†, Anirudh Srinivasan¹, Ravi Kiran Sarvadevabhatla¹,²

¹ BharatGen ² IIIT Hyderabad

* Equal contribution † Equal contribution

M3Grounder in action: Each example shows a QA pair. The predicted answer text contains interleaved [GROUND] tokens which map answer spans to corresponding grounding regions. A, B demonstrate precise segmentation without spillover into irrelevant regions. B shows effective grounding for dense, multi-span evidence in complex document layouts. C1, C2, C3 illustrate multi-granular grounding, where the grounding scope expands hierarchically (phrase ⊂ line ⊂ block).

Abstract

Document QA requires not only accurate answers but also identifying where each answer is grounded on the page. Most approaches treat the task as text-only generation, while existing answer grounding methods generate coarse bounding boxes that fail to capture curved text. We introduce M3Grounder, a hybrid vision-language and segmentation architecture that formulates document grounding as pixel-level segmentation.

M3Grounder produces fine-grained evidence masks refined by a bleed-suppression loss to prevent spillover. It autoregressively generates answer text interleaved with [GROUND] tokens that link individual answer spans to their corresponding evidence regions. M3Grounder also grounds evidence hierarchically across phrase, line, and block levels using an enclosure loss that enforces spatial containment (phrase ⊂ line ⊂ block).

We release GroundingDocQA, a large-scale dataset of 200K documents and 2M multi-span and multi-granular QA pairs with pixel-level grounding masks, built through a data engine that handles complex layouts, curved text, and graphic-rich documents. We also release GroundingDocQA-Bench, a diverse human-verified benchmark. M3Grounder sets a new state of the art in grounded DocVQA, advancing from coarse boxes to hierarchical, fine-grained, and contextually grounded mask evidence.

Method

Overview of M3Grounder. Given a document image and question, a VLM (1) encodes visual features and autoregressively generates an answer sequence (2) interleaved with [GROUND] tokens. The hidden states of [GROUND] tokens (3) are projected by three granularity-specific MLP heads into phrase-, line-, and block-level prompt embeddings (4). A promptable segmentation module (5) extracts dense image embeddings reused across all spans and granularities. The prompt embeddings are decoded by a mask decoder to produce hierarchical grounding masks (6), mapping each answer span to its evidence at all granularities.
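Steps (3) and (4) can be pictured with a minimal pure-Python sketch. Everything concrete in it is an assumption made for illustration: the two-layer ReLU MLP, the dimensions, the random weights, and the names `make_mlp_head` and `project_ground_states` do not come from the paper. The only point is that each [GROUND] hidden state yields three prompt embeddings, one per granularity.

```python
import random
from typing import Callable, Dict, List

Vector = List[float]
GRANULARITIES = ("phrase", "line", "block")


def make_mlp_head(in_dim: int, hidden_dim: int, out_dim: int, seed: int) -> Callable[[Vector], Vector]:
    """Build a toy two-layer ReLU MLP with random weights (hypothetical stand-in)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]

    def head(hidden_state: Vector) -> Vector:
        # Linear layer followed by ReLU, then a second linear layer.
        hidden = [max(0.0, sum(w * x for w, x in zip(row, hidden_state))) for row in w1]
        return [sum(w * x for w, x in zip(row, hidden)) for row in w2]

    return head


def project_ground_states(
    ground_states: List[Vector],
    heads: Dict[str, Callable[[Vector], Vector]],
) -> List[Dict[str, Vector]]:
    """Map each [GROUND] hidden state to one prompt embedding per granularity."""
    return [{g: heads[g](state) for g in GRANULARITIES} for state in ground_states]


# Two [GROUND] tokens with 8-dimensional hidden states (made-up sizes).
heads = {g: make_mlp_head(8, 16, 4, seed=i) for i, g in enumerate(GRANULARITIES)}
states = [[0.1 * k for k in range(8)], [0.05 * k for k in range(8)]]
prompts = project_ground_states(states, heads)
print(len(prompts), sorted(prompts[0]))
```

In the described pipeline these prompt embeddings would then be passed to the mask decoder together with the dense image embeddings, which are computed once and reused across spans and granularities.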

🔗 Decoupled Language & Spatial Grounding

Unlike prior methods that interleave bounding-box coordinates inside generated text, M3Grounder decouples language modeling from spatial prediction. The VLM focuses on semantically accurate answers and span boundaries, while a dedicated segmentation head handles fine-grained, geometry-aware localization — yielding cleaner text outputs and more precise masks.

🔢 Autoregressive Multi-Span Grounding

M3Grounder autoregressively generates answer text and emits a special [GROUND] token immediately after each answer span, maintaining a one-to-one link between spans and regions. Multi-span answers that draw evidence from multiple, spatially distinct document regions are therefore handled within a single generated sequence.
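To make the span-to-region link concrete, here is a small sketch of how a generated string could be split into spans and mask indices. It assumes that everything between two consecutive [GROUND] tokens counts as one answer span, which the page does not state, and the function name `split_grounded_answer` and the example answer are hypothetical.

```python
from typing import List, Tuple

GROUND_TOKEN = "[GROUND]"


def split_grounded_answer(text: str) -> Tuple[str, List[Tuple[str, int]]]:
    """Return the clean answer text and (span, mask_index) pairs.

    Assumption: the text since the previous [GROUND] token (or the start)
    is the span that the next [GROUND] token grounds.
    """
    pieces = text.split(GROUND_TOKEN)
    # Every piece except the last is followed by a [GROUND] token,
    # so the i-th such piece is linked to the i-th predicted mask.
    spans = [(piece.strip(), index) for index, piece in enumerate(pieces[:-1])]
    clean = " ".join(piece.strip() for piece in pieces if piece.strip())
    return clean, spans


generated = "Invoice total: $1,250 [GROUND] due on 12 March [GROUND]"
clean_text, span_links = split_grounded_answer(generated)
print(clean_text)
print(span_links)
```

Because the index is just the position of the [GROUND] token in the sequence, the i-th span pairs with the i-th decoded mask without any coordinates appearing in the text.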

Multi-Granular Hierarchical Grounding

Documents follow a natural spatial hierarchy. M3Grounder jointly predicts evidence at three levels, enforced by a hierarchical enclosure loss.

Phrase Level: Fine-grained spans for extractive answers, such as names, dates, and numeric values

Line Level: Full text lines capturing neighboring context and reading-order cues

Block Level: Broader regions for abstractive or summary questions requiring wide context

Multi-granular grounding on financial documents. M3Grounder grounds answer evidence at three hierarchical levels. C1 Phrase-level masks highlight the precise token spans corresponding to extracted values. C2 Line-level masks capture the surrounding transaction context, while C3 Block-level grounding retrieves the broader statement region containing the relevant entries.
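The nesting phrase ⊂ line ⊂ block can be checked directly on binarized masks. The sketch below is an illustrative sanity check, not part of the described model; the toy masks and the names `enclosure_violations` and `is_nested` are hypothetical.

```python
from typing import List

Mask = List[List[int]]  # binary mask, 1 = evidence pixel


def enclosure_violations(inner: Mask, outer: Mask) -> int:
    """Count pixels that are set in `inner` but not in `outer`."""
    return sum(
        1
        for inner_row, outer_row in zip(inner, outer)
        for i, o in zip(inner_row, outer_row)
        if i and not o
    )


def is_nested(phrase: Mask, line: Mask, block: Mask) -> bool:
    """True when phrase is contained in line and line is contained in block."""
    return enclosure_violations(phrase, line) == 0 and enclosure_violations(line, block) == 0


# Toy 2x4 masks: a phrase inside a text line inside a block.
phrase = [[0, 1, 0, 0], [0, 0, 0, 0]]
line = [[1, 1, 1, 0], [0, 0, 0, 0]]
block = [[1, 1, 1, 1], [1, 1, 1, 1]]
print(is_nested(phrase, line, block))
```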

Training Objectives

M3Grounder is trained end-to-end with four complementary loss terms:

𝓛_total = λ_lm𝓛_lm + λ_seg𝓛_seg + λ_bleed𝓛_bleed + λ_hier𝓛_hier

𝓛_lm — Language Modeling

Cross-entropy over all output tokens for accurate QA generation and span boundary prediction.

𝓛_seg — Segmentation (Dice + BCE)

Supervises mask predictions at each granularity level against pixel-level ground truth.

𝓛_bleed — Bleed Suppression

Penalizes mask overlap beyond text pixels, preventing spillover into irrelevant background regions.

𝓛_hier — Hierarchical Enclosure

Enforces phrase ⊂ line ⊂ block nesting, stabilizing multi-granular grounding spatially.
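The page names the four terms but does not give their exact formulas, so the following is only a sketch of one plausible form for each mask loss. The specific definitions are assumptions: predicted probability mass outside the ground-truth mask for bleed suppression, and a hinge on inner-minus-outer probability for enclosure. The weights are assumed too. Masks are flattened to lists of per-pixel probabilities, and the segmentation term is shown for a single mask although it is applied at each granularity.

```python
import math
from typing import Sequence, Tuple

EPS = 1e-7


def dice_bce_loss(pred: Sequence[float], target: Sequence[int]) -> float:
    """Dice loss plus mean binary cross-entropy for one mask."""
    clipped = [min(max(p, EPS), 1.0 - EPS) for p in pred]  # avoid log(0)
    bce = -sum(
        t * math.log(p) + (1 - t) * math.log(1.0 - p) for p, t in zip(clipped, target)
    ) / len(pred)
    overlap = sum(p * t for p, t in zip(pred, target))
    dice = 1.0 - (2.0 * overlap + EPS) / (sum(pred) + sum(target) + EPS)
    return dice + bce


def bleed_loss(pred: Sequence[float], target: Sequence[int]) -> float:
    """Assumed form: mean predicted probability on pixels outside the target."""
    return sum(p * (1 - t) for p, t in zip(pred, target)) / len(pred)


def enclosure_loss(inner: Sequence[float], outer: Sequence[float]) -> float:
    """Assumed form: penalize pixels where the inner mask exceeds the outer one."""
    return sum(max(0.0, pi - po) for pi, po in zip(inner, outer)) / len(inner)


def total_loss(
    lm: float,
    seg: float,
    bleed: float,
    hier: float,
    weights: Tuple[float, float, float, float] = (1.0, 1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the four terms; the default weights are placeholders."""
    w_lm, w_seg, w_bleed, w_hier = weights
    return w_lm * lm + w_seg * seg + w_bleed * bleed + w_hier * hier


# A prediction that spills one pixel past the ground-truth region.
pred = [1.0, 1.0, 0.0, 0.0]
target = [1, 0, 0, 0]
print(bleed_loss(pred, target))
```

Under these assumed forms, the hierarchical term for one span would be `enclosure_loss(phrase, line) + enclosure_loss(line, block)`, which is zero exactly when the predicted probabilities are nested pixel by pixel.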

GroundingDocQA Dataset & Benchmark

📦 GroundingDocQA

Large-scale training dataset built with a novel data engine that handles complex layouts, curved-text, and graphic-rich documents — annotated with pixel-level masks at three granularities.

200K documents · 2M QA pairs · multi-span annotation · 3 mask levels (phrase, line, block)

Document types: text-rich docs, charts, forms & tables, curved text, webpages, reports

📊 GroundingDocQA-Bench

Human-verified benchmark for systematic evaluation of grounding fidelity under multi-span, multi-granular, and curved-text conditions with mask-level annotations.

2.5K documents · human-verified QA pairs · curved-text split (CS)

Metrics: single-span F₁, multi-span F₁, curved F₁ (CS), G-Eval (AQ)

GroundingDocQA data engine. The pipeline processes layout-aware, curved-text, and chart documents, generating multi-span and multi-granular QA pairs with pixel-level mask annotations at three levels (phrase, line, block). A human-in-the-loop verification stage ensures annotation quality across all document types.

Benchmark Results

M3Grounder achieves state-of-the-art grounding performance across all evaluated benchmarks. Gains are most pronounced on curved and skewed text, where segmentation-based localization substantially outperforms bounding-box approaches.

Model	Params	BD-Test F1_g	DOGR-Bench F1_g	MMDocBench IoU	GDQA-Bench F1_g
GPT-4o	-	2.3	2.9	2.5	3.4
GPT-5	-	5.4	3.6	9.3	4.5
Gemini-2.5-Pro	-	70.0	59.3	49.4	43.4
Gemma-3	12B	0.5	1.77	1.6	1.5
Kimi VL	16B	8.1	7.2	3.5	0.5
mPLUG-DocOwl2	9B	1.2	0.0	0.0	0.0
InternVL3.5	8B	41.5	12.5	15.5	7.6
Qwen3-VL	8B	44.5	27.6	28.7	12.8
Kosmos 2.5 Chat	1.3B	0.7	0.0	0.0	0.0
InternVL3.5 ft.	8B	58.7	27.6	37.6	52.7
Qwen3-VL ft.	8B	62.3	35.8	43.4	60.6
M3Grounder-I	8B	77.2	69.6	65.5	71.3
M3Grounder-Q	8B	81.4	73.3	68.2	79.0

Model	Phrase F₁	Line F₁	Block F₁
M3Grounder-I	75.2	79.8	82.5
M3Grounder-Q	80.1	84.3	87.5
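The tables above report F1 and IoU scores over grounded evidence regions. This page does not define F1_g or the matching protocol, so the sketch below only shows the generic pixel-level IoU and F1 between one predicted and one ground-truth binary mask. The function name `mask_iou_and_f1` and the convention of scoring two empty masks as a perfect match are assumptions.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask, 1 = evidence pixel


def mask_iou_and_f1(pred: Mask, gt: Mask) -> Tuple[float, float]:
    """Pixel-level IoU and F1 (Dice) between two binary masks of equal size."""
    intersection = pred_area = gt_area = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            pred_area += p
            gt_area += g
            intersection += p & g
    union = pred_area + gt_area - intersection
    if union == 0:
        # Both masks empty: treated here as a perfect match (assumed convention).
        return 1.0, 1.0
    iou = intersection / union
    f1 = 2 * intersection / (pred_area + gt_area)
    return iou, f1


pred_mask = [[1, 1], [0, 0]]
gt_mask = [[1, 0], [0, 0]]
iou, f1 = mask_iou_and_f1(pred_mask, gt_mask)
print(round(iou, 3), round(f1, 3))
```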

General Document VQA

M3Grounder retains strong performance across 10 general document understanding benchmarks, showing no trade-off between grounding capability and VQA accuracy. Bold = highest score, underline = second-highest. — = not reported.

Model	Size	Doc VQA	Info VQA	Deep Form	KLC	WTQ	Tab Fact	Chart QA	Text VQA	Text Caps	Visual MRC
DocOwl-2	8B	80.7	46.4	66.8	37.5	36.5	78.2	70.0	66.7	131.8	217.4
InternVL3.5	8B	92.3	79.1	—	—	—	—	86.7	78.2	—	—
Qwen3-VL	8B	96.1	83.1	—	—	—	—	—	—	—	—
DOGR	8B	91.7	70.7	70.8	40.4	58.8	84.5	83.6	76.6	145.9	332.5
M3Grounder-I	8B	90.3	76.2	71.9	41.6	59.7	81.9	82.2	78.6	147.2	291.8
M3Grounder-Q	8B	92.1	79.6	72.8	42.8	60.4	85.3	83.9	78.1	148.1	334.2

Table 3. Performance comparison of M3Grounder across 10 general document QA benchmarks. We compare against other models of similar size (8B). Bold = highest, underline = second-highest. — = not reported by the original work.

BibTeX

If you find our work useful in your research, please consider citing:

m3grounder_cvpr2026.bib

@inproceedings{venna2026m3grounder,
  title     = {M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA},
  author    = {Venna, Venkata Kesav and Gunda, Sai Madhusudan and Jinka, Jyothi Swaroopa and Rachakonda, Hrithik Sagar and Srinivasan, Anirudh and Sarvadevabhatla, Ravi Kiran},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages     = {23685--23695},
  year      = {2026}
}