Document QA requires not only accurate answers but also identifying where each answer is grounded on the page. Most approaches treat the task as text-only generation, while existing answer grounding methods generate coarse bounding boxes that fail to capture curved text. We introduce M3Grounder, a hybrid vision-language and segmentation architecture that formulates document grounding as pixel-level segmentation.
M3Grounder produces fine-grained evidence masks refined by a bleed-suppression loss to
prevent spillover. It autoregressively generates answer text interleaved with [GROUND]
tokens that link individual answer spans to their corresponding evidence regions. M3Grounder also grounds
evidence hierarchically across phrase, line, and block levels using an enclosure loss that
enforces spatial containment (phrase ⊂ line ⊂ block).
We release GroundingDocQA, a large-scale dataset of 200K documents and 2M multi-span and multi-granular QA pairs with pixel-level grounding masks, built through a data engine handling complex layouts, curved-text, and graphic-rich documents. We also release GroundingDocQA-Bench, a diverse human-verified benchmark. M3Grounder sets a new state-of-the-art in grounded DocVQA, advancing from coarse boxes to hierarchical, fine-grained and contextually grounded mask evidence.
Overview of M3Grounder. Given a document image and question, a VLM (1) encodes visual features and autoregressively generates an answer sequence (2) interleaved with [GROUND] tokens. The hidden states of [GROUND] tokens (3) are projected by three granularity-specific MLP heads into phrase-, line-, and block-level prompt embeddings (4). A promptable segmentation module (5) extracts dense image embeddings reused across all spans and granularities. The prompt embeddings are decoded by a mask decoder to produce hierarchical grounding masks (6), mapping each answer span to its evidence at all granularities.
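Steps (3) and (4) of the overview can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration: the hidden and prompt dimensions, the two-layer ReLU form of each head, the random weights, and the token id `99` standing in for [GROUND]. The source only states that three granularity-specific MLP heads project the [GROUND] hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM, PROMPT_DIM = 64, 32  # hypothetical sizes, not from the source
LEVELS = ("phrase", "line", "block")


def make_head(in_dim, out_dim):
    """Random weights for one two-layer MLP head (illustrative only)."""
    return {
        "w1": rng.normal(0.0, 0.02, (in_dim, in_dim)),
        "b1": np.zeros(in_dim),
        "w2": rng.normal(0.0, 0.02, (in_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def apply_head(head, h):
    """Linear -> ReLU -> Linear on hidden states of shape (n, in_dim)."""
    hidden = np.maximum(h @ head["w1"] + head["b1"], 0.0)
    return hidden @ head["w2"] + head["b2"]


# One independent head per granularity level.
heads = {level: make_head(HIDDEN_DIM, PROMPT_DIM) for level in LEVELS}


def project_ground_tokens(hidden_states, token_ids, ground_id):
    """Select hidden states at [GROUND] positions and project them per level."""
    positions = np.flatnonzero(token_ids == ground_id)
    h = hidden_states[positions]
    return {level: apply_head(heads[level], h) for level in LEVELS}


# Toy sequence of 6 tokens with [GROUND] (assumed id 99) at positions 2 and 5.
token_ids = np.array([11, 12, 99, 13, 14, 99])
hidden_states = rng.normal(size=(6, HIDDEN_DIM))
prompts = project_ground_tokens(hidden_states, token_ids, ground_id=99)
```

Each entry of `prompts` holds one prompt embedding per [GROUND] token, so two answer spans yield two phrase-, two line-, and two block-level prompts that a mask decoder could consume alongside the shared image embeddings.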
Unlike prior methods that interleave bounding-box coordinates inside generated text, M3Grounder decouples language modeling from spatial prediction. The VLM focuses on semantically accurate answers and span boundaries, while a dedicated segmentation head handles fine-grained, geometry-aware localization — yielding cleaner text outputs and more precise masks.
M3Grounder autoregressively generates answer text with special [GROUND] tokens
emitted immediately after each answer span, maintaining a one-to-one span-to-region link.
This naturally handles multi-span answers drawing evidence from multiple, spatially distinct
document regions in a single forward pass.
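The span-to-region link can be sketched as a small post-processing step on the generated string. This sketch assumes, as a simplification, that each answer span is the text between the previous [GROUND] token (or the start of the answer) and the next one; the source does not specify how span starts are marked, and the helper name `split_grounded_answer` is hypothetical.

```python
GROUND = "[GROUND]"


def split_grounded_answer(generated: str):
    """Return (clean_answer, spans); spans[i] is the text closed by the i-th [GROUND]."""
    pieces = generated.split(GROUND)
    # Every piece except the last is terminated by a [GROUND] token.
    spans = [piece.strip() for piece in pieces[:-1]]
    clean = " ".join(piece.strip() for piece in pieces if piece.strip())
    return clean, spans


generated = "Invoice total: $120.50 [GROUND] due on 2024-03-01 [GROUND]"
clean, spans = split_grounded_answer(generated)

# The i-th span pairs with the i-th predicted mask set (one-to-one link).
links = list(enumerate(spans))
```

Because the index of each span matches the index of its [GROUND] token, the same index selects the corresponding masks, which is what keeps multi-span answers unambiguous.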
Documents follow a natural spatial hierarchy. M3Grounder jointly predicts evidence at three levels, enforced by a hierarchical enclosure loss.
Phrase: fine-grained spans for extractive answers such as names, dates, and numeric values
Line: full text lines capturing neighboring context and reading-order cues
Block: broader regions for abstractive or summary questions requiring wide context
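One way to picture an enclosure constraint over the three levels is a hinge penalty on soft masks: any probability a finer mask assigns beyond what the coarser mask assigns at the same pixel is penalized. The hinge form below is an assumption chosen for illustration, not the formulation used by M3Grounder, and the function name is hypothetical.

```python
import numpy as np


def enclosure_loss(phrase, line, block):
    """Penalize mass of a finer mask that falls outside the coarser one.

    Inputs are soft masks in [0, 1] with identical shape. The hinge form is
    an illustrative assumption, not the loss defined by the source.
    """
    phrase_outside_line = np.maximum(phrase - line, 0.0)
    line_outside_block = np.maximum(line - block, 0.0)
    return float(phrase_outside_line.mean() + line_outside_block.mean())


block = np.zeros((8, 8))
block[1:7, 1:7] = 1.0
line = np.zeros((8, 8))
line[3:5, 1:7] = 1.0
phrase = np.zeros((8, 8))
phrase[3:5, 2:4] = 1.0

# Properly nested masks: phrase inside line inside block.
nested = enclosure_loss(phrase, line, block)

# Two phrase pixels spill below the line mask, violating containment.
leaky = phrase.copy()
leaky[5, 2:4] = 1.0
violated = enclosure_loss(leaky, line, block)
```

The nested configuration incurs no penalty, while the leaky phrase mask is charged in proportion to the number of pixels that escape the line mask.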
Multi-granular grounding on financial documents. M3Grounder grounds answer evidence at three hierarchical levels: (C1) phrase-level masks highlight the precise token spans corresponding to extracted values, (C2) line-level masks capture the surrounding transaction context, and (C3) block-level masks retrieve the broader statement region containing the relevant entries.
M3Grounder is trained end-to-end with four complementary loss terms:
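$\mathcal{L}_{\text{total}} = \lambda_{\text{lm}}\mathcal{L}_{\text{lm}} + \lambda_{\text{seg}}\mathcal{L}_{\text{seg}} + \lambda_{\text{bleed}}\mathcal{L}_{\text{bleed}} + \lambda_{\text{hier}}\mathcal{L}_{\text{hier}}$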
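The weighted sum over the four terms can be worked through numerically. In the sketch below, the weight values, the placeholder values for the language-modeling, segmentation, and hierarchy terms, and the specific form of `bleed_loss` (mean predicted probability on pixels outside the target mask) are all assumptions for illustration; the source names the terms but does not give their definitions or weights here.

```python
import numpy as np


def bleed_loss(pred, target):
    """Mean predicted probability on pixels outside the target mask (assumed form)."""
    outside = target == 0
    if not outside.any():
        return 0.0
    return float(pred[outside].mean())


def total_loss(terms, weights):
    """Weighted sum over the four named loss terms."""
    return sum(weights[name] * terms[name] for name in ("lm", "seg", "bleed", "hier"))


target = np.zeros((4, 4))
target[1:3, 1:3] = 1.0
pred = target.copy()
pred[0, 1:3] = 0.6  # probability spilling above the evidence region

# Placeholder scalar values for the other three terms (hypothetical).
terms = {"lm": 1.2, "seg": 0.4, "bleed": bleed_loss(pred, target), "hier": 0.1}
weights = {"lm": 1.0, "seg": 1.0, "bleed": 0.5, "hier": 0.5}  # hypothetical values
loss = total_loss(terms, weights)
```

Here the spillover contributes a bleed term of 0.1 (1.2 units of probability spread over 12 background pixels), and the weighted total comes to 1.7.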
M3Grounder produces precise, hierarchical segmentation masks across diverse document types — scanned documents, forms, charts, and curved-text banners.
Precise multi-span grounding in financial reports, linking narrative claims to both textual evidence and tabular values.
Medical report reasoning with grounded evidence, connecting abnormal lab measurements to diagnostic interpretation.
Dense multi-span grounding in legal documents, retrieving multiple clauses that jointly satisfy a query.
Structured evidence grounding in academic records, extracting identifiers and performance indicators across table rows.
GroundingDocQA: large-scale training dataset built with a data engine that handles complex layouts, curved-text, and graphic-rich documents, annotated with pixel-level masks at three granularities.
GroundingDocQA-Bench: human-verified benchmark for systematic evaluation of grounding fidelity under multi-span, multi-granular, and curved-text conditions with mask-level annotations.
GroundingDocQA data engine. The pipeline processes layout-aware, curved-text, and chart documents, generating multi-span and multi-granular QA pairs with pixel-level mask annotations at three levels (phrase, line, block). A human-in-the-loop verification stage ensures annotation quality across all document types.
M3Grounder achieves state-of-the-art grounding performance across all evaluated benchmarks. Gains are most pronounced on curved and skewed text, where segmentation-based localization substantially outperforms bounding-box approaches.
| Model | Params | BD-Test F1g | DOGR-Bench F1g | MMDocBench IoU | GDQA-Bench F1g |
|---|---|---|---|---|---|
| GPT-4o | - | 2.3 | 2.9 | 2.5 | 3.4 |
| GPT-5 | - | 5.4 | 3.6 | 9.3 | 4.5 |
| Gemini-2.5-Pro | - | 70.0 | 59.3 | 49.4 | 43.4 |
| Gemma-3 | 12B | 0.5 | 1.77 | 1.6 | 1.5 |
| Kimi VL | 16B | 8.1 | 7.2 | 3.5 | 0.5 |
| mPLUG-DocOwl2 | 9B | 1.2 | 0.0 | 0.0 | 0.0 |
| InternVL3.5 | 8B | 41.5 | 12.5 | 15.5 | 7.6 |
| Qwen3-VL | 8B | 44.5 | 27.6 | 28.7 | 12.8 |
| Kosmos 2.5 Chat | 1.3B | 0.7 | 0.0 | 0.0 | 0.0 |
| InternVL3.5 ft. | 8B | 58.7 | 27.6 | 37.6 | 52.7 |
| Qwen3-VL ft. | 8B | 62.3 | 35.8 | 43.4 | 60.6 |
| M3Grounder-I | 8B | 77.2 | 69.6 | 65.5 | 71.3 |
| M3Grounder-Q | 8B | 81.4 | 73.3 | 68.2 | 79.0 |
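To see why boxes struggle on skewed text, consider a toy comparison of a pixel mask against its own tight axis-aligned bounding box. This is a synthetic illustration only: the diagonal stroke, the grid size, and the helper names are assumptions, and the sketch computes plain mask IoU rather than the F1g metric, whose definition is not given in this section.

```python
import numpy as np


def iou(a, b):
    """Intersection over union of two boolean masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0


def bounding_box_mask(mask):
    """Fill the tight axis-aligned bounding box around a boolean mask."""
    rows, cols = np.nonzero(mask)
    box = np.zeros_like(mask, dtype=bool)
    box[rows.min():rows.max() + 1, cols.min():cols.max() + 1] = True
    return box


# A 3-pixel-thick diagonal stroke stands in for a skewed line of text.
size = 40
r, c = np.indices((size, size))
skewed_text = np.abs(r - c) <= 1

box = bounding_box_mask(skewed_text)
box_iou = iou(box, skewed_text)
mask_iou = iou(skewed_text, skewed_text)
```

Even a perfectly placed box reaches an IoU of only about 0.07 against the stroke, because the box must cover the full 40 by 40 area to enclose 118 text pixels, whereas a mask can match the region exactly.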
Block-level grounding achieves the highest F₁, confirming that wider spatial context is effectively leveraged.
| Model | Phrase F₁ | Line F₁ | Block F₁ |
|---|---|---|---|
| M3Grounder-I | 75.2 | 79.8 | 82.5 |
| M3Grounder-Q | 80.1 | 84.3 | 87.5 |