Document Classification Pattern Recognition via Information Fusion: A systematic review of multimodal and multiview representation approaches

Most “document classification” problems look simple on paper: take a document, predict a label. In real systems, documents are richer objects—text, layout, visuals, metadata, and contextual signals all coexist. Information fusion is the umbrella term for methods that combine these signals to improve classification performance and robustness.

What problem does the paper tackle?

Although fusion is widely used, the evidence has been scattered across tasks, datasets, and model families. As a result, it has been hard to answer practical questions like:

  • When does fusion consistently help (and by how much)?
  • Should I invest in multimodal signals (e.g., text + layout + images) or focus on multiview representations (e.g., multiple text embeddings / feature sets)?
  • How reliable are published gains, given how often statistical validation is missing?

Multimodal vs. multiview—quick intuition

  • Multimodal fusion: combine different data modalities (e.g., textual content + page layout + figures/screenshots + metadata).
  • Multiview fusion: combine different “views” of the same modality (e.g., TF-IDF + embeddings + linguistic features; or multiple encoders for the same text).
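To make the distinction concrete, here is a minimal multiview late-fusion sketch in Python. The numbers are toy values (not drawn from any study in the review): imagine two models trained on different views of the same text, each emitting per-class scores, with fusion averaging them at the decision level.

```python
import numpy as np

# Hypothetical per-class scores for two documents, from two "views":
# view A could be a TF-IDF model, view B an embedding model.
view_a = np.array([[0.9, 0.1], [0.2, 0.8]])
view_b = np.array([[0.7, 0.3], [0.4, 0.6]])

# Late (decision-level) multiview fusion: average the per-class scores,
# then predict the class with the highest fused score.
fused = (view_a + view_b) / 2
labels = fused.argmax(axis=1)
print(labels.tolist())  # → [0, 1]
```

Multimodal fusion follows the same pattern, except the two score matrices would come from different modalities (e.g., a text encoder and a layout/vision encoder) rather than two views of the text.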

What we did

  • Systematic review of 139 primary studies on fusion for document classification.
  • A unifying framework connecting modern representation learning with classic information fusion concepts, so results across papers become comparable.
  • Qualitative synthesis to map trends: what gets fused, where in the pipeline fusion happens, and what evaluation practices dominate.
  • Random-effects meta-analysis to quantify average gains (rather than relying on isolated benchmark wins).
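For readers unfamiliar with random-effects pooling, the standard DerSimonian–Laird estimator can be sketched in a few lines of NumPy. This is a generic illustration of the technique, not the review's actual analysis code, and the function name is ours.

```python
import numpy as np

def dersimonian_laird(effects, variances):
    """Pool per-study effect sizes with a random-effects model.

    effects:   observed effect per study (e.g., accuracy gain in points)
    variances: within-study variance of each effect
    Returns (pooled_effect, standard_error, tau2).
    """
    y = np.asarray(effects, float)
    v = np.asarray(variances, float)
    w = 1.0 / v                                # fixed-effect weights
    mu_fe = np.sum(w * y) / np.sum(w)          # fixed-effect mean
    q = np.sum(w * (y - mu_fe) ** 2)           # Cochran's Q (heterogeneity)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)    # between-study variance
    w_re = 1.0 / (v + tau2)                    # random-effects weights
    mu_re = np.sum(w_re * y) / np.sum(w_re)    # pooled random-effects mean
    se = np.sqrt(1.0 / np.sum(w_re))           # standard error of pooled mean
    return mu_re, se, tau2
```

The key idea is tau², the between-study variance: when reported gains disagree more than their within-study variances explain, tau² grows and each study's weight shrinks, which is why a random-effects average is more honest than cherry-picking a single benchmark win.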

Key results (numbers you can use)

  1. Multimodal fusion improves accuracy by +5.28 percentage points on average (p=0.0016). The F1-score effect is positive in direction, but not statistically significant in the primary model.
  2. Multiview fusion provides smaller but reliable gains: accuracy +4.67 and F1 +3.08 percentage points, plus improved recall (all p<0.05).
  3. Reproducibility gap: statistical testing is rare—only 11.8% (multimodal) and 23.3% (multiview) of studies report it—making many claimed improvements harder to trust.

Practical takeaways for building real systems

The main engineering message is that “stronger fusion” does not automatically mean “more complex models.” What matters most is choosing a fusion strategy that matches the constraints and structure of your problem:

  • If you have truly complementary signals (e.g., layout/visual cues that text alone cannot capture), multimodal fusion tends to deliver the strongest accuracy improvements.
  • If you mainly operate within text, multiview fusion can be a cost-effective way to gain robustness by combining different representations/encoders/features.
  • Validate rigorously: use statistical tests, multiple runs, and careful baselines—otherwise “wins” may be noise, dataset artefacts, or evaluation leakage.
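As one concrete way to validate a claimed win, here is a paired bootstrap significance test comparing a fused model against a baseline on per-document correctness. This is a common choice for classifier comparison, not a test the review prescribes; the function name and parameters are ours.

```python
import numpy as np

def paired_bootstrap_pvalue(correct_a, correct_b, n_boot=10_000, seed=0):
    """One-sided paired bootstrap test of "model A beats model B".

    correct_a, correct_b: 0/1 per-document correctness on the SAME test set.
    Returns the fraction of bootstrap resamples in which A shows no gain,
    i.e., an estimate of p for the null "A is not better than B".
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(correct_a, float) - np.asarray(correct_b, float)
    n = diff.size
    idx = rng.integers(0, n, size=(n_boot, n))  # resample documents with replacement
    boot_means = diff[idx].mean(axis=1)         # accuracy difference per resample
    return float(np.mean(boot_means <= 0.0))
```

Because the resampling is paired at the document level, the test accounts for the two models being evaluated on the same examples, which a naive comparison of two accuracy numbers ignores.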

Resources

We share the accompanying materials on Zenodo: https://zenodo.org/records/17141560.

The original article is available online: https://authors.elsevier.com/a/1mh3h5a7-H2wPi

If you work on document-heavy pipelines (routing/triage, compliance, e-discovery, scholarly search, customer support automation) and you're considering fusion, the review and the materials above should give you a practical starting point.
