From Scanned Pages to Semantic Graphs: Scalable Methods for Extracting Historical and Cultural Knowledge Across Heterogeneous Texts

Malak, Piotr; Letowska, Agnieszka; Wodzinski, Jan

dc.contributor.author	Malak, Piotr	en_US
dc.contributor.author	Letowska, Agnieszka	en_US
dc.contributor.author	Wodzinski, Jan	en_US
dc.contributor.editor	Campana, Stefano	en_US
dc.contributor.editor	Ferdani, Daniele	en_US
dc.contributor.editor	Graf, Holger	en_US
dc.contributor.editor	Guidi, Gabriele	en_US
dc.contributor.editor	Hegarty, Zackary	en_US
dc.contributor.editor	Pescarin, Sofia	en_US
dc.contributor.editor	Remondino, Fabio	en_US
dc.date.accessioned	2025-09-05T20:25:48Z
dc.date.available	2025-09-05T20:25:48Z
dc.date.issued	2025
dc.description.abstract	We present a multilayered methodology for processing digitized historical texts, enabling cross-relational analysis across time periods, languages, and subject domains. Drawing from multiple DH platforms (Tsadikim, Two Enlightenments, Corporeality), we demonstrate an integrated pipeline combining adaptive OCR, noise-tolerant keyword extraction, and NER. Custom preprocessing and fuzzy matching techniques allow for meaningful text recovery from degraded scans in Polish, German, and Yiddish. Data are enriched with spatial and temporal metadata, indexed by topic and linked across projects. The resulting datasets support trend analysis, social network modeling, and discourse mapping. Our approach enables researchers to trace linguistic shifts and intellectual networks over centuries without manual review of source pages. This workflow facilitates interoperable exploration of cultural data and demonstrates how machine learning can assist in recovering semantic relationships from fragmented historical records. The methodology was tested on Enlightenment-era and early 20th-century journals, revealing both technical challenges and insights into evolving ideological, medical, and theological vocabularies.	en_US
dc.description.sectionheaders	Extracting Knowledge from Digitized Assets
dc.description.seriesinformation	Digital Heritage
dc.identifier.doi	10.2312/dh.20253133
dc.identifier.isbn	978-3-03868-277-6
dc.identifier.pages	9 pages
dc.identifier.uri	https://doi.org/10.2312/dh.20253133
dc.identifier.uri	https://diglib.eg.org/handle/10.2312/dh20253133
dc.publisher	The Eurographics Association	en_US
dc.rights	Attribution 4.0 International License
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	CCS Concepts: Information systems → Digital libraries and archives; Computing methodologies → Natural language processing; Machine learning; Applied computing → Arts and humanities; Digital humanities; Human-centered computing → Visualization; Theory of computation → Ontologies
dc.subject	Information systems → Digital libraries and archives
dc.subject	Computing methodologies → Natural language processing
dc.subject	Machine learning
dc.subject	Applied computing → Arts and humanities
dc.subject	Digital humanities
dc.subject	Human-centered computing → Visualization
dc.subject	Theory of computation → Ontologies
dc.title	From Scanned Pages to Semantic Graphs: Scalable Methods for Extracting Historical and Cultural Knowledge Across Heterogeneous Texts	en_US

Files

Name: dh20253133.pdf
Size: 1 MB
Format: Adobe Portable Document Format

Collections

Track 05 – Analysis and Interpretation