From Scanned Pages to Semantic Graphs: Scalable Methods for Extracting Historical and Cultural Knowledge Across Heterogeneous Texts

dc.contributor.author: Malak, Piotr [en_US]
dc.contributor.author: Letowska, Agnieszka [en_US]
dc.contributor.author: Wodzinski, Jan [en_US]
dc.contributor.editor: Campana, Stefano [en_US]
dc.contributor.editor: Ferdani, Daniele [en_US]
dc.contributor.editor: Graf, Holger [en_US]
dc.contributor.editor: Guidi, Gabriele [en_US]
dc.contributor.editor: Hegarty, Zackary [en_US]
dc.contributor.editor: Pescarin, Sofia [en_US]
dc.contributor.editor: Remondino, Fabio [en_US]
dc.date.accessioned: 2025-09-05T20:25:48Z
dc.date.available: 2025-09-05T20:25:48Z
dc.date.issued: 2025
dc.description.abstract: We present a multilayered methodology for processing digitized historical texts, enabling cross-relational analysis across time periods, languages, and subject domains. Drawing from multiple DH platforms (Tsadikim, Two Enlightenments, Corporeality), we demonstrate an integrated pipeline combining adaptive OCR, noise-tolerant keyword extraction, and NER. Custom preprocessing and fuzzy matching techniques allow for meaningful text recovery from degraded scans in Polish, German, and Yiddish. Data are enriched with spatial and temporal metadata, indexed by topic and linked across projects. The resulting datasets support trend analysis, social network modeling, and discourse mapping. Our approach enables researchers to trace linguistic shifts and intellectual networks over centuries without manual review of source pages. This workflow facilitates interoperable exploration of cultural data and demonstrates how machine learning can assist in recovering semantic relationships from fragmented historical records. The methodology was tested on Enlightenment-era and early 20th-century journals, revealing both technical challenges and insights into evolving ideological, medical, and theological vocabularies. [en_US]
dc.description.sectionheaders: Extracting Knowledge from Digitized Assets
dc.description.seriesinformation: Digital Heritage
dc.identifier.doi: 10.2312/dh.20253133
dc.identifier.isbn: 978-3-03868-277-6
dc.identifier.pages: 9 pages
dc.identifier.uri: https://doi.org/10.2312/dh.20253133
dc.identifier.uri: https://diglib.eg.org/handle/10.2312/dh20253133
dc.publisher: The Eurographics Association [en_US]
dc.rights: Attribution 4.0 International License
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: CCS Concepts: Information systems → Digital libraries and archives; Computing methodologies → Natural language processing; Machine learning; Applied computing → Arts and humanities; Digital humanities; Human-centered computing → Visualization; Theory of computation → Ontologies
dc.subject: Information systems → Digital libraries and archives
dc.subject: Computing methodologies → Natural language processing
dc.subject: Machine learning
dc.subject: Applied computing → Arts and humanities
dc.subject: Digital humanities
dc.subject: Human-centered computing → Visualization
dc.subject: Theory of computation → Ontologies
dc.title: From Scanned Pages to Semantic Graphs: Scalable Methods for Extracting Historical and Cultural Knowledge Across Heterogeneous Texts [en_US]
Files
Original bundle
Name: dh20253133.pdf
Size: 1 MB
Format: Adobe Portable Document Format