About the RunoVerse
What this page covers
This About page is the reference guide for the entire RunoVerse platform. It documents the three source corpora (SKVR, JR, ERAB), corpus-wide statistics (439K lemmas, 15.3M tokens, 292K poems), and the methodology behind the lemmatization and AI annotation pipelines.
Sections on this page
- Source corpora – descriptions and poem counts for each collection, with a visual bar chart.
- Lexicon statistics – counts for lemmas, word forms, tokens, cognate pairs, translation/etymology confidence, and gloss coverage.
- Data sources – explains the three source categories (Corpus Only, Both Sources, DeepSeek Only) and what the color-coded word forms and agreement badges mean.
- Similarity – documents all 5 poem-level and 4 verse-level similarity algorithms with their basis and coverage.
- Dictionary annotations – lists all 9 lexicographic sources (EMS, EKSS, IMS, ERLA, VMS, Seto, SMS, KKS, VKS).
- Lemmatization – describes the Estonian (EstNLTK) and Finnish (Omorfi/Voikko/Stanza) processing pipelines, plus the DeepSeek AI annotation layer.
Explore cards
The card grid below links to all 70+ pages in RunoVerse. Each card shows a short description of the tool. For longer explanations, see the five Feature Guides (Lexicon, Similarity, Poetics, Cross-Lingual, Corpus).
Navigating the site
Use the top navigation bar to reach the main pages (Dictionary, Reader, Similarity, About). The More dropdown provides access to every explorer page. The Site Map organizes all pages into 11 categories. The Dashboard offers a visual starting point with hero statistics and a research question guide.
What is this?
The RunoVerse is a combined word index of Finnish and Estonian runosong (folk poetry) corpora. It brings together lemmatized word data from three major collections, allowing cross-linguistic exploration of the shared Finnic poetic tradition.
Please note that the RunoVerse is under active development. The lemmatization of historical dialectal texts is inherently approximate, and the AI-generated translations, etymological analyses, and similarity metrics should be considered experimental. The statistics and counts shown may change as the data is refined. This tool is intended as an exploratory aid, not as a definitive reference.
Source corpora
| Collection | Language | Description |
|---|---|---|
| SKVR | Finnish | Suomen Kansan Vanhat Runot – published Kalevala-metre poetry. Finnish Literature Society (SKS). 89,247 poems. |
| JR | Finnish | Julkaisemattomat Runot – unpublished folk poetry from SKS folklore archives. 96,129 poems. |
| ERAB | Estonian | Eesti Regilaulude Andmebaas – Database of Estonian Runosongs. Estonian Folklore Archives, Estonian Literary Museum. 108,969 poems. |
Lexicon statistics
| Measure | Count | Notes |
|---|---|---|
| Unique lemmas | 439,746 | Distinct base forms across all corpora (incl. 183,137 DeepSeek-only) |
| Unique wordforms | 1,480,455 | Distinct word forms (types, not occurrences) found across all poem texts |
| Wordform–lemma mappings | 2,083,995 | Total mappings from inflected forms to lemmas (one wordform can map to multiple lemmas) |
| Total tokens | 15,264,640 | Total word occurrences in source texts |
| Poems | 292,092 | Unique poems with full verse texts available in the poem context viewer (294,345 total in source corpora; some excluded due to missing verse text data) |
| Finnish-only lemmas | 206,518 | Lemmas from Finnish corpus only (SKVR/JR collections) |
| Estonian-only lemmas | 100,835 | Lemmas from Estonian corpus only (ERAB) |
| Shared (Finnic) lemmas | 1,240 | Lemmas found in both Finnish and Estonian sources |
| Cognate pairs (ET↔FI) | 6,382 | Automatically discovered Estonian-Finnish cognate pairs based on translation overlap, etymological roots, and orthographic similarity (1,114 exact, 2,390 near-exact, 2,873 bridged, 5 orthographic) |
| Translation confidence | 192,236 | Lemmas with DeepSeek translation consistency score (8,446 strong, 11,069 good, 41,650 moderate, 131,071 low; 113,502 no data) |
| Etymology confidence | 211,919 | Lemmas with DeepSeek etymology consistency score (11,748 strong, 13,183 good, 46,078 moderate, 140,910 low; 93,819 no data) |
| Gloss coverage | 91.3% | Word forms with English translation (1,344,094 of 1,472,442), including 11,423 Claude Opus supplementary glosses |
| Corpus attestations | 15,264,640 | SKVR: 4,522,811 · JR: 3,398,967 · ERAB: 7,341,908 |
Statistics reflect the current state of the lemmatized data and may change as lemmatization is refined.
Data sources and source filter
Each lemma in the lexicon has been tagged with one of three source categories, reflecting how it was identified. The source filter dropdown in the main view lets you filter by these categories:
| Source | Lemmas | Meaning |
|---|---|---|
| Corpus Only | 5,217 | Lemma was identified by the corpus lemmatization pipeline. None of the word forms listed under this lemma were matched to DeepSeek annotations during the merge. However, the lemma string itself may still appear as a word form in DeepSeek data, which means some “Corpus Only” entries can still have AI-generated translations visible via the A–Z browse. |
| Both Sources | 251,392 | Lemma comes from the corpus pipeline, and at least one of its word forms also appears in the DeepSeek annotations (possibly under a different lemma). These entries typically include AI-generated translations and may have cross-references (dsLemma) to alternative lemmatizations. Word forms in “Both Sources” entries are color-coded: green when both systems agree on the lemma, amber when DeepSeek assigns a different lemma, and gray when the word form is not in the DeepSeek data. |
| DeepSeek Only | 183,137 | Lemma exists only in the DeepSeek annotations. The underlying word forms often appear in the corpus under different lemmas (96% of cases), but this particular lemmatization is unique to the AI analysis. |
The source categories reflect word-form-level overlap between the two lemmatization systems, not whether an entry has translations. Because the corpus pipeline and DeepSeek sometimes lemmatize the same word forms differently, a word form can belong to a “Corpus Only” lemma while also appearing independently in the DeepSeek data under a different lemma. The “Both Sources” category captures entries where the same word forms were recognized by both systems.
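The categorization rule described above can be sketched as follows. This is a minimal illustration, not the project's merge code: the function name, the two dictionaries (lemma to set of word forms, one per system), and the toy data are all assumptions made for the example.

```python
from typing import Dict, Set


def source_category(lemma: str,
                    corpus_forms: Dict[str, Set[str]],
                    ds_forms: Dict[str, Set[str]]) -> str:
    """Classify a lemma by word-form-level overlap between the two systems.

    corpus_forms / ds_forms are hypothetical mappings: lemma -> word forms.
    """
    if lemma not in corpus_forms:
        return "DeepSeek Only" if lemma in ds_forms else "Unknown"
    # Every word form DeepSeek annotated, regardless of the lemma it assigned.
    all_ds_wordforms = set().union(*ds_forms.values())
    if corpus_forms[lemma] & all_ds_wordforms:
        return "Both Sources"
    return "Corpus Only"


# Invented toy data, for illustration only.
corpus_forms = {"kuld": {"kulda", "kullast"}, "neiu": {"neidu"}}
ds_forms = {"kuld": {"kulda"}, "kulla": {"kullast"}}

print(source_category("kuld", corpus_forms, ds_forms))
print(source_category("neiu", corpus_forms, ds_forms))
print(source_category("kulla", corpus_forms, ds_forms))
```

In this toy data the word form "kullast" sits under the corpus lemma "kuld" and under the DeepSeek-only lemma "kulla", which is the situation the paragraph above describes.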
The agreement badge in the DeepSeek tab shows a ratio like “30/35 agree +15 n/a”: 30 of the 35 word forms covered by DeepSeek have the same lemma in both systems, and a further 15 word forms are absent from the DeepSeek data. Hover over the badge for a full breakdown.
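As a rough sketch of how such a badge string could be derived, assume each word form of an entry is mapped to the lemma DeepSeek assigned to it, or to `None` when the form is missing from the DeepSeek data. The function name and input shape are hypothetical.

```python
from typing import Dict, Optional


def agreement_badge(corpus_lemma: str,
                    ds_lemma_by_form: Dict[str, Optional[str]]) -> str:
    """Build a badge such as '30/35 agree +15 n/a' for one lexicon entry."""
    # Word forms that DeepSeek annotated at all.
    covered = [l for l in ds_lemma_by_form.values() if l is not None]
    # Covered forms where DeepSeek assigned the same lemma as the corpus.
    agree = sum(1 for l in covered if l == corpus_lemma)
    missing = len(ds_lemma_by_form) - len(covered)
    badge = f"{agree}/{len(covered)} agree"
    if missing:
        badge += f" +{missing} n/a"
    return badge
```

With 30 agreeing forms, 5 forms that DeepSeek lemmatized differently, and 15 forms it did not cover, this returns the string quoted above.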
DeepSeek AI annotations
A subset of the corpus was independently annotated using DeepSeek, a large language model, to provide additional linguistic analysis. The AI annotations include:
- English translations of Estonian and Finnish word forms
- Etymological notes and cognate identification
- Morphological descriptions (case, number, tense, etc.)
- Part-of-speech tagging
| Measure | Count | Notes |
|---|---|---|
| DeepSeek tokens | 5,962,070 | AI-annotated token occurrences (ET: 2,867,388 + FI: 3,094,682) |
| DeepSeek-only lemmas | 183,137 | Lemmas unique to the AI analysis |
| English translations | 241,141 | Unique English terms extracted from AI annotations, browsable via A–Z (1,252,781 total mappings) |
| Cross-references | 91,754 | Entries linking to alternative lemmatizations between corpus and DeepSeek |
AI-generated annotations are provided as supplementary material and have not been manually verified. They should be used with appropriate caution, particularly for etymological claims and translations of rare dialectal forms.
Similarity and embedding data
The lexicon includes two word-level similarity systems to help explore relationships between word forms:
- Word form similarity – Edit-distance and phonological similarity between inflected forms across the corpus. Covers 1,166,348 word forms with ranked nearest neighbors and lemma-agreement indicators.
- BERT embeddings – Contextual nearest neighbors from a BERT model fine-tuned on Estonian runosong texts. Provides 190,975 query lemmas with their 10 nearest semantic neighbors, capturing meaning-based rather than form-based similarity.
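The form-based side of this can be illustrated with plain Levenshtein edit distance. This is only a sketch under assumptions: the source does not specify the exact distance measure or how the phonological component is weighted, so neither is reproduced here, and the function names are invented.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def nearest_forms(query: str, forms: Iterable[str], k: int = 10) -> List[str]:
    """Return the k forms closest to query, ties broken alphabetically."""
    scored = sorted((levenshtein(query, f), f) for f in forms if f != query)
    return [f for _, f in scored[:k]]
```

For example, `nearest_forms("kulda", ["kulta", "kuld", "neiu"], k=2)` ranks the two one-edit neighbors ahead of the unrelated form. A brute-force scan like this would not scale to more than a million word forms; it only illustrates the ranking idea.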
Poem similarity
Five poem-level similarity algorithms identify related poems across the 292,092-poem corpus. Results are available in the Poem Reader (Related Poems panel) and the standalone Similarity Explorer with side-by-side comparison, network graphs, and geographic/temporal analytics.
| Algorithm | Basis | Description |
|---|---|---|
| TF-IDF Lemma | Lemma-level | Cosine similarity on TF-IDF vectors of lemmatized poem texts. Captures thematic similarity through shared vocabulary, weighted by corpus-level term importance. Top 50 neighbors per poem. |
| Wordform Overlap (Jaccard) | Exact wordforms | Jaccard index (\|A∩B\| / \|A∪B\|) over raw wordform sets. Identifies poems sharing exact surface forms, useful for detecting formulaic lines and direct textual parallels. |
| Thematic (Translation-pivot) | Cross-lingual | Boolean-IDF cosine similarity over English translations derived from DeepSeek annotations, with lemma-level fallback for improved coverage. Enables cross-lingual comparison between Estonian and Finnish poems via a shared semantic space. Top 50 neighbors per poem. |
| Alignment | Character n-gram | Verse sequence alignment using character bigram cosine similarity and Wagner-Fischer dynamic programming, from the FILTER project (Janicki, Kallio & Sarv 2023). Covers 256,970 poems across SKVR, JR, KR, and ERAB. Captures structural similarity — poems that follow the same verse order score high. Shows aligned verse pair excerpts for top matches. Top 50 neighbors per poem. |
| Verse-level RRF | Verse-level fusion | Fuses Jaccard, TF-IDF, Translation, and CharBigram similarity at the verse level using Average-Best-Per-Verse aggregation, then combines all four via Reciprocal Rank Fusion (k=60) into a single poem-level ranking. Shows T/J/Tr/C algorithm contribution badges. |
The Similarity Explorer shows cross-algorithm agreement badges (BOTH) when poems appear in multiple algorithms' results, and ET↔FI badges for cross-lingual matches in the Translation-pivot, Alignment, and Verse-level RRF algorithms.
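The fusion step of the Verse-level RRF algorithm can be illustrated with the standard Reciprocal Rank Fusion formula, in which each candidate receives 1 / (k + rank) from every ranking it appears in. The sketch below assumes the poem-level rankings have already been produced; the function name, the dictionary keys, and the poem identifiers are invented, and only k=60 comes from the table above.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: Dict[str, List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of poem ids into one ranking.

    rankings maps an algorithm name to its poem ids, best match first.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, poem_id in enumerate(ranked_ids, start=1):
            scores[poem_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


fused = reciprocal_rank_fusion({
    "tfidf": ["p1", "p2", "p3"],
    "jaccard": ["p2", "p1"],
    "translation": ["p2", "p3"],
    "charbigram": ["p3", "p2"],
})
print(fused)
```

A poem that appears near the top of several lists outranks one that tops a single list, which is why "p2" comes first in this toy input.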
Verse similarity
Four algorithms (Jaccard, TF-IDF, Translation-pivot, CharBigram) operate at the individual verse level across 4.29 million verse occurrences. Each verse is compared against all verses in other poems, with up to 20 nearest neighbors stored per algorithm.
| Metric | Value |
|---|---|
| Total verses indexed | 4,291,553 |
| Poems with verse data | 289,702 |
| Unique verse types (search index) | 2,906,535 |
| Formulaic pattern clusters | 200 |
Verse similarity is available in the Poem Reader (click the expand arrow on any verse line) and the Verse Similarity Explorer, which also provides full-text verse search and a browser for the top 200 formulaic patterns – recurring verse lines ranked by cluster size across the Finnish and Estonian corpora.
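To make the verse-level comparison concrete, here is a small sketch of one of the four measures (Jaccard) applied to verses, keeping up to 20 neighbors from other poems. The whitespace tokenization, the tuple layout, and the sample verses are assumptions for illustration; a brute-force loop like this would not be practical across 4.29 million verses.

```python
from typing import List, Set, Tuple

Verse = Tuple[str, str]  # (poem_id, verse text); hypothetical layout


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def verse_neighbors(query: Verse, verses: List[Verse],
                    top_k: int = 20) -> List[Tuple[float, str, str]]:
    """Rank verses from other poems by word-form overlap with the query."""
    q_poem, q_text = query
    q_tokens = set(q_text.lower().split())
    scored = []
    for poem_id, text in verses:
        if poem_id == q_poem:
            continue  # only verses in other poems are compared
        score = jaccard(q_tokens, set(text.lower().split()))
        if score > 0:
            scored.append((score, poem_id, text))
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored[:top_k]


# Invented sample verses.
verses = [
    ("A", "kuldne neiu tuli"),
    ("B", "kuldne neiu läks"),
    ("C", "hõbe neiu tuli koju"),
]
print(verse_neighbors(("A", "kuldne neiu tuli"), verses))
```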
Explore the lexicon
Feature guides
The RunoVerse contains over 30 interconnected tools for exploring the Finnic runosong tradition. These guides describe each tool in detail — what it shows, what data powers it, and how to use it.
Key features
- Search by lemma, word form, or English translation with diacritics-insensitive matching
- Filter by language, part of speech, and data source (corpus, DeepSeek, or both)
- Dictionary annotations from 9 Estonian and Finnish lexicographic sources
- DeepSeek AI translations, etymology, and morphological descriptions for 165K poems
- Five poem similarity algorithms with network graphs, geographic maps, and temporal analytics
- Verse-level similarity across 4.29M verses with inline expansion and full-text search
- Cross-lingual exploration: 6,382 cognate pairs, 1,240 shared lemmas, 49K etymology families
- Poetic analysis: alliteration patterns, semantic parallelism, formulaic phrases, and meter
- Poem reader with word-level glosses, POS tags, and per-verse similarity
- Geographic and temporal corpus analysis across 803 collection places and four centuries
- Bookmarkable deep links, keyboard navigation, and CSV export
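Diacritics-insensitive matching, mentioned in the first item above, is commonly implemented by Unicode decomposition followed by removal of combining marks. The sketch below shows that general technique with Python's standard `unicodedata` module; it is an assumption about the approach, not a description of how the site's search is actually built, and the function names are invented.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Lowercase text and strip combining marks (õ -> o, ä -> a, š -> s)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches(query: str, candidate: str) -> bool:
    """True when the two strings are equal after folding."""
    return fold_diacritics(query) == fold_diacritics(candidate)
```

With this folding, a query typed as "sosar" matches the lemma "sõsar". The trade-off is that letters which are distinct in Estonian and Finnish, such as "a" and "ä", are also conflated, so folded matching suits search recall rather than exact lookup.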
Dictionary annotations
Word entries are enriched with definitions from Estonian and Finnish lexicographic sources:
- EMS – Eesti murrete sõnaraamat (Dictionary of Estonian Dialects). Institute of the Estonian Language.
- EKSS – Eesti keele seletav sõnaraamat (Explanatory Dictionary of Estonian). Institute of the Estonian Language.
- IMS – Ida-Eesti murdesõnastik (Eastern Estonian Dialect Dictionary). Institute of the Estonian Language.
- ERLA – Harva ja vähem-kasutatavate sõnade sõnastik (Glossary of Rare Folk-Song Words). Estonian Literary Museum.
- VMS – Vähemtuntud murdesõnade seletusi (Glossary of Lesser-Known Dialect Words). Estonian Literary Museum.
- Seto – Seto sõnastik (Seto Dictionary). Inge Käsi, Institute of the Estonian Language, 2016.
- SMS – Suomen murteiden sanakirja (Dictionary of Finnish Dialects). Kotimaisten kielten keskus (Kotus). CC BY 4.0.
- KKS – Karjalan kielen sanakirja (Dictionary of the Karelian Language). Kotimaisten kielten keskus (Kotus). CC BY 4.0.
- VKS – Vanhan kirjasuomen sanakirja (Dictionary of Old Literary Finnish). Kotimaisten kielten keskus (Kotus). CC BY 4.0.
Lemmatization
Estonian texts were lemmatized using EstNLTK morphological analysis combined with multiple lexical resources (EMS, EKSS, VES, ERLA, and others), expert manual annotations (37% of the corpus), and iterative automated correction cycles.
Finnish texts were lemmatized using a combinatory approach with a multi-tier fallback chain including Omorfi, Voikko, and Stanza, supplemented by a dialectal dictionary derived from Suomen murteiden sanakirja (SMS).
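A multi-tier fallback chain of this kind can be sketched generically: each tier is tried in turn and the first one that returns a lemma wins. The code below deliberately uses dictionary lookups as stand-ins rather than calling Omorfi, Voikko, or Stanza, and both the tier order and the tier names are assumptions, since the text above does not state them.

```python
from typing import Callable, Iterable, Optional, Tuple

# An analyzer takes a word form and returns a lemma, or None if it has no analysis.
Analyzer = Callable[[str], Optional[str]]


def lemmatize_with_fallback(
    word_form: str, tiers: Iterable[Tuple[str, Analyzer]]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (lemma, tier_name) from the first tier that yields a lemma."""
    for name, analyze in tiers:
        lemma = analyze(word_form)
        if lemma:
            return lemma, name
    return None, None


# Hypothetical stand-ins for real analyzers, with invented toy entries.
tier_1 = {"kullan": "kulta"}
tier_2 = {"neitoa": "neito"}
dialect_dictionary = {"kultoa": "kulta"}

tiers = [
    ("tier_1", tier_1.get),
    ("tier_2", tier_2.get),
    ("dialect_dictionary", dialect_dictionary.get),
]

print(lemmatize_with_fallback("kultoa", tiers))
```

Returning the tier name alongside the lemma makes it possible to audit later how much of the corpus each tier actually resolved.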
In addition, approximately 165,000 poems were independently annotated using DeepSeek-R1, a large language model, run on the LUMI supercomputer. The AI analysis produced lemmatizations, English translations, morphological descriptions, and etymological roots for each word token. These annotations yielded 183,137 additional lemmas not present in the corpus pipeline, and provided cross-references between the two lemmatization systems for entries where both recognized the same word forms.
References and acknowledgements
Corpora
- SKVR – Finnish Literature Society (Suomalaisen Kirjallisuuden Seura, SKS). Suomen Kansan Vanhat Runot. Digital corpus. skvr.fi. CC BY 4.0.
- JR – Finnish Literature Society (SKS). Julkaisemattomat Runot. Available within the SKVR database.
- ERAB – Oras, J.; Saarlo, L.; Sarv, M.; Labi, K.; Uus, M.; Šmitaite, R. (comps.). Eesti Regilaulude Andmebaas. Estonian Folklore Archives, Estonian Literary Museum. 2003–present. folklore.ee/regilaul/andmebaas
Lemmatization tools
- EstNLTK – Laur, S.; Orasmaa, S.; Särg, D.; Tammo, P. (2020). EstNLTK 1.6: Remastered Estonian NLP Pipeline. Proceedings of LREC 2020, pp. 7154–7162. github.com/estnltk/estnltk
- Vabamorf – Kaalep, H. J.; Vaino, T. (2001). Complete morphological analysis in the linguist’s toolbox. Congressus Nonus Internationalis Fenno-Ugristarum, 5, pp. 9–16.
- Omorfi – Pirinen, T. A. (2015). Omorfi – Free and open source morphological lexical database for Finnish. Proceedings of NODALIDA 2015, pp. 313–315. github.com/flammie/omorfi
- Voikko – Pitkänen, H. Voikko – Free linguistic software for Finnish. voikko.puimula.org
- Stanza – Qi, P.; Zhang, Y.; Zhang, Y.; Bolton, J.; Manning, C. D. (2020). Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. Proceedings of ACL 2020: System Demonstrations. stanfordnlp.github.io/stanza
- DeepSeek-R1 – DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948. deepseek.com
Lexical resources
- SMS – Kotimaisten kielten keskus (Kotus). Suomen murteiden sanakirja (Dictionary of Finnish Dialects). kaino.kotus.fi/sms. CC BY 4.0.
- KKS – Kotimaisten kielten keskus (Kotus). Karjalan kielen sanakirja (Dictionary of the Karelian Language). kaino.kotus.fi/kks. CC BY 4.0.
- VKS – Kotimaisten kielten keskus (Kotus). Vanhan kirjasuomen sanakirja (Dictionary of Old Literary Finnish). CC BY 4.0.
- EMS – Institute of the Estonian Language. Eesti murrete sõnaraamat. eki.ee/dict/ems
- EKSS – Institute of the Estonian Language. Eesti keele seletav sõnaraamat. eki.ee/dict/ekss
- VES – Võro Institute. Võro-eesti synaraamat (comp. Jüvä Sullõv). folklore.ee/Synaraamat
- ERLA – Estonian Literary Museum. Harva ja vähem-kasutatavate sõnade sõnastik. folklore.ee/laulud/erla
Contact
For questions about this lexicon, contact kaarel.veskis@kirmus.ee