This page presents the emotion-related vocabulary of Finnic runosongs (regilaulud): 63,871 word forms grouped into 38 families across 26 semantic domains. The dataset was assembled by combining 12 detection methods. Most operate on English translations of the lemmatized runosong texts, produced by the DeepSeek R1 LLM; the substitution test is the only method that works directly on the Estonian verse templates.
All methods are merged with consensus scoring: a word form is confirmed as emotional only when ≥2 independent methods agree. Each family is then reviewed by Claude Opus against actual verse evidence, word-form translations, and corpus frequencies, and three layers of corrections are applied (per-form moves, lemma-level merges, recovered forms). Final corrections propagate to the RunoVerse lexicon and poem index.
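The consensus rule can be illustrated with a minimal sketch. The method names and word lists here are hypothetical placeholders (only the substitution test is named on this page), and the function shows only the "at least two methods agree" step, not the later Claude review or correction layers.

```python
from collections import defaultdict

def consensus(detections, min_methods=2):
    """Keep word forms flagged by at least `min_methods` detection methods.

    `detections` maps a method name to the set of word forms it flagged.
    Returns a dict: word form -> sorted list of agreeing methods.
    """
    votes = defaultdict(set)
    for method, forms in detections.items():
        for form in forms:
            votes[form].add(method)
    return {
        form: sorted(methods)
        for form, methods in votes.items()
        if len(methods) >= min_methods
    }

# Hypothetical method outputs, invented for illustration only.
detections = {
    "translation_keyword": {"rõõm", "kurb", "laul"},
    "substitution_test": {"rõõm", "kurb"},
    "embedding_neighbors": {"rõõm", "süda"},
}
confirmed = consensus(detections)
print(sorted(confirmed))
```

In this toy input, laul and süda are each flagged by a single method, so they are dropped at the consensus stage.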
E = core emotion word (e.g., rõõm ‘joy’). M = derivation of an emotion word (e.g., rõõmustama ‘to rejoice’). A = dialect variant or lemmatization error merged to its correct family (e.g., reem → rõõm). J = contextually related but not an emotion itself (e.g., laul ‘song’, süda ‘heart’). N (noise) candidates are excluded from the browser.
This page uses the emotion-pipeline substitution test: a seed word is removed from its verse to form a template, and every word in the corpus that fills the same template position becomes a candidate for the seed’s emotion family. Templates are scored by IDF so that specific, low-frequency slots count more than generic ones. The method is seed-driven and iterates to depth 5.
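A rough sketch of the substitution test, under stated assumptions: the verses are invented toy lines rather than corpus quotations, the weight formula log(1 + V / fillers) is an assumed stand-in for the pipeline's IDF scoring, and the function names are hypothetical. It shows how removing a seed yields a template, how slot fillers become candidates, and how iteration feeds new candidates back in as seeds.

```python
import math
from collections import defaultdict

SLOT = "_"

def build_templates(verses):
    """Map every one-slot template to the set of words that fill its slot."""
    fillers = defaultdict(set)
    for verse in verses:
        tokens = verse.split()
        for i, word in enumerate(tokens):
            template = tuple(tokens[:i] + [SLOT] + tokens[i + 1:])
            fillers[template].add(word)
    return fillers

def substitution_candidates(verses, seeds, depth=1):
    """Score words that share a template slot with a seed, iterating `depth` times."""
    fillers = build_templates(verses)
    vocab_size = len({w for v in verses for w in v.split()})
    scores = defaultdict(float)
    seen, frontier = set(seeds), set(seeds)
    for _ in range(depth):
        found = set()
        for words in fillers.values():
            if not (words & frontier):
                continue
            # Assumed IDF-style weight: slots with few fillers count more.
            weight = math.log(1 + vocab_size / len(words))
            for w in words - seen:
                scores[w] += weight
                found.add(w)
        if not found:
            break
        seen |= found
        frontier = found  # new candidates act as seeds in the next round
    return dict(scores)

# Toy verses, invented for illustration only.
verses = [
    "rõõm tuli meele", "mure tuli meele",
    "rõõm on suur", "lind on suur", "mets on suur", "meri on suur",
    "mure murrab südame", "kurbus murrab südame",
]
shallow = substitution_candidates(verses, {"rõõm"}, depth=1)
deep = substitution_candidates(verses, {"rõõm"}, depth=2)
```

In the toy data, mure outscores lind at depth 1 because the slot in "_ tuli meele" has fewer fillers than the generic "_ on suur". At depth 2, kurbus is reached through mure. Non-emotion words such as lind also enter through the generic slot, which is one reason candidates from a test like this would still need filtering by other methods.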
The separate Substitution Explorer is built from the main RunoVerse similarity pipeline: 701K substitution pairs produced by cross-poem (XP) and within-poem parallelism (CP) algorithms. It is not seed-driven and not restricted to emotion words — it surfaces any two words that substitute for each other in parallel verse lines. The two datasets are complementary (only ~13.6% overlap).
Each family page includes translation cognates — Estonian lemmas from the runosong corpus that share English translations with the family’s members. The RunoVerse lexicon (242K glosses, 1.5M mappings from DeepSeek AI translations) is inverted to build per-lemma gloss profiles. For each family, lemmas whose translation profiles overlap with the family’s are scored using an IDF-weighted overlap coefficient: common glosses (‘to’, ‘little’) are downweighted via idf = log(1 + N/(1 + df)), while specific glosses (‘contempt’, ‘mock’) count more. Candidates require ≥2 shared glosses and a minimum IDF sum. This discovers semantically related words that the substitution test may have missed.
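The cognate scoring can be sketched as follows. The idf formula is the one given above; the normalization (shared IDF mass divided by the smaller profile's IDF mass), the `min_idf_sum` value, and the toy gloss profiles are assumptions made for illustration, not the pipeline's actual settings.

```python
import math

def gloss_idf(profiles):
    """idf = log(1 + N / (1 + df)), with N lemmas and df lemmas per gloss."""
    n = len(profiles)
    df = {}
    for glosses in profiles.values():
        for g in glosses:
            df[g] = df.get(g, 0) + 1
    return {g: math.log(1 + n / (1 + d)) for g, d in df.items()}

def weighted_overlap(a, b, idf, min_shared=2, min_idf_sum=1.0):
    """IDF-weighted overlap coefficient between two gloss sets.

    Returns 0.0 when fewer than `min_shared` glosses are shared or when
    the shared IDF mass is below `min_idf_sum` (assumed threshold).
    """
    shared = a & b
    shared_sum = sum(idf[g] for g in shared)
    if len(shared) < min_shared or shared_sum < min_idf_sum:
        return 0.0
    smaller = min(sum(idf[g] for g in a), sum(idf[g] for g in b))
    return shared_sum / smaller

# Toy gloss profiles, invented for illustration only.
profiles = {
    "põlgama": {"to", "contempt", "scorn", "despise"},
    "pilkama": {"to", "mock", "scorn", "contempt"},
    "minema": {"to", "go", "little"},
    "väike": {"little", "small"},
}
idf = gloss_idf(profiles)
family = profiles["põlgama"]
score_pilkama = weighted_overlap(family, profiles["pilkama"], idf)
score_minema = weighted_overlap(family, profiles["minema"], idf)
```

Here minema shares only the common gloss 'to' with the family profile, so it fails the two-shared-glosses requirement, while pilkama passes on 'contempt' and 'scorn'.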
Corpus: Estonian (ERAB) and Finnish (SKVR, JR) runosong collections (7.3M + 7.4M tokens, 451K + 701K unique forms). Translations from the RunoVerse lexicon (DeepSeek R1 AI translations, 1.36M entries). BERT embeddings fine-tuned on runosong texts (190K words × 768 dimensions) are used for cognate scoring. Full pipeline documentation: EMOTION_LEXICON_PIPELINE_DOCUMENTATION.md.
Note: This vocabulary is computationally derived and AI-reviewed. It is intended as a research tool, not a definitive classification. Dialectal and archaic Estonian and Finnish (13th–19th century) present inherent lemmatization challenges; the estimated false-positive rate after consensus scoring and Claude review is ~9%.