How well do dictionaries cover the runosong vocabulary?
What is dictionary coverage?
A word form is considered "covered" if its lemma (base form) appears as a headword in at least one of the nine dictionaries. For example, the word form laulnud is covered because its lemma laulma appears in EKSS. Coverage is measured at the word-form level: each of the 1,166,348 unique word forms is either covered or not.
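The rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the RunoVerse implementation: the names `coverage`, `form_to_lemma`, and `dictionaries` are hypothetical, and apart from the laulnud/laulma pair the toy data is invented.

```python
from typing import Dict, Set


def coverage(form_to_lemma: Dict[str, str],
             dictionaries: Dict[str, Set[str]]) -> float:
    """Return the share of word forms whose lemma is a headword
    in at least one dictionary (hypothetical helper)."""
    if not form_to_lemma:
        return 0.0
    # Union of all headwords across all dictionaries.
    headwords: Set[str] = set().union(*dictionaries.values())
    covered = sum(1 for lemma in form_to_lemma.values() if lemma in headwords)
    return covered / len(form_to_lemma)


# Toy data: two forms of "laulma" plus one placeholder form with no entry.
form_to_lemma = {
    "laulnud": "laulma",
    "laulis": "laulma",
    "placeholder_form": "placeholder_lemma",
}
dictionaries = {"EKSS": {"laulma"}, "EMS": set()}

share = coverage(form_to_lemma, dictionaries)
print(f"{share:.1%}")
```

Each word form contributes exactly once to the result, regardless of how many dictionaries contain its lemma.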
Which dictionaries are included?
Six Estonian dictionaries: EKSS (standard explanatory), EMS (dialect), IMS (eastern dialect), ERLA (rare/archaic words), VMS (riddles), Seto (Seto dialect). Three Finnish dictionaries: KKS (Karelian), SMS (Finnish dialects), VKS (old literary Finnish). Together they cover both traditions in the runosong corpus.
How to read the coverage bars
Each bar shows the number and percentage of word forms whose lemma has an entry in that dictionary. Bars can be sorted by coverage, alphabetically, or grouped by language. A word form may be counted in multiple dictionaries if its lemma appears in more than one.
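The three sort modes can be sketched as below, using the per-dictionary counts reported on this page. The function name `sort_bars` and the mode strings are assumptions made for illustration; the page's actual sorting code is not shown in the source.

```python
from typing import Dict, List, Tuple

# Word-form counts per dictionary, as reported on this page.
COUNTS: Dict[str, int] = {
    "KKS": 524_930, "SMS": 500_972, "VKS": 399_068,
    "EKSS": 296_489, "EMS": 226_227, "IMS": 192_728,
    "ERLA": 128_383, "VMS": 85_405, "Seto": 19_774,
}
LANGUAGE: Dict[str, str] = {
    "KKS": "Finnish", "SMS": "Finnish", "VKS": "Finnish",
    "EKSS": "Estonian", "EMS": "Estonian", "IMS": "Estonian",
    "ERLA": "Estonian", "VMS": "Estonian", "Seto": "Estonian",
}


def sort_bars(mode: str = "coverage") -> List[Tuple[str, int]]:
    """Order the bars by one of three hypothetical mode names."""
    items = list(COUNTS.items())
    if mode == "coverage":
        # Largest dictionary first.
        return sorted(items, key=lambda kv: -kv[1])
    if mode == "alphabetical":
        # Case-insensitive, so "Seto" sorts before "SMS".
        return sorted(items, key=lambda kv: kv[0].lower())
    if mode == "language":
        # Group by language, then largest first within each group.
        return sorted(items, key=lambda kv: (LANGUAGE[kv[0]], -kv[1]))
    raise ValueError(f"unknown sort mode: {mode}")


for name, count in sort_bars("language"):
    print(f"{LANGUAGE[name]:<9} {name:<5} {count:>9,}")
```

The nine counts add up to well over 1,166,348, which is consistent with a word form being counted once for every dictionary that contains its lemma.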
What does the overlap matrix show?
Each cell shows how many word forms are shared between two dictionaries. The diagonal shows each dictionary's total. Hover over a cell to see the overlap as a percentage. High overlap between Estonian dictionaries and between Finnish dictionaries is expected; cross-language overlap is much lower.
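A matrix of this kind could be computed roughly as follows. This is a sketch with invented toy data; `overlap_matrix` and `overlap_share` are hypothetical names. The source does not say which total the hover percentage is relative to, so the row dictionary's total is used here purely as an assumption.

```python
from itertools import product
from typing import Dict, Set, Tuple

Matrix = Dict[Tuple[str, str], int]


def overlap_matrix(covered: Dict[str, Set[str]]) -> Matrix:
    """Count shared word forms for every ordered pair of dictionaries.
    The diagonal (a, a) is each dictionary's own total."""
    return {
        (a, b): len(covered[a] & covered[b])
        for a, b in product(covered, repeat=2)
    }


def overlap_share(matrix: Matrix, row: str, col: str) -> float:
    """Overlap as a share of the row dictionary's total (assumed denominator)."""
    total = matrix[(row, row)]
    return matrix[(row, col)] / total if total else 0.0


# Toy data: sets of covered word forms per dictionary.
covered = {
    "A": {"f1", "f2", "f3"},
    "B": {"f2", "f3", "f4", "f5"},
    "C": {"f9"},
}
matrix = overlap_matrix(covered)
print(matrix[("A", "B")], overlap_share(matrix, "A", "B"))
```

The counts are symmetric, but the percentage is not: under this assumed denominator, the page's SMS ∩ VKS value of 317,243 would be about 63.3% of the 500,972 SMS forms and about 79.5% of the 399,068 VKS forms.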
Word lookup
Use the search box to check which dictionaries cover a specific word. The lookup checks the annotation shards for dictionary matches and shows the first available definition.
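The lookup behaviour described above might look like this in outline. The format of the annotation shards is not described in the source, so plain in-memory dictionaries stand in for them here; `lookup`, `form_to_lemma`, and the placeholder definitions are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def lookup(word_form: str,
           form_to_lemma: Dict[str, str],
           dictionaries: Dict[str, Dict[str, str]]
           ) -> Tuple[List[str], Optional[str]]:
    """Return the dictionaries that contain the form's lemma and the
    first non-empty definition among them (None if there is none)."""
    lemma = form_to_lemma.get(word_form)
    if lemma is None:
        return [], None
    matches = [name for name, entries in dictionaries.items() if lemma in entries]
    first_definition = next(
        (dictionaries[name][lemma] for name in matches if dictionaries[name][lemma]),
        None,
    )
    return matches, first_definition


# Toy data: EMS lists the lemma without a definition, EKSS has one.
form_to_lemma = {"laulnud": "laulma"}
dictionaries = {
    "EMS": {"laulma": ""},
    "EKSS": {"laulma": "to sing"},
    "Seto": {},
}

matches, definition = lookup("laulnud", form_to_lemma, dictionaries)
print(matches, definition)
```

Skipping empty definitions is one possible reading of "first available definition"; the real lookup may order or filter the dictionaries differently.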
Why is coverage as high as 83.5% when only 29% of lemmas have entries?
High-frequency lemmas (common verbs, nouns, particles) are well represented in dictionaries, and each of them tends to surface in many distinct inflected and dialectal word forms, so a minority of lemmas accounts for most forms. The uncovered 16.5% consists mostly of rare hapax legomena, exclamations, refrain syllables, and highly dialectal compounds.
RunoVerse – dictionary coverage analysis across nine dictionaries · 1,166,348 unique word forms
1,166,348 unique word forms
439,746 lemmas
83.5% of word forms have a lemma with a dictionary entry
9 dictionaries (6 Estonian + 3 Finnish)
Largest dictionary: KKS, 524,930 word forms (51.8%)
Largest overlap: SMS ∩ VKS, 317,243 shared forms
Uncovered vocabulary: 16.5%, 192,246 forms without any dictionary entry
Coverage layers
Dictionary coverage is measured via lemma lookup: a word form is covered if its lemma has a dictionary entry.
Dictionary coverage (via lemma): 974,102 (83.5%)
Dictionary or DeepSeek (via lemma): 1,163,350 (99.7%)
Only 29% of lemmas (128,919 of 439,746) have a dictionary entry, but these tend to be high-frequency lemmas, so 83.5% of word forms are covered via lemma lookup. Combined with DeepSeek AI translations, coverage reaches 99.7%.
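The headline percentages can be re-derived from the counts reported on this page. The short check below is only an arithmetic sketch; the constant names and the `pct` helper are introduced here for illustration.

```python
# Figures as reported on this page.
TOTAL_FORMS = 1_166_348
COVERED_FORMS = 974_102
UNCOVERED_FORMS = 192_246
COVERED_WITH_AI = 1_163_350
TOTAL_LEMMAS = 439_746
LEMMAS_WITH_ENTRY = 128_919


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


form_coverage = pct(COVERED_FORMS, TOTAL_FORMS)        # dictionary coverage
uncovered = pct(UNCOVERED_FORMS, TOTAL_FORMS)          # no dictionary entry
combined = pct(COVERED_WITH_AI, TOTAL_FORMS)           # dictionary or DeepSeek
lemma_coverage = pct(LEMMAS_WITH_ENTRY, TOTAL_LEMMAS)  # lemmas with an entry

print(form_coverage, uncovered, combined, lemma_coverage)
```

Covered and uncovered forms sum exactly to the total (974,102 + 192,246 = 1,166,348), and the 0.3% left after adding DeepSeek translations corresponds to 2,998 word forms.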
Per-dictionary word form coverage
Each bar shows how many unique word forms the dictionary covers. Word forms may appear in multiple dictionaries.
Finnish dictionaries
• KKS (Karelian): 524,930 (51.8%)
• SMS (Finnish dialects): 500,972 (49.4%)
• VKS (Old literary language): 399,068 (39.4%)
Estonian dictionaries
• EKSS (Explanatory): 296,489 (29.2%)
• EMS (Dialect dictionary): 226,227 (22.3%)
• IMS (Eastern dialect): 192,728 (19.0%)
• ERLA (Rare words): 128,383 (12.7%)
• VMS (Riddles dictionary): 85,405 (8.4%)
• Seto (Seto dictionary): 19,774 (2.0%)
Finnish dictionaries (KKS, SMS, VKS) provide the largest coverage because the Finnic runosong corpus contains more Finnish poems by volume. EKSS is the broadest Estonian dictionary, covering standard vocabulary that also appears in runosongs.
Estonian vs Finnish dictionary coverage (chart; series: 6 Estonian dictionaries, 3 Finnish dictionaries)
Overall vocabulary breakdown
974,102 word forms – lemma has a dictionary entry
192,246 word forms – lemma has no dictionary entry
The uncovered 16.5% consists mainly of:
• Rare hapax legomena (single-occurrence forms)
• Exclamations, refrain particles, onomatopoeia
• Dialectal variants and compound words
Dictionary overlap matrix
How many word forms two dictionaries share (intersection). Diagonal shows each dictionary's total coverage.
∩         EKSS      EMS      IMS     ERLA      VMS     Seto      SMS      KKS      VKS
EKSS   296,489  208,758  187,662  122,837   81,605   17,180   82,365   78,001   48,438
EMS    208,758  226,227  146,174   91,952   64,213   16,888   68,780   54,217   42,252
IMS    187,662  146,174  192,728   82,753   58,283   15,467   54,395   52,995   33,594
ERLA   122,837   91,952   82,753  128,383   59,928      982   49,839   44,828   29,077
VMS     81,605   64,213   58,283   59,928   85,405    3,252   33,469   29,292   18,386
Seto    17,180   16,888   15,467      982    3,252   19,774    1,177    1,385      948
SMS     82,365   68,780   54,395   49,839   33,469    1,177  500,972  242,118  317,243
KKS     78,001   54,217   52,995   44,828   29,292    1,385  242,118  524,930  237,775
VKS     48,438   42,252   33,594   29,077   18,386      948  317,243  237,775  399,068
Pattern: Estonian dictionaries overlap strongly with each other (e.g., EKSS ∩ EMS: 208,758). Finnish dictionaries likewise (SMS ∩ VKS: 317,243). Overlap between Estonian and Finnish dictionaries is notably smaller – they cover essentially different vocabularies that complement each other. The Seto dictionary is small but overlaps primarily with other Estonian dictionaries.
Conclusions
Dictionary coverage reaches 83.5% of word forms (974,102) via lemma lookup. Only 29% of lemmas have dictionary entries, but these tend to be high-frequency lemmas covering the majority of word forms.
Combined with DeepSeek AI translations, 99.7% of the vocabulary (1,163,350 forms) is covered. The remaining 0.3% consists of extremely rare or non-lexical items.
Finnish dictionaries (KKS, SMS, VKS) provide the greatest coverage by volume because the corpus contains more Finnish runosongs.
Estonian and Finnish dictionaries overlap little – they cover different vocabularies that complement each other.
Nine dictionaries (6 Estonian + 3 Finnish) collectively cover 98.4% of annotated word forms, with the Seto dictionary adding coverage for South-Estonian dialectal vocabulary.