RunoVerse – Dictionary Coverage

How well do dictionaries cover the runosong vocabulary?

What is dictionary coverage?

A word form is considered "covered" if its lemma (base form) appears as a headword in at least one of the nine dictionaries. For example, the word form laulnud is covered because its lemma laulma appears in EKSS. Coverage is measured at the word-form level: each of the 1,166,348 unique word forms is either covered or not.
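
The definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not RunoVerse's actual pipeline: the names `is_covered`, `coverage_rate`, `headwords_by_dict` and `form_to_lemma` are hypothetical, and apart from the laulnud / laulma / EKSS example the toy data is invented.

```python
from typing import Dict, Set


def is_covered(lemma: str, headwords_by_dict: Dict[str, Set[str]]) -> bool:
    """Return True if the lemma is a headword in at least one dictionary."""
    return any(lemma in headwords for headwords in headwords_by_dict.values())


def coverage_rate(form_to_lemma: Dict[str, str],
                  headwords_by_dict: Dict[str, Set[str]]) -> float:
    """Share of unique word forms whose lemma is covered (0.0 if there are no forms)."""
    if not form_to_lemma:
        return 0.0
    covered = sum(1 for lemma in form_to_lemma.values()
                  if is_covered(lemma, headwords_by_dict))
    return covered / len(form_to_lemma)


# Toy data (assumed): only laulnud -> laulma -> EKSS comes from the text above.
headwords_by_dict = {"EKSS": {"laulma"}, "EMS": {"laulma", "kuldne"}}
form_to_lemma = {
    "laulnud": "laulma",     # covered, because laulma is an EKSS headword
    "laulan": "laulma",      # a second word form of the same lemma
    "hellerei": "hellerei",  # invented refrain syllable with no headword
}

rate = coverage_rate(form_to_lemma, headwords_by_dict)
print(f"{rate:.1%} of word forms are covered")
```

Each word form counts once, so a single covered lemma with many inflected forms raises the rate more than a covered lemma with only one form.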

Which dictionaries are included?

Six Estonian dictionaries: EKSS (standard explanatory), EMS (dialect), IMS (eastern dialect), ERLA (rare/archaic words), VMS (riddles), Seto (Seto dialect). Three Finnish dictionaries: KKS (Karelian), SMS (Finnish dialects), VKS (old literary Finnish). Together they cover both traditions in the runosong corpus.

How to read the coverage bars

Each bar shows the number and percentage of word forms whose lemma has an entry in that dictionary. Bars can be sorted by coverage, alphabetically, or grouped by language. A word form may be counted in multiple dictionaries if its lemma appears in more than one.
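
As a rough sketch of how the per-dictionary bars could be derived, the snippet below counts word forms separately for each dictionary, so one form may be counted several times. The function name, the toy data, and the choice of the total number of word forms as the percentage denominator are all assumptions made for illustration.

```python
from typing import Dict, Set


def forms_per_dictionary(form_to_lemma: Dict[str, str],
                         headwords_by_dict: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count, for each dictionary, the word forms whose lemma is one of its headwords."""
    return {
        name: sum(1 for lemma in form_to_lemma.values() if lemma in headwords)
        for name, headwords in headwords_by_dict.items()
    }


# Toy data (assumed, not taken from the corpus).
form_to_lemma = {"laulnud": "laulma", "laulan": "laulma", "kulda": "kuld"}
headwords_by_dict = {"EKSS": {"laulma", "kuld"}, "EMS": {"laulma"}, "Seto": set()}

counts = forms_per_dictionary(form_to_lemma, headwords_by_dict)
total = len(form_to_lemma)

# Two of the sort orders mentioned above: by coverage and alphabetically.
by_coverage = sorted(counts.items(), key=lambda item: item[1], reverse=True)
alphabetical = sorted(counts.items())

for name, n in by_coverage:
    print(f"{name}: {n} ({n / total:.1%})")
```

Because the counts are not mutually exclusive, they can sum to more than the number of unique word forms.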

What does the overlap matrix show?

Each cell shows how many word forms are shared between two dictionaries. The diagonal shows each dictionary's total. Hover over a cell to see the overlap as a percentage. High overlap between Estonian dictionaries and between Finnish dictionaries is expected; cross-language overlap is much lower.
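
A matrix of this kind can be built from the set of covered word forms per dictionary. The sketch below assumes those sets are available in memory; `overlap_matrix`, `forms_by_dict` and the toy sets are hypothetical.

```python
from typing import Dict, Set


def overlap_matrix(forms_by_dict: Dict[str, Set[str]]) -> Dict[str, Dict[str, int]]:
    """Pairwise intersection sizes; the diagonal is each dictionary's own total."""
    names = list(forms_by_dict)
    return {
        a: {b: len(forms_by_dict[a] & forms_by_dict[b]) for b in names}
        for a in names
    }


# Toy sets of covered word forms (assumed, not taken from the corpus).
forms_by_dict = {
    "EKSS": {"laulnud", "laulan", "kulda"},
    "EMS": {"laulnud", "laulan"},
    "SMS": {"laulan"},
}

matrix = overlap_matrix(forms_by_dict)
for name, row in matrix.items():
    print(name, row)
```

Set intersection is symmetric, so the matrix mirrors itself across the diagonal.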

Word lookup

Use the search box to check which dictionaries cover a specific word. The lookup checks the annotation shards for dictionary matches and shows the first available definition.
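
The lookup behaviour described above might look roughly like the following. The real shard format is not documented here, so the in-memory layout (word form mapped to per-dictionary definitions), the `lookup` function, the priority order and the placeholder definition text are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical shard layout: word form -> {dictionary name: definition or None}.
Shard = Dict[str, Dict[str, Optional[str]]]


def lookup(word: str, shards: List[Shard],
           priority: List[str]) -> Tuple[List[str], Optional[str]]:
    """Return the dictionaries that match the word and the first available definition."""
    for shard in shards:
        entry = shard.get(word)
        if entry is None:
            continue  # word is not in this shard, try the next one
        matches = [name for name in priority if name in entry]
        # First non-empty definition, following the assumed priority order.
        definition = next((entry[name] for name in matches if entry[name]), None)
        return matches, definition
    return [], None


# Toy shard with placeholder text instead of a real dictionary definition.
shards: List[Shard] = [
    {"laulnud": {"EMS": None, "EKSS": "placeholder definition for laulma"}},
]
priority = ["EKSS", "EMS", "KKS"]

print(lookup("laulnud", shards, priority))
print(lookup("hellerei", shards, priority))
```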

Why is coverage as high as 83.5% when only 29% of lemmas have entries?

High-frequency lemmas (common verbs, nouns, particles) are well represented in the dictionaries, and these lemmas account for a disproportionate share of the distinct word forms. The uncovered 16.5% consists mostly of hapax legomena, exclamations, refrain syllables, and highly dialectal compounds.

RunoVerse – dictionary coverage analysis across nine dictionaries · 1,166,348 unique word forms

• Unique word forms: 1,166,348
• Lemmas: 439,746
• Word forms whose lemma has a dictionary entry: 83.5%
• Dictionaries: 9 (6 Estonian + 3 Finnish)
• Largest dictionary: KKS, 524,930 word forms (51.8%)
• Largest overlap: SMS ∩ VKS, 317,243 shared forms
• Uncovered vocabulary: 16.5% (192,246 forms without any dictionary entry)

Coverage layers

Dictionary coverage is measured via lemma lookup: a word form is covered if its lemma has a dictionary entry.

• Dictionary coverage (via lemma): 974,102 (83.5%)
• Dictionary or DeepSeek (via lemma): 1,163,350 (99.7%)
Only 29% of lemmas (128,919 of 439,746) have a dictionary entry, but these tend to be high-frequency lemmas, so 83.5% of word forms are covered via lemma lookup. Adding DeepSeek AI translations raises coverage to 99.7%.
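
The percentages in this section follow directly from the counts reported on this page. The short check below recomputes them; the constant names and the `pct` helper are introduced only for this example.

```python
# Counts as reported on this page.
TOTAL_FORMS = 1_166_348
COVERED_FORMS = 974_102
COVERED_WITH_AI = 1_163_350
TOTAL_LEMMAS = 439_746
LEMMAS_WITH_ENTRY = 128_919


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


form_coverage = pct(COVERED_FORMS, TOTAL_FORMS)        # word forms covered via lemma
lemma_coverage = pct(LEMMAS_WITH_ENTRY, TOTAL_LEMMAS)  # lemmas with a dictionary entry
combined = pct(COVERED_WITH_AI, TOTAL_FORMS)           # dictionary or DeepSeek
uncovered_forms = TOTAL_FORMS - COVERED_FORMS

print(form_coverage, lemma_coverage, combined, uncovered_forms)
```

The lemma share comes out at 29.3%, which the page rounds to 29%, and the uncovered remainder matches the 192,246 forms reported elsewhere on the page.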

Per-dictionary word form coverage

Each bar shows how many unique word forms the dictionary covers. Word forms may appear in multiple dictionaries.

Finnish dictionaries
• KKS (Karelian): 524,930 (51.8%)
• SMS (Finnish dialects): 500,972 (49.4%)
• VKS (Old literary language): 399,068 (39.4%)

Estonian dictionaries
• EKSS (Explanatory): 296,489 (29.2%)
• EMS (Dialect dictionary): 226,227 (22.3%)
• IMS (Eastern dialect): 192,728 (19.0%)
• ERLA (Rare words): 128,383 (12.7%)
• VMS (Riddles dictionary): 85,405 (8.4%)
• Seto (Seto dictionary): 19,774 (2.0%)
Finnish dictionaries (KKS, SMS, VKS) provide the largest coverage because the Finnic runosong corpus contains more Finnish poems by volume. EKSS is the broadest Estonian dictionary, covering standard vocabulary that also appears in runosongs.

Estonian vs Finnish dictionary coverage

[Chart: coverage of the Estonian dictionaries (6) compared with the Finnish dictionaries (3)]

Overall vocabulary breakdown

• Covered via lemma (83.5%): 974,102 word forms whose lemma has a dictionary entry
• Uncovered (16.5%): 192,246 word forms whose lemma has no dictionary entry
The uncovered 16.5% consists mainly of:
• Rare hapax legomena (single-occurrence forms)
• Exclamations, refrain particles, onomatopoeia
• Dialectal variants and compound words

Dictionary overlap matrix

Each cell gives the number of word forms covered by both dictionaries (the intersection). The diagonal shows each dictionary's total coverage.

         EKSS      EMS      IMS     ERLA      VMS     Seto      SMS      KKS      VKS
EKSS  296,489  208,758  187,662  122,837   81,605   17,180   82,365   78,001   48,438
EMS   208,758  226,227  146,174   91,952   64,213   16,888   68,780   54,217   42,252
IMS   187,662  146,174  192,728   82,753   58,283   15,467   54,395   52,995   33,594
ERLA  122,837   91,952   82,753  128,383   59,928      982   49,839   44,828   29,077
VMS    81,605   64,213   58,283   59,928   85,405    3,252   33,469   29,292   18,386
Seto   17,180   16,888   15,467      982    3,252   19,774    1,177    1,385      948
SMS    82,365   68,780   54,395   49,839   33,469    1,177  500,972  242,118  317,243
KKS    78,001   54,217   52,995   44,828   29,292    1,385  242,118  524,930  237,775
VKS    48,438   42,252   33,594   29,077   18,386      948  317,243  237,775  399,068
Pattern: Estonian dictionaries overlap strongly with each other (e.g., EKSS ∩ EMS: 208,758). Finnish dictionaries likewise (SMS ∩ VKS: 317,243). Overlap between Estonian and Finnish dictionaries is notably smaller – they cover essentially different vocabularies that complement each other. The Seto dictionary is small but overlaps primarily with other Estonian dictionaries.
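
Raw intersection counts are hard to compare across dictionaries of very different sizes. One way to normalise them is the overlap coefficient, |A ∩ B| / min(|A|, |B|). This measure is brought in here for illustration and is not necessarily the percentage the page shows on hover. The sketch uses a handful of values copied from the matrix above; `TOTALS`, `SHARED` and `overlap_coefficient` are hypothetical names.

```python
from typing import Dict, Tuple

# Diagonal values (dictionary totals) copied from the matrix above.
TOTALS: Dict[str, int] = {
    "EKSS": 296_489, "EMS": 226_227, "Seto": 19_774,
    "SMS": 500_972, "VKS": 399_068,
}

# A few off-diagonal cells copied from the matrix above.
SHARED: Dict[Tuple[str, str], int] = {
    ("EKSS", "EMS"): 208_758,
    ("EKSS", "Seto"): 17_180,
    ("EKSS", "SMS"): 82_365,
    ("Seto", "SMS"): 1_177,
    ("SMS", "VKS"): 317_243,
}


def overlap_coefficient(a: str, b: str) -> float:
    """Shared forms divided by the size of the smaller dictionary."""
    key = (a, b) if (a, b) in SHARED else (b, a)
    return SHARED[key] / min(TOTALS[a], TOTALS[b])


for a, b in SHARED:
    print(f"{a} ∩ {b}: {overlap_coefficient(a, b):.1%}")
```

On these figures most of the Seto forms are also covered by EKSS, while only a small fraction are covered by SMS, which agrees with the pattern described above.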

Conclusions
