Similarity Explorer

Explore how Estonian and Finnish folk poems relate to each other. This page compares 292,000 poems and 4.29 million verse lines from three archives, using seven different methods to find similarities — from shared words to shared meaning across languages.

Tip: Press ? on your keyboard to toggle this panel. It auto-selects the section matching your current tab.

Five views

How similarity is measured

Seven methods look for connections in different ways:

Scores are percentages: higher = more similar. Colors match algorithms throughout the page.

Algorithm | UI label | Measures | Cross-lingual?
TF-IDF | TF-IDF Lemma | Shared lemmas (IDF-weighted cosine) | No
Jaccard | Wordform Overlap | Shared word forms (set intersection) | No
Translation | Thematic (Translation) | Shared English translations (pivoted cosine) | Yes
Alignment | Alignment | Verse sequence structure (Wagner-Fischer DP) | Partial
Semantic | Semantic (GloVe) | GloVe 300d + SIF embedding cosine | Yes
Verse Match | Verse Match | 4-algorithm verse-level RRF aggregated to poem | Mixed
RRF | Combined (RRF) | Reciprocal Rank Fusion of multiple algorithms | Depends
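
As a rough illustration of the first two rows, the sketch below computes a set-intersection overlap and an IDF-weighted cosine on pre-tokenized text. The function names, the smoothed IDF formula, and the whitespace-free token lists are assumptions made for this example; the page does not document its exact weighting or tokenization.

```python
import math
from collections import Counter


def jaccard(a_tokens, b_tokens):
    """Set-intersection overlap of word forms: |A & B| / |A | B|."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def idf_weights(docs):
    """Inverse document frequency per term (assumed form: log(N / df) + 1)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def tfidf_cosine(a_tokens, b_tokens, idf):
    """Cosine similarity of two TF-IDF vectors held as sparse dicts."""
    va = {t: c * idf.get(t, 0.0) for t, c in Counter(a_tokens).items()}
    vb = {t: c * idf.get(t, 0.0) for t, c in Counter(b_tokens).items()}
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Both functions return a value between 0 and 1; multiplying by 100 gives a percentage of the kind shown on the page. Because rare terms carry larger IDF weights, two texts that share a rare lemma score higher under `tfidf_cosine` than two texts that share only a common one.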

Data: 292,092 poems (SKVR ~89K Finnish, JR ~38K Finnish/Ingrian, ERAB ~165K Estonian), 4,291,553 verse lines.

Full methodology guide →

Poem Similarity

Start by searching for a poem (by archive ID like “H II 1, 389” or by a word in the text). You’ll see the most similar poems ranked by whichever method you choose.

Side-by-side comparison

The pair buttons (e.g., “TF-IDF vs Jaccard”) show two rankings next to each other — useful for seeing which matches are found by different methods. Poems appearing in both columns are especially reliable matches.

Combined (RRF)

Merges evidence from multiple methods. A poem ranks high only if several algorithms agree. The algorithm badges (T, J, Tr, A) on each card show which methods contributed.
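
Reciprocal Rank Fusion can be sketched in a few lines. This is the standard textbook form, in which each ranking contributes 1 / (k + rank) to an item's score. The constant `k=60` is the commonly used default and is an assumption here, as are the function name and the input shape; the page does not state its parameters.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each list adds 1 / (k + rank) to an item's score.

    rankings maps a method badge (e.g. "T", "J") to a list of poem ids,
    best match first. k=60 is an assumed default, not a documented value.
    Returns (poem_id, fused_score, contributing_methods), best first.
    """
    scores = defaultdict(float)
    contributors = defaultdict(list)
    for method, ranked_ids in rankings.items():
        for rank, poem_id in enumerate(ranked_ids, start=1):
            scores[poem_id] += 1.0 / (k + rank)
            contributors[poem_id].append(method)
    fused = sorted(scores, key=lambda poem_id: scores[poem_id], reverse=True)
    return [(poem_id, scores[poem_id], contributors[poem_id]) for poem_id in fused]


# Hypothetical poem ids, for illustration only.
example = reciprocal_rank_fusion({
    "T": ["p1", "p2", "p3"],
    "J": ["p2", "p1"],
    "Tr": ["p2", "p4"],
})
```

In this toy input, `p2` comes out on top because three methods rank it highly, while `p3` and `p4` each appear in only one list. The list of contributing methods plays the same role as the badges on each card.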

Poem comparison overlay

Click “Compare” on any match to see both poems verse by verse. Highlighted words show what’s shared.

Curved lines between the poems connect matching verses. Thicker lines = stronger matches.

Network graph & similarity map

A web of connections shows how your poem relates to its matches and their matches. The similarity map plots where matches were collected geographically.

Tips

Verse Similarity

Search for a verse line to find similar verses across the corpus. Results come from four methods (shared words, rare words, translations, and letter patterns) plus a combined view.

Formulaic Patterns

Below the results: the Formulaic Patterns browser shows the 200 largest groups of recurring verse lines in the tradition. Click a cluster to load its representative verse and explore.

Verse Pairs

Browse the strongest verse-to-verse connections found across the whole corpus — the 50,000 most similar pairs, ranked by combined evidence from shared words, rare words, translations, and letter patterns.

Algorithm badges: J (blue) = Jaccard, T (green) = TF-IDF, Tr (teal) = Translation, C (pink) = CharBigram.
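
The letter-pattern method (CharBigram) can be pictured as comparing sets of adjacent character pairs. The sketch below uses the Dice coefficient on bigram sets. That choice, the lowercasing, and the function names are assumptions for illustration; the page does not say which bigram measure it uses.

```python
def char_bigrams(text):
    """Return the set of adjacent character pairs in a normalized string."""
    s = " ".join(text.lower().split())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def char_bigram_dice(a, b):
    """Dice coefficient on character-bigram sets (assumed measure)."""
    bigrams_a, bigrams_b = char_bigrams(a), char_bigrams(b)
    if not bigrams_a or not bigrams_b:
        return 0.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))
```

Two spellings that differ in a single letter still share most of their bigrams, which is why this kind of measure can catch dialect spelling differences that exact word matching misses.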

The By Place Pair view shows, for each pair of places, the number of identical formulas and the number of near-matches, which add up to a combined total. Sort by combined count, identical formulas, near-matches, best RRF score, or cross-lingual near-matches.

Geography

Five map-based views show where poetic traditions connect geographically.

Verse Network

Lines on the map connect collection locations; thicker lines mean stronger connections. Green = same language, gold = cross-lingual.

Cross-Lingual Formulas

Estonian-Finnish connections: which places and song types bridge the two traditions.

Poem Connections

Poem-level connections on the map. Lines connect places where similar poems were collected.

Combined

Merges verse-level and poem-level connections onto one map.

Place Focus

Pick a single place and see all its connections radiating outward.

Formulaicity

See which parts of a poem are traditional formulas shared across the tradition, and which are found nowhere else in the corpus. Each verse gets a color: dark green = widely shared, yellow = found nowhere else, gray = too short to score.

Each verse is scored by the maximum similarity across Jaccard, TF-IDF, Thematic (Translation), and CharBigram. Verses with fewer than 2 distinct words are excluded (SHORT badge). 99.8% of scorable verses match something in the corpus, which is consistent with the Parry-Lord theory of oral-formulaic composition. Algorithm badges on each match (Jac/TF/Tx/Bi) show which method found it.
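
The scoring rule described above can be sketched as follows. The function name, the whitespace tokenization, and the input dictionary of per-algorithm best scores are assumptions for this example; only the "maximum across four algorithms" rule and the two-distinct-word cutoff come from the description.

```python
ALGORITHMS = ("Jac", "TF", "Tx", "Bi")


def verse_formulaicity(verse_text, best_scores):
    """Return (score, badge) for a verse, or (None, "SHORT") if too short.

    best_scores maps an algorithm badge to the best similarity (0-100)
    that algorithm found for this verse elsewhere in the corpus.
    Splitting on whitespace is an assumed tokenization.
    """
    distinct_words = set(verse_text.lower().split())
    if len(distinct_words) < 2:
        return None, "SHORT"
    badge = max(ALGORITHMS, key=lambda name: best_scores.get(name, 0.0))
    return best_scores.get(badge, 0.0), badge
```

Taking the maximum rather than an average means a verse counts as formulaic as soon as any one method finds a strong match for it.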

Glossary

Term | Explanation
Alignment | Compares poems by matching verses in order, like aligning two song sheets
CharBigram | Compares how words look letter-by-letter; catches dialect spelling differences
Combined / RRF | Merges rankings from several methods; strong only when multiple agree
Cosine similarity | Measures how similar two lists of numbers are (0 = unrelated, 1 = identical)
Cross-lingual | Across languages; here, Estonian and Finnish
ERAB | Estonian runosong archive (~165,000 poems)
Formula | A verse line that appears in many poems across the tradition
Jaccard | Counts shared exact words between two texts
JR | Finnish/Ingrian archive of unpublished runosong (~38,000 poems)
Lemma | Dictionary form of a word (“laulu” and “laulude” → “laul”)
Near-match | Algorithmically similar but not identical verses
Runosong | Traditional Baltic-Finnic oral poetry (Estonian: regilaul, Finnish: runolaulu)
SKVR | Main Finnish runosong archive (~89,000 poems)
Normalized surprise | How unexpected a connection is, given the sizes of both places: divides shared count by the geometric mean of both collections
TF-IDF | Matches root words, weighting rare shared words more heavily
Tradition Strength | Total quality of verse connections between two places
Thematic (Translation) | Compares texts through English translations; works across languages
Verse Match | Compares verse lines (shared words, rare words, translations, letter patterns), then aggregates to a poem score
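
As a worked example of the "Normalized surprise" entry, the sketch below divides a shared count by the geometric mean of two collection sizes. The function name is hypothetical, and whether "size" means poems or verse lines per place is not stated in the glossary, so treat the inputs as an assumption.

```python
import math


def normalized_surprise(shared_count, size_a, size_b):
    """Shared count divided by the geometric mean of two collection sizes."""
    if size_a <= 0 or size_b <= 0:
        raise ValueError("collection sizes must be positive")
    return shared_count / math.sqrt(size_a * size_b)
```

With 30 shared items between places of size 100 and 900, the geometric mean is 300 and the result is 0.1. The same 30 shared items between two places of size 100 each would give 0.3, so connections between small collections stand out more.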

New here? Press ? or click ? Help for a guide to this page.

Pick a poem and see which other poems are most like it. Choose two methods to compare side by side, or use Combined for an overall ranking. Click any match to jump to that poem and keep exploring.

Poem Similarity Explorer

Compare poems across two similarity algorithms side-by-side. Search for any poem above, or use ?poem=ID in the URL.
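
When building a `?poem=ID` link by hand or in a script, an ID that contains spaces or commas needs percent-encoding. The sketch below shows one way to do that; the base URL is a placeholder, and whether the parameter takes an archive ID such as “H II 1, 389” or an internal identifier is an assumption.

```python
from urllib.parse import quote, urlencode


def poem_url(base_url, poem_id):
    """Build a link that preselects a poem through the ?poem= parameter."""
    query = urlencode({"poem": poem_id}, quote_via=quote)
    return f"{base_url}?{query}"


# Placeholder base URL; substitute the real address of the explorer page.
link = poem_url("https://example.org/explorer", "H II 1, 389")
```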

Seven algorithms: TF-IDF Lemma, Wordform Overlap (Jaccard), Thematic (Translation), Alignment, Semantic (GloVe), Verse Match, and Combined (RRF).

Poem Comparison

Search for a verse line and find similar ones across the corpus. Results are compared by shared words, rare words, translations, and letter patterns, plus a combined view.

  • Click the arrow on a match to explore from that verse — the verse trail tracks where you’ve been
  • Language badges (ET/FI) show which tradition each match comes from

Formulaic Patterns (curated entry points)

What are formula clusters?

In runosong, singers composed by combining stock verse formulas. A formula cluster groups verse lines that are near-identical variants of each other, found across many poems and places.

How they were discovered

The system analyzed 4.29 million verse lines using four similarity algorithms (Jaccard, TF-IDF, Translation, CharBigram), then combined evidence to identify clusters. The 200 largest clusters are shown here.

Reading the display

Members = total verse occurrences. Places = distinct collection locations. The language bar shows the Estonian (blue) to Finnish (orange) proportion. Bilingual clusters appear in both traditions.

Integration

Click “Explore in Verse Tab” to load a formula into the Verse Explorer above.

The 50,000 most similar verse pairs, ranked by combined evidence from shared words, rare words, translations, and letter patterns. Use By Place Pair to see which collection places share the most verse material.

Formulaicity

See which parts of a poem are traditional formulas shared across the tradition, and which are found nowhere else in the corpus.

Color strip

Each cell = one verse. Dark green = widely shared formula, yellow = found nowhere else, gray = too short to score.

How to use

Click a verse to see its top 5 matches from other poems. Score 0–100: 80+ = well-known formula, under 20 = largely unique. Algorithm badges (Jac/TF/Tx/Bi) show which method found each match.
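
The score bands mentioned above can be written as a small lookup. Only the two thresholds (80 and above, under 20) come from the text; the label for the middle range and the function name are assumptions.

```python
def describe_score(score):
    """Map a 0-100 formulaicity score to the bands described in the help text."""
    if score is None:
        return "too short to score"
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 80:
        return "well-known formula"
    if score < 20:
        return "largely unique"
    return "in between"  # assumed label; the page names only the two ends
```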

Top 20 Most Formulaic Poems
