Verse Similarity Analysis

Cross-algorithm comparison of verse-level similarity across 4.29 million Estonian and Finnish runosong verse occurrences, with formulaic cluster analysis and geographic spread.

What this page shows

This page presents a statistical analysis of verse-level similarity across the entire Finnic runosong corpus (4.29 million verse occurrences, 289,702 poems). It compares the matches that four similarity algorithms return for the same verses and identifies formulaic patterns, that is, near-identical verses that recur across poems and regions.

Algorithm Comparison Dashboard

Jaccard — exact wordform overlap between two verse lines (|intersection| / |union|), with adaptive minimum shared words and IDF weighting.
TF-IDF — lemma-level cosine similarity, weighted by term rarity across the corpus.
Translation — cross-lingual similarity via English translation vectors, enabling Estonian-Finnish matching.
Char Bigram — character bigram overlap using FAISS approximate nearest neighbor search, capturing orthographic similarity between verses.
Each card shows coverage (how many verses have matches), average score, and match type distribution (s = same language, x = cross-lingual, w = within-poem).
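
As a rough illustration of the two overlap-based measures above, the following Python sketch computes plain Jaccard similarity over wordforms and over character bigrams. It is a minimal sketch and not the page's implementation: it omits the IDF weighting, the adaptive minimum of shared words, and the FAISS approximate nearest neighbor search mentioned above. The function names and the second verse (a made-up spelling variant) are assumptions chosen for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Plain Jaccard: |intersection| / |union|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def wordforms(verse: str) -> set:
    """Lowercased whitespace-separated wordforms (no lemmatization)."""
    return set(verse.lower().split())


def char_bigrams(verse: str) -> set:
    """All overlapping two-character substrings, spaces included."""
    text = verse.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


# Illustrative input: the second verse is an invented spelling variant.
v1 = "vaka vanha väinämöinen"
v2 = "vaka vanha väinämöini"

word_score = jaccard(wordforms(v1), wordforms(v2))
bigram_score = jaccard(char_bigrams(v1), char_bigrams(v2))

print(f"wordform Jaccard: {word_score:.2f}")
print(f"char-bigram Jaccard: {bigram_score:.2f}")
```

Here the wordform score is 0.5 because one of three words differs, while the bigram score is higher because the differing words still share most of their characters. This is the sense in which character bigrams capture orthographic similarity that exact wordform matching misses.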

Cross-Algorithm Discordance

For verses that receive matches from more than one algorithm, this section measures how much their neighbor lists overlap. High discordance (low overlap) means the algorithms return largely different neighbors for the same verse, which suggests that they capture complementary aspects of similarity.
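
One way to quantify this is sketched below in Python, assuming that discordance is defined as one minus the Jaccard overlap of two algorithms' top-k neighbor lists, averaged over the verses both algorithms cover. The page does not state its exact formula, so this definition, the value of `k`, and all names and verse IDs here are hypothetical.

```python
def neighbor_overlap(neighbors_a, neighbors_b, k=10):
    """Jaccard overlap of the top-k neighbors from two ranked lists."""
    top_a, top_b = set(neighbors_a[:k]), set(neighbors_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0


def mean_discordance(results_a, results_b, k=10):
    """Average (1 - overlap) over verses present in both result sets.

    Returns None when the two algorithms share no verses.
    """
    shared = results_a.keys() & results_b.keys()
    if not shared:
        return None
    total = sum(
        1.0 - neighbor_overlap(results_a[v], results_b[v], k) for v in shared
    )
    return total / len(shared)


# Hypothetical neighbor lists keyed by verse ID.
jaccard_results = {"v1": ["v2", "v3", "v4"], "v5": ["v6"]}
tfidf_results = {"v1": ["v3", "v4", "v9"], "v7": ["v8"]}

discordance = mean_discordance(jaccard_results, tfidf_results)
print(discordance)
```

Only `v1` is covered by both algorithms in this toy input, so the result reflects that single verse: two shared neighbors out of four distinct ones.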

Formulaic Verse Clusters

The 200 largest groups of near-identical verses found by RRF (reciprocal rank fusion) neighborhood clustering on the combined similarity graph.
Size = total verse occurrences in the cluster. Places = distinct collection locations.
Click a cluster row to see its verse variants, geographic spread, and explore links.
Use language filters and text search to narrow the table. Click column headers to sort.
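
The fusion step behind the combined graph can be sketched as follows, assuming that RRF refers to standard reciprocal rank fusion, where each candidate's score is the sum of 1 / (k + rank) over the rankings in which it appears. The page does not describe how the fused neighborhoods are turned into clusters, so that step is not shown. The constant `k = 60` is a commonly used default and not a value taken from this page, and the function name and verse IDs are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked neighbor lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per item, with ranks starting at 1.
    Returns (item, score) pairs sorted by descending score, ties by item ID.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, verse_id in enumerate(ranking, start=1):
            scores[verse_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical neighbor rankings for one verse from three algorithms.
rankings = [
    ["b", "a", "c"],  # e.g. wordform overlap
    ["a", "b"],       # e.g. lemma-level cosine
    ["a", "d"],       # e.g. character bigrams
]

fused = rrf_fuse(rankings)
for verse_id, score in fused:
    print(verse_id, round(score, 4))
```

Because RRF uses only ranks, it can combine algorithms whose raw scores are on different scales, which is one plausible reason to prefer it for a graph built from several similarity measures.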

Geographic Spread Charts

The scatter plot shows how cluster size relates to geographic distribution. The histogram shows how many clusters occur at each number of distinct places. Formulas attested in many places are candidates for the most widely shared elements of Finnic oral poetry.
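
The quantities behind both charts can be derived from a list of verse occurrences, as in the Python sketch below. It assumes each occurrence is a (cluster ID, place) pair. The data layout, function names, cluster IDs, and place labels are all hypothetical placeholders and not the page's actual data model.

```python
from collections import Counter


def cluster_stats(occurrences):
    """Map each cluster ID to (size, number of distinct places).

    Size counts every verse occurrence; places are deduplicated.
    """
    sizes = Counter()
    places = {}
    for cluster_id, place in occurrences:
        sizes[cluster_id] += 1
        places.setdefault(cluster_id, set()).add(place)
    return {cid: (sizes[cid], len(places[cid])) for cid in sizes}


def places_histogram(stats):
    """Count how many clusters have each number of distinct places."""
    return Counter(n_places for _size, n_places in stats.values())


# Hypothetical occurrences: (cluster ID, collection place).
occurrences = [
    ("c1", "place_A"), ("c1", "place_A"), ("c1", "place_B"),
    ("c2", "place_A"), ("c2", "place_A"),
    ("c3", "place_C"),
]

stats = cluster_stats(occurrences)    # one scatter point per cluster
histogram = places_histogram(stats)   # histogram bars

print(stats)
print(dict(histogram))
```

Each `(size, places)` pair corresponds to one point in the scatter plot, and the histogram counts clusters per number of places. A large cluster confined to one place, like `c2` here, shows up as a point with high size but low spread.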

Algorithm Comparison Dashboard

Side-by-side statistics for each similarity algorithm. Match types: s = same language, x = cross-lingual, w = within-poem.

Charts: Cross-Lingual Match Percentage; Average Similarity Score; Verse Coverage (Verses with Matches); Total Similarity Pairs.

Cross-Algorithm Discordance

Pairwise comparison of how much the algorithms agree on their top neighbors for the same verses.

Formulaic Verse Clusters (Top 200)

The 200 largest groups of nearly identical verses found across the corpus. These represent formulaic expressions shared across poems and regions.

Geographic Spread of Formulaic Clusters

How widely the formulaic verse clusters are distributed across distinct collection places.

Charts: Cluster Size vs. Distinct Places (scatter plot); Distribution of Geographic Spread, Number of Places (histogram).