Verses
Tokens
Unique Words
Hapax Legomena

Genre Comparison


Intertextuality Heatmap

Book-by-book cosine similarity computed from each book's mean verse embedding. Warmer colors indicate higher semantic similarity between books.
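The similarity matrix behind the heatmap can be sketched as follows, assuming a hypothetical `book_vecs` mapping from book name to its mean verse embedding (the embedding model used by the page is not specified here):

```python
import numpy as np

def book_similarity(book_vecs):
    """Pairwise cosine similarity between per-book mean embeddings."""
    # book_vecs: {book name: mean verse embedding} -- hypothetical input.
    names = list(book_vecs)
    M = np.stack([book_vecs[n] for n in names])
    M = M / np.linalg.norm(M, axis=1, keepdims=True)  # L2-normalize each row
    return names, M @ M.T  # dot product of unit vectors = cosine similarity

names, sim = book_similarity({
    "Genesis": np.array([1.0, 0.0]),
    "Exodus":  np.array([1.0, 1.0]),
})
# Diagonal entries are 1.0 (each book is identical to itself);
# off-diagonal entries are the cross-book similarities shown in the heatmap.
```

Normalizing rows first means one matrix product yields the whole book-by-book grid at once, which is why the heatmap is cheap to render even for all 66 books.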


Hapax Legomena

Words appearing exactly once in the entire KJV Bible. Their distribution across books is informative: they often mark unique narrative moments, technical terms, or translation artifacts.
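Finding hapax legomena reduces to counting token frequencies and keeping the count-one words. A minimal sketch, assuming a simple regex tokenizer (the page's actual tokenization rules may differ):

```python
import re
from collections import Counter

def hapax_legomena(text):
    """Return the words that occur exactly once in the text."""
    # Lowercase and split on letter runs -- a simplifying assumption.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [word for word, count in counts.items() if count == 1]

sample = "In the beginning God created the heaven and the earth"
print(sorted(hapax_legomena(sample)))
# "the" occurs three times, so it is excluded; every other word is a hapax.
```

Run over a whole corpus instead of one verse, the same `Counter` gives both the per-book hapax lists and the global ones shown in the table.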


What you're seeing

  • Shannon entropy measures the unpredictability of the next token given the book's word distribution (higher = more diverse vocabulary usage).
  • Compression ratio is gzip-compressed size ÷ raw size (lower = more repetitive/patterned text).
  • Type-token ratio is unique words ÷ total words (higher = richer vocabulary relative to length).
  • Hapax ratio is words-used-once ÷ unique words within each book.
  • The heatmap shows cosine similarity between per-book mean embedding vectors.
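All four per-book metrics fall out of one pass over the text. A minimal sketch, assuming the same simple lowercase word tokenizer throughout (the page's real pipeline may tokenize differently):

```python
import gzip
import math
import re
from collections import Counter

def book_metrics(text):
    """Compute entropy, compression ratio, TTR, and hapax ratio for one book."""
    tokens = re.findall(r"[a-z']+", text.lower())  # simplifying assumption
    counts = Counter(tokens)
    n = len(tokens)
    # Shannon entropy in bits per token over the book's word distribution.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    raw = text.encode("utf-8")
    return {
        "entropy": entropy,
        "compression": len(gzip.compress(raw)) / len(raw),  # lower = more repetitive
        "ttr": len(counts) / n,                             # unique / total words
        "hapax_ratio": sum(c == 1 for c in counts.values()) / len(counts),
    }

metrics = book_metrics("the quick brown fox the lazy dog")
```

Note that on text this short gzip's header overhead pushes the compression ratio above 1; the ratios quoted on the page only become meaningful at whole-book length.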

Try this

  • Sort by Shannon entropy descending — the most "surprising" books are short epistles with diverse vocabulary packed into few verses.
  • Compare compression ratio across genres: narrative (History) compresses differently from poetry (Wisdom) or legal text (Law).
  • In the heatmap, look for off-diagonal bright spots — these reveal unexpected semantic kinship across the canon.