Complexity
Information-theoretic and lexical metrics computed per book and aggregated by genre. A quantitative view of how the text varies in structure, vocabulary, and compressibility.
Genre Comparison
Intertextuality Heatmap
Book-by-book cosine similarity, computed from each book's mean verse embedding. Warmer colors indicate higher semantic similarity between books.
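The heatmap values can be reproduced with a few lines of linear algebra. The sketch below assumes each book's verse embeddings arrive as an `(n_verses, d)` NumPy array (the embedding model itself is not specified here); it averages verses per book, normalizes, and takes pairwise dot products.

```python
import numpy as np

def similarity_matrix(book_verse_embeddings: list[np.ndarray]) -> np.ndarray:
    """Cosine similarity between per-book mean embedding vectors.

    book_verse_embeddings[i] is an (n_verses_i, d) array of verse
    embeddings for book i. Returns an (n_books, n_books) matrix.
    """
    # Mean-pool each book's verses into a single d-dimensional vector.
    means = np.stack([v.mean(axis=0) for v in book_verse_embeddings])
    # Normalize rows so the dot product equals cosine similarity.
    unit = means / np.linalg.norm(means, axis=1, keepdims=True)
    return unit @ unit.T
```

The diagonal is always 1.0 (each book is identical to itself), so the interesting signal is entirely off-diagonal.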
Hapax Legomena
Words appearing exactly once in the entire KJV Bible. Their distribution across books is far from uniform: they often mark unique narrative moments, technical terms, or translation artifacts.
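Finding these words is a straightforward corpus-wide frequency count. A minimal sketch, assuming the corpus is a mapping of book name to raw text and that tokenization is plain lowercased whitespace splitting (a real pipeline would also strip punctuation):

```python
from collections import Counter

def hapax_legomena(books: dict[str, str]) -> dict[str, list[str]]:
    """Words occurring exactly once in the whole corpus, grouped by book."""
    # Count every token across all books combined.
    corpus_counts = Counter()
    for text in books.values():
        corpus_counts.update(text.lower().split())
    # Hapaxes are defined corpus-wide, not per book.
    hapaxes = {w for w, c in corpus_counts.items() if c == 1}
    # Attribute each hapax to the (single) book it appears in.
    return {
        name: sorted(w for w in set(text.lower().split()) if w in hapaxes)
        for name, text in books.items()
    }
```

Note the distinction from the per-book hapax ratio described below: a word used once in Genesis and once in Exodus is a hapax within each book, but not in the corpus as a whole.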
What you're seeing
- Shannon entropy measures the unpredictability of the next token given the book's word distribution (higher = more diverse vocabulary usage).
- Compression ratio is gzip-compressed size ÷ raw size (lower = more repetitive, patterned text).
- Type-token ratio is unique words ÷ total words (higher = richer vocabulary relative to length).
- Hapax ratio is words used exactly once ÷ unique words, within each book.

The heatmap shows cosine similarity between per-book mean embedding vectors.
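All four per-book metrics can be computed from a book's raw text with the standard library alone. A minimal sketch, again assuming lowercased whitespace tokenization:

```python
import gzip
import math
from collections import Counter

def text_metrics(text: str) -> dict[str, float]:
    """The four per-book metrics described above."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)

    # Shannon entropy (bits per token) of the word distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

    # gzip-compressed size / raw size: lower = more repetitive.
    raw = text.encode("utf-8")
    compression_ratio = len(gzip.compress(raw)) / len(raw)

    # Unique words / total words.
    type_token_ratio = len(counts) / total

    # Words used exactly once / unique words, within this book.
    hapax_ratio = sum(1 for c in counts.values() if c == 1) / len(counts)

    return {
        "shannon_entropy": entropy,
        "compression_ratio": compression_ratio,
        "type_token_ratio": type_token_ratio,
        "hapax_ratio": hapax_ratio,
    }
```

One caveat: gzip's fixed header overhead inflates the ratio on very short texts, so compression ratio is most meaningful when comparing books of broadly similar length.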
Try this
- Sort by Shannon entropy descending — the most "surprising" books are short epistles with diverse vocabulary packed into few verses.
- Compare compression ratio across genres: narrative (History) compresses differently from poetry (Wisdom) or legal text (Law).
- In the heatmap, look for off-diagonal bright spots — these reveal unexpected semantic kinship across the canon.