EEBO-TCP statistics

How character frequencies vary over time. Uses 25-year periods. Focuses on the most common characters, those with a relative frequency of at least 1/10000 in the entire corpus. Each character's count is divided by the total number of characters in the same time period.
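The computation described above can be sketched roughly as follows. This is a minimal illustration, not the code used to produce the charts: the input format (a list of `(year, text)` pairs), the function names, and the choice to align periods to multiples of 25 are all assumptions.

```python
from collections import Counter, defaultdict

PERIOD = 25  # years per bucket, as in the character charts


def period_start(year, width=PERIOD):
    """First year of the bucket containing `year` (assumes buckets start at multiples of `width`)."""
    return year - year % width


def relative_char_frequencies(texts, min_overall=1e-4):
    """texts: iterable of (year, string) pairs (hypothetical input format).

    Returns {period_start: {char: count / total characters in that period}},
    restricted to characters whose relative frequency in the whole corpus
    is at least `min_overall`.
    """
    per_period = defaultdict(Counter)
    overall = Counter()
    for year, text in texts:
        counts = Counter(text)
        per_period[period_start(year)].update(counts)
        overall.update(counts)

    total = sum(overall.values())
    common = {c for c, n in overall.items() if n / total >= min_overall}

    result = {}
    for period, counts in per_period.items():
        period_total = sum(counts.values())
        result[period] = {c: counts[c] / period_total for c in common}
    return result
```

Dividing by the per-period total rather than the corpus total keeps periods with few surviving texts comparable to periods with many.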

How word frequencies vary over time. Uses 50-year periods. Organised by the letters occurring in the words. For each letter, shows the 25 most common words. Within each letter, the words are ordered roughly by their frequency variation: first the words that appear more frequently in early texts, in the middle the words that occur more or less evenly throughout, and finally the words that are more common in the later parts of the corpus. The numbers on the left are overall rankings, with “1” denoting the most common word in the entire corpus (“the”). Word frequencies are divided by the frequency of the most common word, so “the” will be a flat line in the bar chart.
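A rough sketch of the normalisation and ordering described above. Two things here are assumptions rather than the method actually used: that the division by the most common word is done separately within each period, and that "ordered roughly by frequency variation" can be approximated by a frequency-weighted mean period (`centroid`, a hypothetical helper).

```python
from collections import Counter


def normalise_by_top_word(counts_by_period):
    """counts_by_period: {period_start: Counter of words} (hypothetical input format).

    Divides every count in a period by that period's count of the word
    that is most common in the corpus as a whole.
    """
    overall = Counter()
    for counts in counts_by_period.values():
        overall.update(counts)
    top_word = overall.most_common(1)[0][0]

    rel = {}
    for period, counts in counts_by_period.items():
        base = counts[top_word]
        rel[period] = {w: n / base for w, n in counts.items()} if base else {}
    return top_word, rel


def centroid(word, rel):
    """Frequency-weighted mean period index: low values mean early, high values mean late."""
    periods = sorted(rel)
    weights = [rel[p].get(word, 0.0) for p in periods]
    total = sum(weights)
    return sum(i * w for i, w in enumerate(weights)) / total if total else 0.0
```

Sorting a letter's words with `sorted(words, key=lambda w: centroid(w, rel))` would then place early words first, evenly spread words in the middle, and late words last.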

Green shades indicate large values (characters or words that are relatively common in this period), purple shades indicate small values.

Data source

Freely available texts from EEBO-TCP. To download everything (approx. 8 GB):

    git clone https://github.com/textcreationpartnership/Texts.git eebo-tcp
    cd eebo-tcp
    ./graball.sh

How counted?

All texts with the following properties were processed:

The “Status” field contains the value “Free”.
The “EEBO” field is non-empty.
The “Date” field contains a single valid year before 1800.

If the Date field contains a range of years, question marks, etc., the entire text is skipped. The total number of texts that were processed is 24,910 (out of 25,368 freely available texts from EEBO-TCP).
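The selection rules above might be expressed as follows. This is a sketch under assumptions: the metadata record is represented as a plain dict with `Status`, `EEBO` and `Date` keys, "contains the value Free" is read as an exact match, and a valid year is taken to be exactly four digits.

```python
import re


def single_year(date_field, limit=1800):
    """Return the year as an int if the field is one plain year before `limit`, else None."""
    s = date_field.strip()
    # Ranges ("1642-1650"), question marks ("1642?") and the like fail this match.
    if not re.fullmatch(r"[0-9]{4}", s):
        return None
    year = int(s)
    return year if year < limit else None


def should_process(record):
    """record: dict of metadata fields (hypothetical representation)."""
    return (
        record.get("Status", "").strip() == "Free"
        and record.get("EEBO", "").strip() != ""
        and single_year(record.get("Date", "")) is not None
    )
```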

XML tags are ignored. All non-XML characters inside the <text> element are counted. Multiple consecutive white space characters (space, non-breaking space, tab, newline, etc.) are collapsed into a single space. No spelling normalisation is done.
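For illustration, the extraction step could look like this in Python using only the standard library. The actual processing uses pugixml from C++, so this is an approximation; in particular, matching the element by its local name and taking only the first `<text>` element are assumptions.

```python
import re
import xml.etree.ElementTree as ET


def extract_text(xml_string):
    """Return the character data inside the first <text> element, whitespace collapsed."""
    root = ET.fromstring(xml_string)
    content = ""
    for elem in root.iter():
        # Compare local names so this works with or without an XML namespace.
        if isinstance(elem.tag, str) and elem.tag.rsplit("}", 1)[-1] == "text":
            # itertext() yields only character data, so tags are ignored.
            content = "".join(elem.itertext())
            break
    # In Python 3, \s on str also matches non-breaking space (U+00A0).
    return re.sub(r"\s+", " ", content)
```

No case folding or spelling normalisation is applied, in line with the description above.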

Each Unicode character counts as a separate character. In particular, upper and lower case characters are treated as separate characters.
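A small Python illustration of what this means in practice, assuming that "Unicode character" here means a single code point: case variants are counted separately, and a base letter followed by a combining mark counts as two characters.

```python
from collections import Counter

# "e" followed by U+0301 (combining acute accent) is two code points,
# even though it renders as a single accented letter.
sample = "Aa e\u0301"
counts = Counter(sample)

for ch, n in sorted(counts.items()):
    print(f"U+{ord(ch):04X} {n}")
```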

A word consists of a sequence of letters. For example, “God's” counts as two words, “God” and “s”.
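The word rule above can be sketched in a few lines of Python. It assumes that "letter" corresponds to `str.isalpha`, which covers Unicode letters such as the long s (“ſ”) but not digits, punctuation or combining marks; the original C++ code may classify some characters differently.

```python
from itertools import groupby


def words(text):
    """Split text into maximal runs of letters; everything else is a separator."""
    return ["".join(run) for is_letter, run in groupby(text, key=str.isalpha) if is_letter]
```

With this definition, `words("God's word")` gives `["God", "s", "word"]`: the apostrophe is not a letter, so it ends the first word.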

Tools

C++, Python, pugixml for XML parsing, UTF8-CPP for UTF-8 parsing, lxml for HTML output.

pugixml is fast; you can parse the entire EEBO-TCP, with 8 GB of XML, in a matter of seconds.

Contact

Jukka Suomela