Word n-grams in newspapers

It is more convenient to use the tool on a separate page: https://digilab.shinyapps.io/dea_ngrams/

One way to get an idea of the content of a large collection of texts is to look at the frequency of the words and phrases in it: which words are more or less frequent in different parts of the corpus. For example, Google lets you search for ngram frequencies over time in its digital collection (the Google Books Ngram Viewer).

Along the same lines, we've aggregated the content of the DEA newspaper articles by year. These datasets are also downloadable for use on a personal computer.

Unlike Google ngram search, we do not show results for all sources at once. The content of the collections changes significantly over time, so a simple search across all of them could be rather misleading.

In the demo application, you can search for sequences of one, two or three words (1-, 2- and 3-grams).
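As a rough illustration of what these units are, the sketch below splits a sentence into word 1-, 2- and 3-grams. The function name word_ngrams and the simple whitespace tokenisation are assumptions made for this example; they are not the processing used to build the DEA datasets.

```python
def word_ngrams(text, n):
    """Return all word n-grams in `text` as space-joined strings."""
    # Naive tokenisation (an assumption): lowercase, split on whitespace.
    tokens = text.lower().split()
    # Slide a window of n tokens across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the old town of Tartu"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)
```

Here bigrams holds every pair of adjacent words and trigrams every run of three adjacent words, so a five-word sentence yields five 1-grams, four 2-grams and three 3-grams.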

Enter your search term in the text box above. You can choose between corpora of unmodified full texts (gram) and lemmatised texts (lemma). Clicking the load vocabulary button loads the lexicon of the selected corpus, which lets the application suggest similar words from the lexicon as you add words. Loading the lexicon takes a few seconds.

If the search is successful, a graph appears on the right. The graph shows the frequency of the search term over time: the number of times it appeared per thousand words.
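The "per thousand words" measure can be sketched as follows. This is a minimal example that assumes the input is a list of (year, text) pairs and that the search term is a single word; the function name per_thousand_by_year and the whitespace tokenisation are hypothetical and do not describe how the application itself is implemented.

```python
from collections import Counter


def per_thousand_by_year(docs, term):
    """Return {year: occurrences of `term` per 1,000 words} for (year, text) pairs."""
    hits = Counter()    # occurrences of the term in each year
    totals = Counter()  # total number of words in each year
    term = term.lower()
    for year, text in docs:
        tokens = text.lower().split()  # naive tokenisation (assumption)
        totals[year] += len(tokens)
        hits[year] += tokens.count(term)
    # Normalise by the yearly word total so that years with more text
    # are comparable with years with less text.
    return {
        year: 1000 * hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year] > 0
    }


docs = [(1900, "a b a c"), (1901, "b b c d e")]
freq = per_thousand_by_year(docs, "a")
```

Dividing by the yearly total is what makes the curve comparable across years: a raw count would rise simply because more pages were printed or digitised in some years.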

Currently, the demo application includes texts from the newspaper Postimees from 1880 to 1940.

Making ngrams from digitised material requires a number of simplifying steps. Errors can occur when converting illustrated printed text into machine-readable text: letters may be misplaced within words, words may be broken up into chunks, or random shadows on the paper may be mistaken for words. All of these are called OCR (Optical Character Recognition) errors.

To reduce the impact of these errors on the results, all word-like units consisting of a single letter have been excluded from the analyses.

To simplify the calculations, all words that occur fewer than 40 times in the collection have been excluded, as have all years in which a word occurs fewer than 10 times.
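The exclusion rules above could be applied to a table of yearly counts roughly as in the sketch below. It assumes the counts are stored as a dictionary of the form {word: {year: count}} and reads the second rule as dropping individual word and year combinations with a count below 10; the function name filter_counts, the data layout and the order in which the rules are applied are all assumptions, not a description of the actual pipeline.

```python
def filter_counts(counts, min_total=40, min_per_year=10):
    """Filter {word: {year: count}} using the thresholds described in the text."""
    kept = {}
    for word, by_year in counts.items():
        # Rule 1: drop single-letter units, which are often OCR noise.
        if len(word) <= 1:
            continue
        # Rule 2: drop words with fewer than min_total occurrences overall.
        if sum(by_year.values()) < min_total:
            continue
        # Rule 3: drop the years in which the word is too rare.
        years = {y: c for y, c in by_year.items() if c >= min_per_year}
        if years:  # keep the word only if at least one year survives
            kept[word] = years
    return kept


counts = {
    "a": {1900: 100},                       # single letter: excluded
    "ja": {1900: 30, 1901: 5, 1902: 12},    # kept, but 1901 is dropped
    "harv": {1900: 20, 1901: 15},           # total of 35 is below 40: excluded
}
filtered = filter_counts(counts)
```

Thresholds like these shrink the datasets considerably, at the cost of hiding rare words and making a word's curve drop to nothing in years where it was merely infrequent.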
