A number of Estonian newspapers have been digitised, but not all. Some of the information about the digitisation is included in the ENB periodicals database. Here we have created a visual tool to get an overview based on that information.
It is more convenient to use the tool on a separate page: https://digilab.shinyapps.io/digitized_newspapers/
The tool Digitised newspapers in Estonia gives a visual overview of the state of newspaper digitisation in Estonia. The data comes from the periodicals section of the ENB. Note that not all newspapers digitised in the last few years have yet been marked as such.
The visual tool allows the user to select a specific period (by default 1800–1950) and, within that period, to show only the newspapers that have appeared in at least a certain number of years (the default is 10). These parameters can be changed with the arrows on the left-hand side of the application.
In addition, the thickness of the lines can be changed, since the number of newspapers displayed varies considerably with different parameters: the default settings show fewer than 100 newspapers, but displaying all the different editions can draw more than 1200 lines.
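As a rough illustration of this filtering logic, the R sketch below keeps only newspapers that appeared in at least a given number of years within the chosen period and draws them as a timeline. The data and thresholds are made up, and the sketch assumes contiguous publication years; it is not the application's actual code.

library(tidyverse)
# Made-up newspapers with first and last year of publication (illustration only)
papers <- tibble(
  title = c("Newspaper A", "Newspaper B", "Newspaper C"),
  start = c(1857, 1878, 1882),
  end   = c(1944, 1940, 1888)
)
period_start <- 1800; period_end <- 1950; min_years <- 10
papers %>%
  mutate(start = pmax(start, period_start), end = pmin(end, period_end)) %>%
  filter(end - start + 1 >= min_years) %>%          # appeared in at least min_years years
  ggplot(aes(x = start, xend = end, y = title, yend = title)) +
  geom_segment(linewidth = 1)                       # line thickness can be adjusted here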
The graph shows newspapers either individually or grouped. The grouped version places newspapers that can be considered continuations of an earlier one on a single line.
The graph can be viewed as an interactive version or as a static image. In the interactive graph, hovering the mouse over a line shows more information about that newspaper. The buttons on the top bar of the menu, or selecting an area of the graph, let you take a closer look at parts of the graph. In addition, you can click on the lines of digitised newspapers, which takes you to the home page of the digital collection hosting them and, if available, to that newspaper's collection there.
Note
The tool has been supported by the research project EKKD72 "Tekstiainese kasutusvõimalused digihumanitaaria juhtumiuuringutes Eesti ajalehekollektsioonide (1850-2020) näitel" (Possibilities of using text material in digital humanities case studies on the example of Estonian newspaper collections, 1850–2020).
Data was last updated in April 2023. Data and code can be accessed at OSF https://doi.org/10.17605/OSF.IO/B2HPX.
Folli, an AI-based solution of the National Heritage Board, is designed to automatically describe and systematise visual material in museum collections. One important and time-consuming part of describing images is naming the objects they contain. Folli can automatically find people, houses, furniture and other objects in photographs. This improves search capabilities and allows more detailed analyses of the content of collections.
One part of the development of Folli is a demo application that visualises a collection of more than 250 000 photos based on visual similarity. Photos with similar content are placed closer together in the photo cloud, thus forming clusters based on visual themes. For example, in one part of the cloud you might find Soviet-era photos of a birthday party, in another part of the cloud you might find photos of a zeppelin whizzing over Tallinn, and so on.
You can find the demo application here: http://folli.stacc.cloud/demo (the application may take a few minutes to start).
Folli compares images using numeric vector representations. The images are processed with the InceptionV3 artificial neural network and the UMAP dimensionality reduction algorithm. For the final visualization, the demo application's user interface uses PixPlot, a solution developed at Yale University.
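The general approach (feature extraction followed by dimensionality reduction) can be sketched roughly as follows in R, assuming the keras and umap packages are installed; this is only an illustration of the idea, not Folli's actual implementation, and photo_files is a hypothetical list of image paths.

library(keras)
library(umap)
# Load InceptionV3 without the classification head; pooling = "avg" gives one
# feature vector per image.
model <- application_inception_v3(weights = "imagenet", include_top = FALSE, pooling = "avg")
embed_image <- function(path) {
  img <- image_load(path, target_size = c(299, 299))
  x <- array_reshape(image_to_array(img), c(1, 299, 299, 3))
  as.numeric(predict(model, inception_v3_preprocess_input(x)))
}
# photo_files is a hypothetical vector of image file paths
embeddings <- t(sapply(photo_files, embed_image))
# Reduce the feature vectors to 2D coordinates for a "photo cloud" layout
layout_2d <- umap(embeddings, n_components = 2)$layout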
The technical implementation of the application is by STACC OÜ. The work was commissioned by the National Heritage Board of Estonia in cooperation with the National Library, and the project was funded by the European Regional Development Fund.
With this Python module, you can download the metadata collections of DIGAR (the National Library of Estonia's digital archive) and the ENB (Estonian National Bibliography) yourself. The module also provides functions to convert them from XML to TSV and JSON formats.
The module, along with the installation and usage guide, can be found here: https://github.com/RaRa-digiLab/metadata-handler
Since the datasets are also available on our datasets page, the module is primarily suitable for automated solutions and other experimentation purposes.
It is more convenient to use the tool on a separate page: https://digilab.shinyapps.io/dea_ngrams/
One way to get an idea of the content of large collections of texts is to look at the frequency of the words and phrases in them – which words are more or less frequent in different parts of the corpus. For example, Google's Ngram Viewer allows you to search for ngram frequencies over time in its digital collection.
Along the same lines, we've aggregated the content of the DEA newspaper articles by year. These datasets are also downloadable for use on a personal computer.
Unlike Google ngram search, we do not show results for all sources at once. Notably, the content of the collections changes significantly over time and we think that a simple search across all of them may prove rather misleading.
In the demo application, you can search for one-, two- and three-word sequences (1-, 2- and 3-grams).
Enter your search term in the text box above. You can choose between a corpus of unmodified full texts (gram) and a corpus of lemmatised texts (lemma). Clicking the load vocabulary button loads the lexicon of that corpus, so that similar words from the lexicon are suggested as you type. Loading the lexicon takes a few seconds.
If the search is successful, a graph appears on the right. It shows the word's frequency over time: the number of times the search term occurred per thousand words.
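The value plotted is simply the term's yearly count divided by the total number of words in that year, scaled to a thousand words. A minimal illustration with hypothetical column names:

library(tidyverse)
# Hypothetical yearly counts: term_count = occurrences of the search term,
# total_words = all words in that year's texts
freqs <- tibble(
  year = c(1900, 1901, 1902),
  term_count = c(12, 30, 7),
  total_words = c(950000, 1100000, 870000)
) %>%
  mutate(per_1000_words = term_count / total_words * 1000)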
Currently, the demo application includes texts from the newspaper Postimees from 1880 to 1940.
Making ngrams from digitised material requires a number of simplifying steps. Errors can occur when converting images of printed text into machine-readable text: letters may be misrecognised within words, words may be broken into pieces, or random shadows on the paper may be mistaken for words. These are collectively called OCR (Optical Character Recognition) errors.
In order to avoid the impact of these errors on the results, all word-like units containing only one letter have been excluded from the analyses.
To simplify the calculations, all words that occur fewer than 40 times in the whole collection, as well as all word–year pairs where the word occurs fewer than 10 times in that year, have been excluded.
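In terms of a word-frequency table, these cleaning steps amount to filters like the ones below. The table and column names (gram, year, count) are hypothetical; this is a sketch of the idea, not the exact code used.

library(tidyverse)
# ngrams is a hypothetical table with one row per gram and year
cleaned <- ngrams %>%
  filter(nchar(gram) > 1) %>%      # drop one-letter word-like units
  group_by(gram) %>%
  filter(sum(count) >= 40) %>%     # keep grams with at least 40 occurrences overall
  ungroup() %>%
  filter(count >= 10)              # keep only years with at least 10 occurrences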
The tool can be found here: https://marta.nlib.ee
MARTA is a prototype for automatic keyword tagging of Estonian-language articles. The prototype takes text as input (either in plain text format, downloaded from a given URL, or extracted from an uploaded file). Optionally, the user can select applicable methodologies and/or article domains. In the next step, the text is lemmatised and part-of-speech tags are assigned using the MLP (multilingual preprocessor) tool from the Texta Toolkit. After lemmatisation, keyword tagging methods are applied to extract keywords from the text.
The detected keywords are compared with the Estonian Thesaurus (EMS). If a keyword also appears in the EMS, a checkmark is displayed next to it. The identified keywords can be exported from the application in MARC format.
You can find a more detailed user guide for the prototype here (in Estonian).
When working with data, it is essential to have a good understanding of your dataset: the sources of the data, how it has been processed, and what aspects can be trusted or not. For specific analyses, it is advisable to structure the dataset in a way that aligns with the research objectives and analytical tools being used.
To facilitate obtaining an overview, digiLab provides the DEA (Digitized Estonian Articles) metadata browser. It is a visual environment for gaining insights into the content of the dataset. The metadata browser operates on extracted metadata from the accessible collection. In the JupyterHub environment, you can access the same metadata with the following command.
all_issues <- get_digar_overview()
The application is also accessible in a separate Shiny environment.
In addition to the DEA user interface, there are times when direct access to the texts is needed. For this purpose, digiLab has adopted the JupyterLab environment, which allows direct access to the raw texts. It enables working with R or Python code, downloading data, saving analysis results to your computer, and sharing them with others.
To obtain a username for using the JupyterLab environment, please contact digilab@rara.ee.
When using and reusing texts, it is essential to observe the license conditions.
Access to files is supported by the R package digar.txts. With a few simple commands, it allows you to:
1) Get an overview of the dataset with file associations
2) Create subsets of the dataset
3) Perform text searches
4) Extract immediate context from search results
You can also save search results as a table and continue working with a smaller subset of data elsewhere.
Here are the commands:
get_digar_overview(): Retrieves an overview of the entire collection (at a high level)
get_subset_meta(): Retrieves metadata for a subset of the data (at the article level)
do_subset_search(): Performs a search within a subset and saves the results to a file (article by article)
get_concordances(): Finds concordances in the search results, displaying the search term and its immediate context.
These commands allow you to explore the collection, obtain specific subsets of data, conduct searches, and extract relevant information for further analysis.
For intermediate processing, various R packages and commands are suitable. For processing in Python, the data should be collected and a new Python notebook should be created beforehand.
# Install the package remotes, if needed. JupyterLab should already have it.
# install.packages("remotes")
# Since the JupyterLab that we use does not have write access to
# all locations, we specify a local folder for our packages.
dir.create("~/R_pckg")
remotes::install_github("peeter-t2/digar.txts", lib = "~/R_pckg/", upgrade = "never")
library(digar.txts, lib.loc = "~/R_pckg/")
all_issues <- get_digar_overview()
library(tidyverse)
# Build a subset: Postimees ("postimeesew") newspaper issues between 1880 and 1940
subset <- all_issues %>%
  filter(DocumentType == "NEWSPAPER") %>%
  filter(year > 1880 & year < 1940) %>%
  filter(keyid == "postimeesew")
subset_meta <- get_subset_meta(subset)
#potentially write to file, for easier access if returning to it
#readr::write_tsv(subset_meta,"subset_meta_postimeesew1.tsv")
#subset_meta <- readr::read_tsv("subset_meta_postimeesew1.tsv")
do_subset_search(searchterm="lurich", searchfile="lurich1.txt",subset)
library(data.table)
texts <- fread("lurich1.txt", header = FALSE)[, .(id = V1, txt = V2)]
concs <- get_concordances(searchterm="[Ll]urich",texts=texts,before=30,after=30,txt="txt",id="id")
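# Optionally save the concordances as a table so you can continue working with
# this smaller subset of the data elsewhere (the file name is just an example).
readr::write_tsv(concs, "lurich_concordances.tsv")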
subset2 <- all_issues %>%
  filter(DocumentType == "NEWSPAPER") %>%
  filter(year > 1951 & year < 2002) %>%
  filter(keyid == "stockholmstid")
# The subset2 from stockholmstid has 0 issues with section-level data, but 2178
# issues with page-level data, so in this case pages should be used. When
# combining sources that have page-level and section-level data, custom
# combinations can be made based on the question at hand. Note that the pages
# data also includes the sections data when available, so using both at the
# same time can bias the results.
# subset2 %>% filter(sections_exist==T) %>% nrow()
# subset2 %>% filter(pages_exist==T) %>% nrow()
subset_meta2 <- get_subset_meta(subset2, source="pages")
do_subset_search(searchterm="eesti", searchfile="eesti1.txt",subset2, source="pages")
Convenience suggestion: to make Ctrl+Shift+M insert %>% in JupyterLab as it does in RStudio, add this code under Settings -> Advanced Settings Editor… -> Keyboard Shortcuts, in the User Preferences box on the left.
{
"shortcuts": [
{
"command": "notebook:replace-selection",
"selector": ".jp-Notebook",
"keys": ["Ctrl Shift M"],
"args": {"text": '%>% '}
}
]
}
To access the JupyterHub environment, log in at jupyter.hpc.ut.ee.
It is more convenient to use the tool on a separate page: https://peetertinits.github.io/reports/nlib/all_works_geo.html
The metadata of the Estonian National Bibliography records the place of publication of most books. For this tool, place names have been linked to coordinates, and the books have been placed on a map according to their place of publication to give an overview of the dataset.
The map displays the locations where books were published for each decade and provides information on the number of books published. You can change the displayed time period on the map, zoom in or out, and focus on specific locations. The size of the circle represents the number of books published in that location during the respective years – the larger the circle, the more books were published there.
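A map like this can be sketched in R with the leaflet package. The places, coordinates and counts below are made up, and the sketch is only an illustration of the idea, not the tool's actual code.

library(leaflet)
# Made-up publication places and book counts; circle size scales with the count
places <- data.frame(
  place = c("Tallinn", "Tartu", "Pärnu"),
  lng = c(24.75, 26.72, 24.50),
  lat = c(59.44, 58.38, 58.39),
  n_books = c(1200, 800, 90)
)
leaflet(places) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lng, lat = ~lat, radius = ~sqrt(n_books) / 2,
                   label = ~paste0(place, ": ", n_books, " books"))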
The points on the map can be clicked to access the corresponding records in Ester, which is a larger database compared to the Estonian National Bibliography. Ester includes not only works directly related to Estonia but also other works that may have some connection to Estonia.
Note: There may be some inaccuracies in marking the geographical locations, primarily due to the existence of multiple places with the same name in the world. These errors are being addressed and corrected.
DIGAR – National Library of Estonia's digital archive
DEA – DIGAR Estonian Articles
ENB – Estonian National Bibliography