Access to DEA texts

In addition to the DEA user interface, there are times when direct access to the texts is needed. For this purpose, digiLab has adopted the JupyterLab environment, which allows direct access to the raw texts. It enables working with R or Python code, downloading data, saving analysis results to your computer, and sharing them with others.

To obtain a username for using the JupyterLab environment, please contact digilab@rara.ee.

When using and reusing texts, it is essential to observe the license conditions.

User guide

Figure 1. Content of the DEA collection. Open-access issues are shown in green; issues shown in red are accessible at an authorized workstation in the National Library of Estonia or under a special arrangement.

The Digitized Estonian Articles can be searched through the web interface at https://dea.digar.ee/?l=en and are also accessible as a dataset. An overview of the dataset is available on a separate page (in Estonian).
The data can be accessed through the cloud-based JupyterHub environment, where you can execute code and create Jupyter Notebooks using R and Python.
In the JupyterHub environment, you can access full texts and metadata, conduct your own analysis, and download your findings. The data is open for everyone to use.

Before you begin

To use the environment, you need a username from ETAIS. To obtain one, please contact digilab@rara.ee.
For convenient access to the dataset, an R package called digar.txts has been created. It allows you to extract subsets from the entire collection and perform full-text searches.
When processing data, you have the option to use your own code, rely on sample analyses, or extract search results in tabular format. This provides flexibility in analyzing the data according to your needs and preferences.

Quickstart

Go to https://jupyter.hpc.ut.ee/ and log in with the username provided to you by digilab@rara.ee.
Choose the default settings (1 CPU core, 8 GB memory, 6h timelimit).
Create a new Notebook with an R kernel.
Use the code provided in the code usage section.
Upload the examples for beginners in either Estonian or English.
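Explore the sample analyses: hobu, elekter, aur 20. saj algul ("horse, electricity, steam in the early 20th century"; .html, .ipynb, .Rmd; in Estonian).
Workshops: Nelijärve 5. nov 2020 (Nelijärve, 5 November 2020), Kevad 2020 eestikeelne lühikursus tekstitöötlusest R-is (Spring 2020 short course on text processing in R); both in Estonian.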

Starting with JupyterHub

Go to https://jupyter.hpc.ut.ee/ and log in.

Choose the first option provided (1 CPU core, 8 GB memory, 6h timelimit). This starts a data processing session that stays open for six hours. All your files will be permanently stored under your username.

Please wait while the machine starts up. It may take a few minutes depending on the queue. Sometimes refreshing the page can also help.

If the startup is successful, you should see something like this. On the left side, you can upload files (using the upload button or by dragging files into the box) or create new files (using the "+" sign). On the right side, there are code cells, notebooks, and materials. In the example, a new Jupyter Notebook is currently open.

In the Notebook, you can use either Python or R. Each notebook needs an appropriate computational environment (kernel). You can select it when creating a new document, or in an existing document by going to Kernel -> Change Kernel or by clicking on the kernel name in the top right corner. This will open the next view.

Access to texts is currently available through R. We recommend making an initial query with the R tools described below and then continuing with your preferred tools.

R package

Access to files is supported by the R package digar.txts. With a few simple commands, it allows you to:
1) Get an overview of the dataset with file associations
2) Create subsets of the dataset
3) Perform text searches
4) Extract immediate context from search results

You can also save search results as a table and continue working with a smaller subset of data elsewhere.

Here are the commands:

get_digar_overview() – Retrieves an overview of the entire collection (at a high level)
get_subset_meta() – Retrieves metadata for a subset of the data (at the article level)
do_subset_search() – Performs a search within a subset and saves the results to a file (article by article)
get_concordances() – Finds concordances in the search results, displaying the search term and its immediate context.

These commands allow you to explore the collection, obtain specific subsets of data, conduct searches, and extract relevant information for further analysis.

For intermediate processing, various R packages and commands are suitable. To process the data in Python, first collect the data with the R tools and then create a new Python notebook.
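As a minimal sketch of the Python route, the following reads a search results file of the kind saved by do_subset_search() into a list of records. It assumes the file has no header row and holds one result per line, with an id and a text separated by a tab; the tab delimiter, the file name, and the sample ids and texts are assumptions made for illustration, so check them against your own file.

```python
import tempfile
from pathlib import Path


def read_search_results(path):
    """Read 'id<TAB>text' lines into a list of {"id": ..., "txt": ...} dicts."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the first tab only, so any tabs inside the text are kept.
            doc_id, _, text = line.partition("\t")
            records.append({"id": doc_id, "txt": text})
    return records


# Illustrative sample standing in for a real results file (ids and texts are made up).
sample = "doc_001\tGeorg Lurich maadles Tallinnas.\ndoc_002\tLURICH võitis.\n"
sample_path = Path(tempfile.mkdtemp()) / "lurich1.txt"
sample_path.write_text(sample, encoding="utf-8")

records = read_search_results(sample_path)
print(len(records), records[0]["id"])
```

Using the column names id and txt keeps the Python records aligned with the naming used in the R examples.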

Use of code

First, load the supporting packages from the shared library location.

suppressPackageStartupMessages(library(tidyverse,lib.loc="/gpfs/space/projects/digar_txt/R/4.3/"))
suppressPackageStartupMessages(library(tidytext,lib.loc="/gpfs/space/projects/digar_txt/R/4.3/"))

Then load the digar.txts package:

suppressPackageStartupMessages(library(digar.txts,lib.loc="/gpfs/space/projects/digar_txt/R/4.3/"))

Use get_digar_overview() to get an overview of the collection (at the issue level).

all_issues <- get_digar_overview()

Build a custom subset with any R tools. Here is a tidyverse-style example.

library(tidyverse)
# Keep newspaper issues from 1881 to 1939 with the key id "postimeesew"
subset <- all_issues %>%
    filter(DocumentType == "NEWSPAPER") %>%
    filter(year > 1880 & year < 1940) %>%
    filter(keyid == "postimeesew")

Get meta information on that subset with get_subset_meta(). If you plan to reuse this information, it can be useful to store it in a file using the commented-out lines.

subset_meta <- get_subset_meta(subset)
#potentially write to file, for easier access if returning to it
#readr::write_tsv(subset_meta,"subset_meta_postimeesew1.tsv")
#subset_meta <- readr::read_tsv("subset_meta_postimeesew1.tsv")

Do a search with do_subset_search(). This exports the search results into a file. do_subset_search() ignores case.

do_subset_search(searchterm="lurich", searchfile="lurich1.txt",subset)

Read the search results. Use any R tools. It's useful to name the id and text columns id and txt.

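# fread() comes from the data.table package; attach it with library(data.table) if it is not already available
texts <- fread("lurich1.txt", header = FALSE)[, .(id = V1, txt = V2)]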

Get concordances using the get_concordances() command.

concs <- get_concordances(searchterm="[Ll]urich",texts=texts,before=30,after=30,txt="txt",id="id")
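To illustrate what a concordance is, here is a small Python sketch that collects each match of a regular expression together with a fixed number of characters before and after it. It is not the implementation of get_concordances(); the function name, the output fields, the choice to count context in characters, and the sample records are all assumptions made for this example.

```python
import re


def concordances(pattern, records, before=30, after=30):
    """Return one row per regex match, with `before`/`after` characters of context."""
    rows = []
    for record in records:
        text = record["txt"]
        for match in re.finditer(pattern, text):
            start, end = match.span()
            rows.append({
                "id": record["id"],
                "left": text[max(0, start - before):start],
                "match": match.group(0),
                "right": text[end:end + after],
            })
    return rows


# Made-up records in the same id/txt shape as the search results.
records = [
    {"id": "doc_001", "txt": "Eile maadles Georg Lurich Tallinnas suure publiku ees."},
    {"id": "doc_002", "txt": "LURICH võitis."},
]

rows = concordances("[Ll]urich", records, before=10, after=10)
for row in rows:
    print(row["id"], "|", row["left"] + "[" + row["match"] + "]" + row["right"])
```

In this sketch the pattern "[Ll]urich" is case-sensitive apart from the first letter, so the all-caps "LURICH" in the second record is skipped, even though a search that ignores case would have found that text. With Python's re module, a pattern such as "(?i)lurich" matches every case variant.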

Note that many sources have not been segmented into articles during digitization. For these, both metadata and text need to be accessed at the page level, where the files are located in a different folder. The sequence for pages would be:

subset2 <- all_issues %>%
    filter(DocumentType=="NEWSPAPER") %>%
    filter(year>1951&year<2002) %>%
    filter(keyid=="stockholmstid")

# The subset2 from stockholmstid has 0 issues with section-level data but 2178 issues with page-level data, so pages should be used here.
# When combining sources that have page-level and section-level data, custom combinations can be made based on the question at hand.
# Note that the pages data also includes the sections data when available, so using both at the same time can bias the results.
# subset2 %>% filter(sections_exist==T) %>% nrow()
# subset2 %>% filter(pages_exist==T) %>% nrow()

subset_meta2 <- get_subset_meta(subset2, source="pages")

do_subset_search(searchterm="eesti", searchfile="eesti1.txt",subset2, source="pages")

Convenience suggestion: to make Ctrl+Shift+M insert %>% in JupyterLab, as it does in RStudio, add this code under Settings -> Advanced Settings Editor… -> Keyboard Shortcuts, on the left in the User Preferences box.

{
    "shortcuts": [
        {
            "command": "notebook:replace-selection",
            "selector": ".jp-Notebook",
            "keys": ["Ctrl Shift M"],
            "args": {"text": "%>% "}
        }
    ]
}

Basic R commands

<- - assign a value to a variable
%>% - a 'pipe' that passes the output of one function to the input of the next one
filter() - keep only the rows that match a condition
count() - count specific values
mutate() - make a new column
head(n) - show the first n rows

To access the JupyterHub environment, log in at jupyter.hpc.ut.ee.