The network of Estonian translations of literary works

It is more convenient to use the tool on a separate page: https://data.digar.ee/tkirjandus/index.html

Folli
Folli uses AI to analyze photos in museum collections
Folli groups images in museum collections into clusters based on algorithmic similarity.
Zooming into the cloud, you can see that visually similar images sit side by side.

Folli, an AI-based solution of the National Heritage Board, is designed to automatically describe and systematise visual material in museum collections. One important and time-consuming part of describing images is naming the objects they contain; Folli can find people, houses, furniture and other objects in photographs automatically. This, in turn, improves search capabilities and allows more detailed analyses of the content of collections.

One part of the development of Folli is a demo application that visualises a collection of more than 250 000 photos based on visual similarity. Photos with similar content are placed closer together in the photo cloud, thus forming clusters based on visual themes. For example, in one part of the cloud you might find Soviet-era photos of a birthday party, in another part of the cloud you might find photos of a zeppelin whizzing over Tallinn, and so on.

You can find the demo application here: http://folli.stacc.cloud/demo
(the application may take a few minutes to start)

Folli compares images as numeric vectors. The images are processed using the InceptionV3 artificial neural network and the UMAP dimensionality reduction algorithm. The demo application's user interface uses PixPlot, a solution developed at Yale University, for the final visualisation.
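For illustration, here is a minimal sketch of this kind of pipeline in R. It is not Folli's actual code: it assumes the keras and uwot packages are installed, and the photos/ folder of images is a placeholder.

library(keras)  # assumes the keras R package with a TensorFlow backend
library(uwot)   # an R implementation of UMAP

# Pre-trained InceptionV3 without its classification head; global average
# pooling turns each image into a single 2048-dimensional vector.
model <- application_inception_v3(weights = "imagenet",
                                  include_top = FALSE, pooling = "avg")

embed_image <- function(path) {
  img <- image_load(path, target_size = c(299, 299))  # InceptionV3 input size
  x <- image_to_array(img)
  x <- array_reshape(x, c(1, dim(x)))
  x <- inception_v3_preprocess_input(x)
  predict(model, x)[1, ]
}

paths <- list.files("photos", full.names = TRUE)  # placeholder image folder
embeddings <- t(sapply(paths, embed_image))

# Project the vectors to 2D: nearby points correspond to visually similar images.
coords <- umap(embeddings, n_components = 2)
plot(coords, pch = 20, xlab = "UMAP 1", ylab = "UMAP 2")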

STACC OÜ is behind the technical implementation of the application. The work has been commissioned by the National Heritage Board of Estonia in cooperation with the National Library and the project has been funded by the European Regional Development Fund.

DIGAR and ENB metadata downloader
Tool for downloading and converting RaRa metadata

With this Python module, you can download the metadata collections of DIGAR (National Library of Estonia Digital Archives) and ENB (Estonian National Bibliography) by yourself. The module also provides functions to convert them from XML to TSV and JSON formats.

The module, along with the installation and usage guide, can be found here: https://github.com/RaRa-digiLab/metadata-handler

Since the datasets are also available on our datasets page, the module is primarily suitable for automated solutions and other experimentation purposes.
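The module's own functions are documented in its README. Purely as a generic illustration of what the XML-to-JSON step involves (this is not the module's API), base R tooling can convert a made-up record:

library(xml2)      # XML parsing
library(jsonlite)  # JSON serialisation

record <- read_xml("<record><title>Kalevipoeg</title><year>1862</year></record>")
toJSON(as_list(record), auto_unbox = TRUE)  # nested list -> JSON string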

Word n-grams in newspapers
Tool for visualising the frequency of word n-grams in newspapers

The application can also be found on the following standalone page: https://digilab.shinyapps.io/dea_ngrams/

One way to get an idea of the content of large text collections is to look at the frequency of the words and phrases in them: which words are more or less frequent in different parts of the corpus. For example, the Google Books Ngram Viewer lets you search for n-gram frequencies over time in Google's digitised collection.

Along the same lines, we've aggregated the content of the DEA newspaper articles by year. These datasets are also downloadable for use on a personal computer.

Unlike the Google n-gram search, we do not show results for all sources at once: the composition of the collections changes significantly over time, and we think a single search across all of them could be misleading.

In the demo application, you can search for one-, two- and three-word sequences (1-, 2- and 3-grams).

Enter your search term in the text box above. You can choose between corpora of unmodified full texts (gram) and lemmatised texts (lemma). Clicking the load vocabulary button loads the lexicon of that corpus, which lets the application suggest similar words from the lexicon as you type. Loading the lexicon takes a few seconds.

If the search succeeds, a graph appears on the right. It shows the word frequencies over time: the number of times the search term appeared per thousand words.

Currently, the demo application includes texts from the newspaper Postimees from 1880 to 1940.

Making n-grams from digitised material requires a number of simplifying steps. Errors can occur when page images of printed text are converted into machine-readable text: letters may be misrecognised within words, words may be broken up into chunks, or random shadows on the paper may be mistaken for words. All of these are called OCR (Optical Character Recognition) errors.

In order to avoid the impact of these errors on the results, all word-like units containing only one letter have been excluded from the analyses.

To simplify the calculations, all words that occur fewer than 40 times in the whole collection, and all years in which a word occurs fewer than 10 times, have been excluded.
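As a sketch of how such a frequency series can be computed under these thresholds (the tables counts, with columns word, year and n, and totals, with columns year and total, are assumptions for illustration, not the application's actual schema):

library(dplyr)

freqs <- counts %>%
  group_by(word) %>%
  filter(sum(n) >= 40) %>%             # drop words with < 40 occurrences overall
  ungroup() %>%
  filter(n >= 10) %>%                  # drop years where a word occurs < 10 times
  left_join(totals, by = "year") %>%   # add the yearly word totals
  mutate(per_1000 = 1000 * n / total)  # occurrences per thousand words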

Automatic keyword tagger MARTA
Estonian articles keyword tagger prototype

The tool can be found here: https://marta.nlib.ee (website is temporarily unavailable)

MARTA is a prototype for automatic keyword tagging of Estonian articles. The prototype takes text as input (either in plain text format, downloaded from a given URL, or extracted from an uploaded file). Optionally, the user can select applicable methodologies and/or article domains. In the next step, the text is lemmatized and part-of-speech tags are assigned using the MLP10 (multilingual preprocessor) tool from Texta Toolkit. After lemmatization, keyword tagging methods are applied to extract the following keywords from the text:

  • Topical keywords
  • Personal names
  • Locations and geopolitical entities
  • Organizations
  • Temporal keywords

The detected keywords are compared with the Estonian Thesaurus (EMS). If a keyword also appears in the EMS, a checkmark is displayed next to it. The identified keywords can be exported from the application in MARC format.
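As a toy illustration of that comparison step (not MARTA's code; the term list is a hypothetical excerpt):

keywords <- c("raamatukogu", "Tallinn", "digiteerimine")  # extracted keywords
ems_terms <- c("raamatukogu", "digiteerimine")            # hypothetical EMS excerpt
data.frame(keyword = keywords, in_ems = keywords %in% ems_terms)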

You can find a more detailed user guide for the prototype here (in Estonian).

Newspapers metadata browser
DEA newspapers metadata browser

When working with data, it is essential to have a good understanding of your dataset: the sources of the data, how it has been processed, and what aspects can be trusted or not. For specific analyses, it is advisable to structure the dataset in a way that aligns with the research objectives and analytical tools being used.

To facilitate obtaining an overview, digiLab provides the DEA (Digitized Estonian Articles) metadata browser. It is a visual environment for gaining insights into the content of the dataset. The metadata browser operates on extracted metadata from the accessible collection. In the JupyterHub environment, you can access the same metadata with the following command.

all_issues <- get_digar_overview()
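For example, a quick first look at the overview table (DocumentType, year and keyid are column names used in the examples later on this page):

library(tidyverse)
all_issues %>% count(DocumentType, sort = TRUE)  # issues per document type
all_issues %>%
  filter(DocumentType == "NEWSPAPER") %>%
  count(year) %>%                                # newspaper issues per year
  head()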

The application is also accessible in a separate Shiny environment.

Access to DEA texts
Access to DEA newspaper full texts (JupyterLab, OAI)

In addition to the DEA user interface, there are times when direct access to the texts is needed. For this purpose, digiLab has adopted the JupyterLab environment, which allows direct access to the raw texts. It enables working with R or Python code, downloading data, saving analysis results to your computer, and sharing them with others.

To obtain a username for using the JupyterLab environment, please contact digilab@rara.ee.

When using and reusing texts, it is essential to observe the license conditions.

User guide

Figure 1. Content of the DEA collection. Open access is indicated by green, issues in red are accessible at an authorized workstation in the National Library of Estonia or under a special arrangement.
  • The Digitized Estonian Articles can be searched through the web interface at https://dea.digar.ee/?l=en and are also accessible as a dataset. An overview of the dataset is available on a separate page (in Estonian).
  • The data can be accessed through the cloud-based JupyterHub environment, where you can execute code and create Jupyter Notebooks using R and Python.
  • In the JupyterHub environment, there is access to full texts and metadata, the ability to conduct your own analysis, and the option to download your findings. The data is open for everyone to use.

Before you begin

  • To use the environment, it is necessary to obtain a username from ETAIS. For obtaining a username, please contact data@digar.ee or digilab@rara.ee.
  • For convenient access to the dataset, an R package called digar.txts has been created. It allows you to extract subsets from the entire collection and perform full-text searches.
  • When processing data, you have the option to use your own code, rely on sample analyses, or extract search results in tabular format. This provides flexibility in analyzing the data according to your needs and preferences.

Quickstart


Getting started with JupyterHub

  • Choose the first option provided (1 CPU core, 8 GB memory, 6 h time limit). This opens a data processing session for six hours. All your files are stored permanently under your username.
  • Please wait while the machine starts up. It may take a few minutes depending on the queue. Sometimes refreshing the page can also help.
  • If the startup is successful, the JupyterLab interface opens. On the left side, you can upload files (using the upload button or by dragging them into the file panel) or create new files (using the "+" sign). On the right side, there are code cells, notebooks, and materials.
  • In the Notebook, you can use either Python or R. When using a notebook, you need to select the appropriate computational environment (kernel). This can be done when creating a new document, or in an existing document via Kernel -> Change Kernel or by clicking on the kernel name in the top right corner.
  • Access to texts is currently available through R. We recommend making an initial query using these tools and then using your preferred tools thereafter.

R package

Access to files is supported by the R package digar.txts. With a few simple commands, it allows you to:
1) Get an overview of the dataset with file associations
2) Create subsets of the dataset
3) Perform text searches
4) Extract immediate context from search results

You can also save search results as a table and continue working with a smaller subset of data elsewhere.

Here are the commands:

  • get_digar_overview(): Retrieves an overview of the entire collection (at a high level)
  • get_subset_meta(): Retrieves metadata for a subset of the data (at the article level)
  • do_subset_search(): Performs a search within a subset and saves the results to a file (article by article)
  • get_concordances(): Finds concordances in the search results, displaying the search term and its immediate context.

These commands allow you to explore the collection, obtain specific subsets of data, conduct searches, and extract relevant information for further analysis.

For intermediate processing, various R packages and commands are suitable. For processing in Python, the data should be collected and a new Python notebook should be created beforehand.

Use of code

  1. First, install the required package
#Install package remotes, if needed. JupyterLab should have it.
#install.packages("remotes")

#Since the JupyterLab that we use does not have write access to
#all the files, we specify a local folder for our packages.
dir.create("~/R_pckg/")
remotes::install_github("peeter-t2/digar.txts", lib = "~/R_pckg/", upgrade = "never")
  2. Activate the installed package:
library(digar.txts,lib.loc="~/R_pckg/")
  3. Use get_digar_overview() to get an overview of the collections (issue level).
all_issues <- get_digar_overview()
  4. Build a custom subset using any tools in R. Here is a tidyverse-style example.
library(tidyverse)
subset <- all_issues %>%
    filter(DocumentType=="NEWSPAPER") %>%
    filter(year>1880&year<1940) %>%
    filter(keyid=="postimeesew")
  5. Get meta information on that subset with get_subset_meta(). If this information is reused, it can be useful to store it in a file, as in the commented lines.
subset_meta <- get_subset_meta(subset)
#potentially write to file, for easier access if returning to it
#readr::write_tsv(subset_meta,"subset_meta_postimeesew1.tsv")
#subset_meta <- readr::read_tsv("subset_meta_postimeesew1.tsv")
  6. Do a search with do_subset_search(). This exports the search results into a file. do_subset_search() ignores case.
do_subset_search(searchterm="lurich", searchfile="lurich1.txt",subset)
  7. Read the search results using any R tools. It is useful to name the id and text columns id and txt.
library(data.table)
texts <- fread("lurich1.txt", header = FALSE)[, .(id = V1, txt = V2)]
  8. Get concordances using the get_concordances() command.
concs <- get_concordances(searchterm="[Ll]urich",texts=texts,before=30,after=30,txt="txt",id="id")
  9. Note that many sources have not been segmented into articles during digitisation. For these, both metadata and texts must be accessed at the page level, where the files are located in a different folder. The sequence for pages would be:
subset2 <- all_issues %>%
    filter(DocumentType=="NEWSPAPER") %>%
    filter(year>1951&year<2002) %>%
    filter(keyid=="stockholmstid")

# The subset2 from stockholmstid has 0 issues with section-level data,
# but 2178 issues with page-level data, so pages should be used here.
# When combining sources with page- and section-level data, custom
# combinations can be made based on the question at hand. Note that the
# pages data also includes the sections data when available, so using
# both at the same time can bias the results.
# subset2 %>% filter(sections_exist==T) %>% nrow()
# subset2 %>% filter(pages_exist==T) %>% nrow()

subset_meta2 <- get_subset_meta(subset2, source="pages")

do_subset_search(searchterm="eesti", searchfile="eesti1.txt",subset2, source="pages")

Convenience suggestion: to make Ctrl-Shift-M insert %>% in JupyterLab as it does in RStudio, add this snippet under Settings -> Advanced Settings Editor… -> Keyboard Shortcuts, in the User Preferences box on the left.

{
    "shortcuts": [
        {
            "command": "notebook:replace-selection",
            "selector": ".jp-Notebook",
            "keys": ["Ctrl Shift M"],
            "args": {"text": "%>% "}
        }
    ]
}

Basic R commands

  • <- – assigns a value to a variable
  • %>% – a 'pipe' that passes the output of one function to the input of the next
  • filter() – filters rows of your data
  • count() – counts specific values
  • mutate() – makes a new column
  • head(n) – shows the first n lines
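A toy example combining these commands on the overview table from above:

library(tidyverse)
issues_per_year <- all_issues %>%           # <- saves the result to a variable
  filter(DocumentType == "NEWSPAPER") %>%   # keep only newspapers
  count(year) %>%                           # count issues per year
  mutate(decade = year - year %% 10)        # derive a decade column
head(issues_per_year, 5)                    # show the first five rows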

To access the JupyterHub environment, log in at jupyter.hpc.ut.ee.

Mapping books
Mapping books (Estonian National Bibliography)

In the metadata of the Estonian National Bibliography, the publication location of most books is recorded. In this tool, a connection has been established between place names and coordinates to provide an overview of the dataset, and the books have been placed on a map based on their place of publication.

The map displays the locations where books were published for each decade and provides information on the number of books published. You can change the displayed time period on the map, zoom in or out, and focus on specific locations. The size of the circle represents the number of books published in that location during the respective years – the larger the circle, the more books were published there.
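As a minimal sketch of the underlying idea (not the tool's actual code; the places, coordinates and column names are made up for illustration):

library(dplyr)
library(leaflet)  # interactive maps in R

books <- data.frame(place = c("Tallinn", "Tartu", "Tallinn"))  # toy records
coords <- data.frame(place = c("Tallinn", "Tartu"),
                     lat = c(59.44, 58.38), lon = c(24.75, 26.72))

books %>%
  count(place) %>%                  # books per publication place
  left_join(coords, by = "place") %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lon, lat = ~lat, radius = ~sqrt(n) * 5,
                   label = ~paste0(place, ": ", n, " books"))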

The points on the map can be clicked to access the corresponding records in Ester, which is a larger database compared to the Estonian National Bibliography. Ester includes not only works directly related to Estonia but also other works that may have some connection to Estonia.

Note: There may be some inaccuracies in the geographical locations, mainly because several places in the world share the same name. These errors are being addressed and corrected.

Digital Humanities Series (in Estonian)
Digital Humanities Series of DIGAR, DEA and ENB

DIGAR – National Library of Estonia’s digital archive & DEA – DIGAR Estonian Articles

In this episode, we take a closer look at the National Library's digital archive DIGAR and its Estonian articles portal DEA: what it is, how to find machine-readable data, what data is available for research and for text and data mining, and how to use digital methods to analyse the dataset.

This video was created as a result of the international project Going beyond Search: advancing digital competencies in libraries and research communities.
Financial support: Nordplus
Project partners: National Library of Latvia, National Library of Estonia, Martynas Mazvydas National Library of Lithuania, Institute of Literature, Folklore and Art at the University of Latvia, and Humlab at Umeå University

ENB – Estonian National Bibliography

In this episode, we take a closer look at the Estonian National Bibliography, a repository of metadata on Estonian print output and online publications. This metadata is a valuable resource for researchers, giving an overview of developments in Estonian culture, language, people, society and economy from the 16th century onwards. We will learn where to find the machine-readable data of the bibliography and how to use digital methods to analyse this dataset.

This video was created as a result of the international project Going beyond Search: advancing digital competencies in libraries and research communities.
Financial support: Nordplus
Project partners: National Library of Latvia, National Library of Estonia, Martynas Mazvydas National Library of Lithuania, Institute of Literature, Folklore and Art at the University of Latvia, and Humlab at Umeå University
