Datasets - Digilab

01.03.2023 Big data, Digital archives, Libraries, Memory institutions

Read more about the possibilities of our data

Currently, Digilab primarily contains datasets built from the digital collections of the National Library of Estonia. Digilab also includes data from the Estonian Subject Thesaurus (EMS), and other cultural heritage datasets will be added in the future.

The datasets of the National Library of Estonia (RaRa) draw on three main sources: the digital archive DIGAR, the digitized Estonian Articles DEA, and the Estonian National Bibliography ENB. DIGAR contains various types of material, such as books, periodicals, maps, sheet music, and postcards, each of which can be accessed separately in Digilab. DEA consists mainly of newspaper texts but also covers more recent periodicals. ENB contains metadata about print publications published in or related to Estonia. The datasets in Digilab give direct access to the data behind the user interfaces of the digital collections. The datasets and the associated information are continually updated.

Estonian National Bibliography

The Estonian National Bibliography database (ENB; ERB in Estonian; www.ester.ee/search~S95*eng) registers data about national publications. National publications are all publications published in Estonia in any language and Estonian-language publications published abroad, including works by Estonian authors and their translations, regardless of physical format (paper or electronic). The principles for compiling the ENB are defined in the document "Principles of National Bibliography Compilation." The database is updated with new data at least once a week.

During registration, a detailed description is created for each publication from the information it contains. The description includes the title, the individuals and organizations responsible for the publication, publisher and printing house details, edition information, physical description (pages, dimensions, etc.), and any series the publication belongs to. In addition, search features such as keywords, subject indexes, and standardized forms of the names of related individuals and organizations are added to the description.

All data in the database adhere to international standards:

ISBD (International Standard Bibliographic Description) – for descriptive data;
AACR2 (Anglo-American Cataloguing Rules, 2nd edition) – for search features;
UDC (Universal Decimal Classification) – for subject indexes;
MARC21 – used as the data exchange format.

The open data in ENB is categorized into groups based on material types: books, periodicals (journals, newspapers, continuing resources), maps, scores, video recordings, audio recordings, image materials, and multimedia resources. In the case of books, the data is further divided into Estonian books and non-Estonian books.

The data of individuals and collectives in the Estonian National Bibliography is also separately available.

The National Library of Estonia's digital archive DIGAR

DIGAR (www.digar.ee/arhiiv/en) is the digital archive of the National Library of Estonia, providing access to publications stored in the digital archive. It includes e-books, newspapers, magazines, maps, sheet music, photos, postcards, posters, illustrations, audiobooks, and music files. The format of books and periodicals is mostly PDF or EPUB, while image materials are in JPEG format, and audio recordings are in WAV format.

Digitized Estonian Articles DEA

DIGAR Estonian Articles (dea.digar.ee/?l=en) provides access to born-digital and digitized newspapers published in Estonia throughout history, as well as Estonian-language publications from abroad. It includes newspapers, journals, and ongoing publications registered in the annual publication "Estonian National Bibliography. Periodicals" since 2017.

The portal allows users to browse publications, search for content within newspapers, read full-text articles, add keywords to articles, create lists of found articles, and send them via email. Users can also share discovered information on social networks and perform other actions.

Access is provided to newspapers published since 2014, journals and ongoing publications since 2017, and partially to older newspapers. The portal is updated daily. Older newspapers (1821-2013) are gradually added according to a conversion plan.


Public corpus

A corpus of public domain publications

16.04.2026 Big data, Books, Collaboration, Full text, Journals, Newspapers, Serials, Standards

Curated Estonian National Bibliography

ENB metadata optimised for data processing

29.11.2024 Books, Estonian National Bibliography, Metadata, Persons

Read about the creation of the curated Estonian National Bibliography in the article Curated Bibliographic Data: The Case of the Estonian National Bibliography, published in the Journal of Open Humanities Data.

The Estonian National Bibliography (ENB) collects information on works in Estonian, published in Estonia, created by Estonian authors, or about Estonia and Estonians. Details regarding the compilation of the ENB are described on the library's website (in Estonian).

The ENB metadata is published by publication type (books, newspapers, audio recordings, etc.) in MARC21XML format. MARC21 is a format used in libraries for cataloguing information; it is not intended for data processing or analysis. With the launch of the Digilab platform, it became possible to download ENB metadata in table format, containing the most commonly filled columns and with initial data cleaning applied. The ENB metadata sets can be found in the Digilab datasets section.

An open-source workflow has now been developed that converts the original MARC21XML files into curated datasets in table format. The curation consists of cleaning and customising the datasets according to user preferences and the needs of digital humanities research methods. Currently, the curation process focuses on the datasets for books and persons. More information on the creation of the curated ENB is available in the research article "Curated Bibliographic Data: The Case of the Estonian National Bibliography" and in the documentation of the workflow's source code. The datasets produced by the workflow have been uploaded as TSV files to the Zenodo repository.
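To give a feel for what converting MARC21XML into a table involves, here is a minimal sketch using only the Python standard library. It is not the ENB Curator workflow. The sample record is invented, and the choice of fields (001 for the identifier, 245 $a/$b for title and subtitle, 260 $c for the year) is an assumption based on general MARC 21 bibliographic conventions rather than on the column definitions of the curated datasets.

```python
import csv
import io
import xml.etree.ElementTree as ET

MARC_URI = "http://www.loc.gov/MARC21/slim"  # standard MARCXML namespace
MARC_NS = {"marc": MARC_URI}
COLUMNS = ["id", "title", "subtitle", "year"]  # hypothetical column names

# Invented sample record for illustration; not taken from the ENB.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">demo0001</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Näidisraamat :</subfield>
      <subfield code="b">alapealkiri</subfield>
    </datafield>
    <datafield tag="260" ind1=" " ind2=" ">
      <subfield code="a">Tallinn :</subfield>
      <subfield code="c">1935</subfield>
    </datafield>
  </record>
</collection>"""


def first_subfield(record, tag, code):
    """Return the text of the first matching subfield, or an empty string."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    node = record.find(path, MARC_NS)
    return (node.text or "").strip() if node is not None else ""


def records_to_rows(xml_text):
    """Flatten every MARCXML record into one dictionary per record."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(f"{{{MARC_URI}}}record"):
        control = record.find("marc:controlfield[@tag='001']", MARC_NS)
        rows.append({
            "id": (control.text or "") if control is not None else "",
            # Light cleaning: drop trailing ISBD punctuation from the title.
            "title": first_subfield(record, "245", "a").rstrip(" :/"),
            "subtitle": first_subfield(record, "245", "b"),
            "year": first_subfield(record, "260", "c"),
        })
    return rows


def rows_to_tsv(rows):
    """Serialise the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_tsv(records_to_rows(SAMPLE)))
```

Real curation involves far more than this (normalising names, dates, places, and keywords), which is why the published workflow and its documentation remain the reference.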

The source code of the ENB Curator workflow on GitHub: https://github.com/RaRa-digiLab/enb-curator

Curated ENB book metadata in the Zenodo repository:

TSV

Examples of information included in the dataset:

Book title and subtitle(s)
Contributors to the work
Publisher and printing press or printer
Year and decade of publication
Place of publication and its coordinates
Language information, including the original language of the book in the case of translated works
Subject and genre keywords
Keywords referring to geographical areas, time periods, organisations, or individuals
Book dimensions and page count
Digitisation status, including references to the digital version(s) of the book, if available

How to cite the dataset:

Kruusmaa, K., Tinits, P., & Nemvalts, L. (2025). Curated Estonian National Bibliography - books [Dataset]. National Library of Estonia. https://doi.org/10.5281/zenodo.14083326

Curated ENB persons metadata in the Zenodo repository:

TSV

Examples of information included in the dataset:

Person's name and its various spellings
Year of birth, and year of death, if applicable
Occupation
Gender
Geographical area of activity
Short biography
VIAF (Virtual International Authority File) identifier
Wikidata identifier

How to cite the dataset:

Kruusmaa, K., Tinits, P., & Nemvalts, L. (2025). Curated Estonian National Bibliography - persons [Dataset]. National Library of Estonia. https://doi.org/10.5281/zenodo.14094583

We encourage the exploration and analysis of the datasets! Any questions or suggestions can be sent to the email address digilab@rara.ee. We also welcome data stories based on the datasets for the Digilab blog.

The thumbnail introducing the curated ENB is from the National Archives photo collection (RA, EAA.2111.1.14967).

OCR Text Corrections

Text corrections of newspapers created as collaborative work in DEA

16.08.2024 DIGAR, Estonian, Newspapers, OCR

Over the years, nearly 500 users have contributed to correcting the texts in DIGAR, and as a result of their efforts, more than 800,000 lines have been corrected. This dataset contains a selection of pairs of original texts and their corrections.

The dataset, along with its documentation, can be found here: https://zenodo.org/records/13325713

The collection of text corrections in the DIGAR environment has been carried out through collaborative creation.

Preprocessing

The text corrections in the DIGAR archive are stored as change logs, so the original text had to be reconstructed by putting the original content back in place of the corrected parts. The texts are heavily filtered: only text correction pairs that meet the following criteria are included:

The corrected text contains at least 80% alphabetical characters.
The difference in length between the original texts and the corrected texts does not exceed 5%.
The relative Levenshtein distance between the two texts is at least 0.1.

These criteria are used to exclude texts that are partially edited, contain too many numbers, lists, or other non-alphabetical symbols, or where significant parts have been deleted or added (often to correct segmentation errors).
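The three criteria above can be sketched as a small filter. This is an illustration, not the code used to build the dataset. Several details are assumptions: whitespace is ignored when computing the share of alphabetical characters, the length difference is measured relative to the original text, and the relative Levenshtein distance is the edit distance divided by the length of the longer text. The function and parameter names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def alphabetic_share(text: str) -> float:
    """Share of alphabetical characters, ignoring whitespace (an assumption)."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch.isalpha() for ch in visible) / len(visible)


def keep_pair(original: str, corrected: str,
              min_alpha: float = 0.8,
              max_len_diff: float = 0.05,
              min_rel_distance: float = 0.1) -> bool:
    """Return True if the pair passes all three criteria listed above."""
    if not original or not corrected:
        return False
    if alphabetic_share(corrected) < min_alpha:
        return False
    if abs(len(original) - len(corrected)) / len(original) > max_len_diff:
        return False
    longest = max(len(original), len(corrected))
    return levenshtein(original, corrected) / longest >= min_rel_distance


if __name__ == "__main__":
    ocr = "Tnllinua liuuawalitfus teatab"
    fixed = "Tallinna linnawalitsus teatab"
    print(keep_pair(ocr, fixed))
```

A pair in which nothing changed has a distance of zero and is dropped, as is a pair whose corrected text consists mostly of digits.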

Quality Assessment

Since the corrections are the result of collaborative creation, they may contain errors and should not be considered the final truth. To provide a rough overview of the quality of the corrected texts, both the original and corrected texts have been processed through GPT-4o mini, which assigned them a readability score ranging from 1 to 5. The following prompt, which contains the rating scale, was used for this assessment:

The following is the OCR output from a digitized historical Estonian newspaper from {year}. Analyze the text placed after "TEXT" and decide if it is reasonably free of OCR errors. Return a rating on the scale of 1 to 5.

5 - The text is clear and readable. It may contain unusual spellings and use of punctuation throughout, but there are no distorted words.
4 - The text is readable, but contains some distortions of alphabetical characters. These distortions do not impede understanding the text at any given point.
3 - The text is readable with minor difficulties. Words and phrases may be noticeably distorted.
2 - The text is only readable with great difficulties. All or almost all sentences contain severe errors that make it very hard to understand.
1 - The text is unreadable. It contains mostly gibberish and random symbols, almost no words are recognizable.

If you are hesitating between 4 and 5, it is probably a 5. If you are hesitating between 2 and 3, it is probably a 2.

Note: the use of "w" instead of "v" and "=" instead of "-" are elements of historical orthography an do not count as errors.

Do not reply anything else than a number from 1 to 5, unless explicitly asked to do so.

TEXT:
{ocr_transcription}
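The prompt has two placeholders, {year} and {ocr_transcription}, and asks for a bare number in reply. A minimal sketch of filling the template and validating a reply is shown below. The template here is an abridged stand-in for the full prompt, no model is called, and the function names are hypothetical.

```python
import re

# Abridged stand-in for the prompt quoted above; only the placeholders matter.
PROMPT_TEMPLATE = (
    "The following is the OCR output from a digitized historical Estonian "
    "newspaper from {year}. Return a rating on the scale of 1 to 5.\n\n"
    "TEXT:\n{ocr_transcription}"
)


def build_prompt(year, ocr_transcription):
    """Fill the two placeholders of the template."""
    return PROMPT_TEMPLATE.format(year=year, ocr_transcription=ocr_transcription)


def parse_rating(reply):
    """Return the rating as an int, or None if the reply is not a bare 1-5."""
    match = re.fullmatch(r"\s*([1-5])\s*", reply)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(build_prompt(1905, "Tallinna linnawalitsus teatab"))
    print(parse_rating("4\n"))
```

Rejecting anything that is not a single digit from 1 to 5 keeps malformed replies out of the scores instead of guessing at their meaning.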

Books

Books metadata from ENB and DIGAR

01.03.2023 Books, Metadata, Standards

ENB
DIGAR

Estonian National Bibliography

This page discusses minimally cleaned and raw datasets. For data analysis, we recommend using the curated Estonian National Bibliography.

Books in Estonian

A subset of the Estonian National Bibliography containing the metadata of Estonian-language books. It is accessible:

In Zenodo repository as a TSV file:

Via the OAI-PMH protocol in MARC21XML format:

As a ZIP file in MARC21XML format:

Dataset's distribution in time, language and topics

How to cite the dataset:

National Library of Estonia. (2023). Estonian National Bibliography – books in Estonian [Dataset]. https://doi.org/10.5281/zenodo.8228805

Foreign language books

A subset of the Estonian National Bibliography containing the metadata of books in languages other than Estonian. It is accessible:

In Zenodo repository as a TSV file:

Via the OAI-PMH protocol in MARC21XML format:

As a ZIP file in MARC21XML format:

How to cite the dataset:

National Library of Estonia. (2023). Estonian National Bibliography – foreign language books [Dataset]. https://doi.org/10.5281/zenodo.8228821

Works in public domain

A subset of the Estonian National Bibliography containing the metadata of public domain works. It is accessible:

In Zenodo repository as a TSV file:

Via the OAI-PMH protocol in MARC21XML format:

As a ZIP file in MARC21XML format:

How to cite the dataset:

National Library of Estonia. (2023). Estonian National Bibliography – works in public domain [Dataset]. https://doi.org/10.5281/zenodo.8228830

National Library of Estonia digital archive's metadata

Books

The metadata of the books from DIGAR is accessible in XML format via the OAI-PMH protocol:

Standards

The metadata of the standards from DIGAR is accessible in XML format via the OAI-PMH protocol:
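Harvesting over OAI-PMH usually means issuing a ListRecords request and following resumption tokens until the server stops returning one. The sketch below shows that loop with the Python standard library. The base URL, metadata prefix, and set name are placeholders, since the actual endpoint addresses are not given here; only the protocol-level parts (the ListRecords verb, the metadataPrefix, set, and resumptionToken arguments, and the OAI-PMH 2.0 namespace) come from the OAI-PMH specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix=None, set_spec=None,
                     resumption_token=None):
    """Build a ListRecords URL; a resumption token excludes other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml):
    """Return the resumption token of a response page, or None on the last page."""
    root = ET.fromstring(response_xml)
    node = root.find("oai:ListRecords/oai:resumptionToken", OAI_NS)
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()


def _http_get(url):
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def harvest(base_url, metadata_prefix, set_spec=None, fetch=None):
    """Yield response pages one by one until no resumption token is left."""
    fetch = fetch or _http_get
    url = list_records_url(base_url, metadata_prefix, set_spec)
    while url:
        page = fetch(url)
        yield page
        token = next_token(page)
        url = list_records_url(base_url, resumption_token=token) if token else None


if __name__ == "__main__":
    # Placeholder values; replace them with the endpoint published by Digilab.
    print(list_records_url("https://example.org/oai", "marc21", "books"))
```

The fetch argument exists so that the loop can be tried against saved responses before it is pointed at a live server.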

Periodicals

Periodicals metadata from ENB, DIGAR and DEA

01.03.2023 Journals, Metadata, Serials

Graphics

Graphical objects metadata from ENB and DIGAR

01.03.2023 Maps, Metadata, Postcards, Posters

Sound

Sound recordings metadata from ENB and DIGAR

01.03.2023 Metadata, Sheet music, Sound recordings

Video

Video metadata from ENB

01.03.2023 Metadata, Videos

Persons and Organisations

People and collectives metadata from ENB

01.03.2023 Metadata, Organisations, Persons

Estonian National Bibliography

Metadata of all ENB subsets

01.03.2023 Estonian National Bibliography, Metadata

Estonian Legal Bibliography

Estonian Legal Bibliography metadata

01.03.2023 Digital archives

Parliamentarism

Parliamentarism collection's metadata

01.03.2023 Digital archives

Reproductions

Reproductions metadata

01.03.2023 Books, Digital archives, Newspapers

Presidents of Estonia

Presidents of Estonia collection's metadata

01.03.2023 Digital archives

Estonian Subject Thesaurus EMS

A thesaurus-structured keyword glossary covering all subjects

01.03.2023 Metadata, Subject indexing

The Estonian Subject Thesaurus is a universal controlled vocabulary in Estonian for indexing and searching various library material.
The official name of the thesaurus is "Eesti märksõnastik" and its official abbreviation is EMS. In English the thesaurus is called "Estonian Subject Thesaurus".

The subject terms from EMS are used
- in the online catalogue ESTER
- in the database of Estonian articles ISE
- in the union catalogue URRAM of the Estonian public libraries
- in various other catalogues and bibliographic databases of Estonia.

EMS includes about 61 000 preferred and nonpreferred terms.

The Estonian Subject Thesaurus is managed by the Estonian Libraries Network Consortium; areas of responsibility are divided between libraries.

Format: Machine-readable MARC21:
Request syntax: https://ems.elnet.ee/teenus.php?vorming=M&sona=[keyword]
Sample request: https://ems.elnet.ee/teenus.php?vorming=M&sona=kaubamärgid

Format: Human-readable MARC21:
Request syntax: https://ems.elnet.ee/teenus.php?vorming=i&sona=[keyword]
Sample request: https://ems.elnet.ee/teenus.php?vorming=i&sona=kaubamärgid

Format: MarcXML:
Request syntax: https://ems.elnet.ee/teenus.php?vorming=X&sona=[keyword]
Sample request: https://ems.elnet.ee/teenus.php?vorming=X&sona=kaubamärgid

Multiple-word subject terms (words joined with +):
Request syntax: https://ems.elnet.ee/teenus.php?vorming=I&sona=[keyword]+[keyword]
Sample request: https://ems.elnet.ee/teenus.php?vorming=I&sona=asutuste+arhiivid

All subject terms containing a string (truncation mark %):
Request syntax: https://ems.elnet.ee/teenus.php?vorming=I&sona=%[keyword]%
Sample request: https://ems.elnet.ee/teenus.php?vorming=I&sona=%arhiiv%

Format: Record Marc21 Authority in machine-readable format (see https://www.loc.gov/marc/specifications/):
Request syntax: https://ems.elnet.ee/id/[keyword ID]#marc21
Sample request: https://ems.elnet.ee/id/EMS007185#marc21

Format: Record Marc21 in human-readable format (Marc21-I):
Request syntax: https://ems.elnet.ee/id/[keyword ID]#marc
Sample request: https://ems.elnet.ee/id/EMS007185#marc

Format: Record in MarcXML format (see http://www.loc.gov/standards/marcxml//):
Request syntax: https://ems.elnet.ee/id/[keyword ID]#xml
Sample request: https://ems.elnet.ee/id/EMS007185#xml

If the ID is in an inappropriate format, the answer is “Terms not found”:
Sample request: https://ems.elnet.ee/id/midaiganes#marc21
If the ID is in an appropriate format but no such ID is found or the term has been deleted, the answer is 0:
Sample request: https://ems.elnet.ee/id/EMS999999#marc21

Format: Machine-readable Marc21:
Request syntax: https://ems.elnet.ee/teenus.php?id=[ID]&vorming=M
Sample request: https://ems.elnet.ee/teenus.php?id=EMS005160&vorming=M

Format: Human-readable Marc21:
Request syntax: https://ems.elnet.ee/teenus.php?id=[ID]&vorming=I
Sample request: https://ems.elnet.ee/teenus.php?id=EMS005160&vorming=I

Format: MarcXML:
Request syntax: http://ems.elnet.ee/teenus.php?id=[ID]&vorming=X
Sample request: http://ems.elnet.ee/teenus.php?id=EMS005160&vorming=X
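The request patterns above can be assembled programmatically. The sketch below only builds URLs and sends nothing. The helper names are hypothetical, and the ID pattern (EMS followed by six digits) is inferred from the sample IDs rather than documented. Note that standard URL encoding turns the truncation mark % into %25 and non-ASCII letters into UTF-8 escapes; that the service accepts this encoded form is an assumption.

```python
import re
from urllib.parse import urlencode

EMS_SERVICE = "https://ems.elnet.ee/teenus.php"
# Values of the "vorming" parameter as listed in the request syntax above.
FORMATS = {"machine": "M", "human": "I", "xml": "X"}
# Inferred from sample IDs such as EMS007185; not an official specification.
EMS_ID_PATTERN = re.compile(r"EMS\d{6}")


def keyword_url(term, fmt="human", contains=False):
    """URL for a keyword search; contains=True wraps the term in % marks."""
    if contains:
        term = f"%{term}%"
    # urlencode joins multi-word terms with "+" and escapes "%" and "ä".
    return f"{EMS_SERVICE}?{urlencode({'vorming': FORMATS[fmt], 'sona': term})}"


def id_url(ems_id, fmt="human"):
    """URL for fetching a single record by its EMS ID."""
    if not EMS_ID_PATTERN.fullmatch(ems_id):
        raise ValueError(f"not an EMS ID: {ems_id!r}")
    return f"{EMS_SERVICE}?{urlencode({'id': ems_id, 'vorming': FORMATS[fmt]})}"


if __name__ == "__main__":
    print(keyword_url("asutuste arhiivid"))
    print(keyword_url("arhiiv", contains=True))
    print(id_url("EMS005160", fmt="xml"))
```

Checking the ID locally avoids a round trip that would only return “Terms not found”.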

A field (except 00X fields) consists of the field number, two indicator positions and sub-fields for data content. An empty indicator position is marked by a slash. The sub-field sign consists of the sign $ and a letter or number. The most important elements are introduced below; read more at https://www.loc.gov/marc/authority/ecadhome.html

LDR leader, e.g. 00000nza2200000n%00

001 control number which is EMS ID, e.g. EMS167171

003 code of the issuer of the control number ErEMS

008 fixed-length field for various coded info, the first 6 places indicate the compilation time of the record yymmdd, e.g. 130823|n|anznnbabn||n|

040 compilation details of the record, do not vary: $aErEMS$best$cErEMS$fems

072 7 EMS subject field number where the term belongs, and EMS code, e.g. $a53$2ems

The 072 field can be repeated.

150 authorised thematic term, e.g. $ainfokeskkond

151 authorised geographic term, e.g. $aAbja-Paluoja

155 authorised form term, e.g. $aõigusaktid

450, 451, 455 nonpreferred terms (synonyms) for authorised subject terms, e.g. 450 $ainforuum; 451 $aAbja; 455 $anormatiivaktid

450, 451, 455 9 English-language equivalents, e.g. 450 9 $ainformation environment; 451 9 $aNarva river; 455 9 $alegal acts

550, 551, 555 related subject terms and their URLs in sub-field $0

$wg – broader term

$wh – narrower term

$w missing – other semantic connection

150 $aalalõualuu

450 $amandibula

450 9$amandible

550 $wg$alõualuud$0https://ems.elnet.ee/id/EMS029481

550 $wh$aalalõuapõnt$0https://ems.elnet.ee/id/EMS149978

550 $aalalõualiiges$0https://ems.elnet.ee/id/EMS147267

670 source, e.g. $aRegio Eesti Teede Atlas,Regio, 1998.

680 explanation with the sub-field symbol $i, e.g. $iIsikute, organisatsioonide ja süsteemide kogum, milles kogutakse, töödeldakse ja levitatakse infot. Hõlmab ka informatsiooni ennast.
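The field notation described above is regular enough to parse with a few lines of code. The following sketch splits a human-readable field line into tag, indicators, and sub-fields, and labels 550, 551, and 555 links as broader, narrower, or related according to $w. It assumes one field per line and that sub-field values never contain the $ sign; the function names are hypothetical.

```python
RELATION_BY_W = {"g": "broader", "h": "narrower"}  # missing $w means "related"


def parse_field(line):
    """Split e.g. '450 9$amandible' into ('450', '9', [('a', 'mandible')])."""
    head, _, rest = line.strip().partition("$")
    tag = head[:3]
    indicators = head[3:].strip()
    # Each chunk starts with the one-character sub-field code.
    subfields = [(part[0], part[1:]) for part in rest.split("$") if part]
    return tag, indicators, subfields


def related_terms(lines):
    """Return (relation, term, url) for every 550, 551, or 555 field."""
    result = []
    for line in lines:
        tag, _, subfields = parse_field(line)
        if tag not in {"550", "551", "555"}:
            continue
        values = dict(subfields)
        relation = RELATION_BY_W.get(values.get("w"), "related")
        result.append((relation, values.get("a"), values.get("0")))
    return result


if __name__ == "__main__":
    record = [
        "150 $aalalõualuu",
        "450 $amandibula",
        "450 9$amandible",
        "550 $wg$alõualuud$0https://ems.elnet.ee/id/EMS029481",
        "550 $wh$aalalõuapõnt$0https://ems.elnet.ee/id/EMS149978",
        "550 $aalalõualiiges$0https://ems.elnet.ee/id/EMS147267",
    ]
    for relation, term, url in related_terms(record):
        print(relation, term, url)
```

For bulk processing, the machine-readable MARC21 or MarcXML output is a safer starting point than this display format.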

The subject thesaurus is constantly updated and can be downloaded in both MARC21 and MARCXML formats.

Machine-readable MARC21: link

Human-readable MARC21: link

MarcXML: link

