Public corpus

The National Library of Estonia, in collaboration with the University of Tartu, has created a language corpus containing 526 million tokens.

The purpose of the corpus is to increase the availability of language data for linguistic research, language technology development, and the preservation and accessibility of cultural heritage.

The corpus contains metadata-annotated texts in Estonian and other Finno-Ugric languages, excluding Finnish and Hungarian. As such, it enables the study of both Estonian and smaller, lesser-studied related languages, and supports comparative and historical linguistic research. The texts are drawn from books, newspapers, journals, standards, and serials made available through the cultural heritage portal DIGAR. This range of sources covers many kinds of language use, enabling analysis across genres and contexts.

Because of its size, the corpus cannot be shared online. To request access, please contact digilab@rara.ee.

The language corpus was developed with the support of the Recovery and Resilience Facility, under component 3 "Digital state" and its reform "Creation and development of a centre of excellence for data management and open data".

