Public corpus

The National Library of Estonia, in collaboration with the University of Tartu, has created a language corpus containing 526 million tokens.

The purpose of the corpus is to increase the availability of language data for linguistic research, language technology development, and the preservation and accessibility of cultural heritage.

The corpus contains metadata-annotated texts in Estonian and other Finno-Ugric languages, excluding Finnish and Hungarian. It therefore enables the study of both Estonian and smaller, lesser-studied related languages, and supports comparative and historical linguistic research. The texts are drawn from books, newspapers, journals, standards, and serials made available through the cultural heritage portal DIGAR. This range of sources covers many kinds of language use, enabling analysis across genres and contexts.

Because of its size, the corpus cannot be shared online. To request access, please contact digilab@rara.ee.

The language corpus was developed with the support of the Recovery and Resilience Facility, under component 3 "Digital state" and its reform "Creation and development of a centre of excellence for data management and open data".


    CONTACT

    National Library of Estonia
    Tõnismägi 2, 10122 Tallinn
    +372 630 7100
    info@rara.ee
    rara.ee/en
