{"id":5217,"date":"2026-04-16T13:45:07","date_gmt":"2026-04-16T10:45:07","guid":{"rendered":"https:\/\/digilab.rara.ee\/?post_type=andmestikud&#038;p=5217"},"modified":"2026-04-17T10:49:23","modified_gmt":"2026-04-17T07:49:23","slug":"public-corpus","status":"publish","type":"andmestikud","link":"https:\/\/digilab.rara.ee\/en\/datasets\/public-corpus\/","title":{"rendered":"Public corpus"},"content":{"rendered":"\n<p><strong>The National Library of Estonia, in collaboration with the University of Tartu, has created a language corpus&nbsp;containing&nbsp;526 million tokens.<\/strong><\/p>\n\n\n\n<p>The purpose of the corpus is to increase the availability of language data for linguistic&nbsp;research, language technology&nbsp;development,&nbsp;and the preservation and accessibility of cultural heritage.<\/p>\n\n\n\n<p>The corpus&nbsp;contains&nbsp;metadata-annotated texts in Estonian and other Finno-Ugric languages, excluding Finnish and Hungarian. As such, it enables the study of both Estonian and smaller, lesser-studied related languages, and supports comparative and historical linguistic research. The texts are sourced from a variety of books, newspapers, journals, standards, and serials made available through the cultural heritage portal DIGAR. This diverse range of sources covers a wide variety of language use, enabling analysis across genres and contexts.<\/p>\n\n\n\n<p>Due to the size of the corpus, it is not&nbsp;feasible&nbsp;to share it online; to request access, please contact&nbsp;<a href=\"mailto:digilab@rara.ee\" target=\"_blank\" rel=\"noreferrer noopener\">digilab@rara.ee<\/a>.<\/p>\n\n\n\n<p>The language corpus was developed with the support of the&nbsp;Recovery and Resilience Facility&nbsp;component&nbsp;3 \"Digital&nbsp;state\" reform \"Creation and development of a centre of excellence for data management and open data\".<\/p>\n\n\n\t\t\t\t\t<div\n\t\t\t\t\t\tclass=\"wp-block-uagb-image-gallery uagb-block-d598879e     \"\n\t\t\t\t\t\tstyle=\"\"\n\t\t\t\t\t>\n\t\t\t\t\t\t\t\t\t\t\t<div class=\"spectra-image-gallery spectra-image-gallery__layout--masonry spectra-image-gallery__layout--masonry-col-2 spectra-image-gallery__layout--masonry-col-tab-3 spectra-image-gallery__layout--masonry-col-mob-2\">\n\t\t\t\t\t\t\t\t\t\t\t<div class='spectra-image-gallery__media-wrapper--isotope'>\n\t\t\t\t\t\t\t<div class='spectra-image-gallery__media-wrapper' data-spectra-gallery-image-id='5192' tabindex=\"0\">\n\t\t\t\t\t\t\t<div class=\"spectra-image-gallery__media spectra-image-gallery__media--masonry\">\n\t\t\t\t<picture>\n\t\t\t\t\t<source media=\"(min-width: 1024px)\" srcset=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2026\/03\/NextGen_Rahastanud_EL_NextGeneration_EST_hor_color_RGB.jpg\">\n\t\t\t\t\t<source media=\"(min-width: 768px)\" srcset=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2026\/03\/NextGen_Rahastanud_EL_NextGeneration_EST_hor_color_RGB.jpg\">\n\t\t\t\t\t<img decoding=\"async\" class=\"spectra-image-gallery__media-thumbnail spectra-image-gallery__media-thumbnail--masonry\" src=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2026\/03\/NextGen_Rahastanud_EL_NextGeneration_EST_hor_color_RGB-300x168.jpg\" alt=\"\" loading=\"lazy\" \/>\n\t\t\t\t<\/picture>\n\t\t\t\t<div class=\"spectra-image-gallery__media-thumbnail-blurrer\"><\/div>\n\t\t\t\t\t\t\t\t\t\t\t<div class=\"spectra-image-gallery__media-thumbnail-caption-wrapper spectra-image-gallery__media-thumbnail-caption-wrapper--overlay\"><\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t<div class='spectra-image-gallery__control-lightbox' tabindex='0'>\n\t\t\t\t\t\t\t\t\t\t\t<div class=\"swiper spectra-image-gallery__control-lightbox--main\" dir=\"\">\n\t\t\t\t\t<div class=\"swiper-wrapper\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t<div class=\"swiper-slide\">\n\t\t\t\t\t\t\t\t<img class=\"swiper-lazy\" data-src=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2026\/03\/NextGen_Rahastanud_EL_NextGeneration_EST_hor_color_RGB.jpg\" alt=\"\"\/>\n\t\t\t\t\t\t\t\t<div class=\"swiper-lazy-preloader swiper-lazy-preloader-white\"><\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<div class=\"swiper-button-next\"><\/div>\n\t\t\t\t\t<div class=\"swiper-button-prev\"><\/div>\n\t\t\t\t<\/div>\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<button class='spectra-image-gallery__control-lightbox--close' aria-label=\"Close\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<svg xmlns=\"https:\/\/www.w3.org\/2000\/svg\" viewBox= \"0 0 320 512\"><path d=\"M310.6 361.4c12.5 12.5 12.5 32.75 0 45.25C304.4 412.9 296.2 416 288 416s-16.38-3.125-22.62-9.375L160 301.3L54.63 406.6C48.38 412.9 40.19 416 32 416S15.63 412.9 9.375 406.6c-12.5-12.5-12.5-32.75 0-45.25l105.4-105.4L9.375 150.6c-12.5-12.5-12.5-32.75 0-45.25s32.75-12.5 45.25 0L160 210.8l105.4-105.4c12.5-12.5 32.75-12.5 45.25 0s12.5 32.75 0 45.25l-105.4 105.4L310.6 361.4z\"><\/path><\/svg>\n\t\t\t\t\t\t\t\t\t\t\t\t<\/button>\n\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t","protected":false},"featured_media":5106,"template":"","format":[],"meta":{"_acf_changed":true,"_uag_custom_page_level_css":""},"postcategory":[],"copyright":[],"tags":[],"paritolu":[],"andmestiku_tyyp":[],"class_list":["post-5217","andmestikud","type-andmestikud","status-publish","has-post-thumbnail","hentry"],"acf":[],"uagb_featured_image_src":{"full":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2026\/03\/digilab_arvuti-2.png",1920,1080,false],"thumbnail":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2026\/03\/digilab_arvuti-2-150x150.png",150,150,true],"medium":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2026\/03\/digilab_arvuti-2-300x169.png",300,169,true],"medium_large":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2026\/03\/digilab_arvuti-2-768x432.png",768,432,true],"large":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2026\/03\/digilab_arvuti-2-1024x576.png",1024,576,true],"1536x1536":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2026\/03\/digilab_arvuti-2-1536x864.png",1536,864,true],"2048x2048":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2026\/03\/digilab_arvuti-2.png",1920,1080,false]},"uagb_author_info":{"display_name":"Laura Nemvalts","author_link":"https:\/\/digilab.rara.ee\/en\/author\/"},"uagb_comment_info":0,"uagb_excerpt":"The National Library of Estonia, in collaboration with the University of Tartu, has created a language corpus&nbsp;containing&nbsp;526 million tokens. The purpose of the corpus is to increase the availability of language data for linguistic&nbsp;research, language technology&nbsp;development,&nbsp;and the preservation and accessibility of cultural heritage. The corpus&nbsp;contains&nbsp;metadata-annotated texts in Estonian and other Finno-Ugric languages, excluding Finnish and&hellip;","_links":{"self":[{"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/andmestikud\/5217","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/andmestikud"}],"about":[{"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/types\/andmestikud"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/media\/5106"}],"wp:attachment":[{"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/media?parent=5217"}],"wp:term":[{"taxonomy":"format","embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/format?post=5217"},{"taxonomy":"postcategory","embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/postcategory?post=5217"},{"taxonomy":"copyright","embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/copyright?post=5217"},{"taxonomy":"tags","embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/tags?post=5217"},{"taxonomy":"paritolu","embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/paritolu?post=5217"},{"taxonomy":"andmestiku_tyyp","embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/andmestiku_tyyp?post=5217"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}