{"id":1754,"date":"2022-10-09T13:00:37","date_gmt":"2022-10-09T10:00:37","guid":{"rendered":"https:\/\/digilab.rara.ee\/tooriistad\/tooriist-3\/"},"modified":"2025-08-28T18:21:22","modified_gmt":"2025-08-28T15:21:22","slug":"access-to-dea-texts","status":"publish","type":"tooriistad","link":"https:\/\/digilab.rara.ee\/en\/tools\/access-to-dea-texts\/","title":{"rendered":"Access to newspaper texts"},"content":{"rendered":"\n<div class=\"wp-block-uagb-tabs uagb-block-9cd69cf2 uagb-tabs__wrap uagb-tabs__hstyle2-desktop uagb-tabs__vstyle6-tablet uagb-tabs__stack1-mobile\" data-tab-active=\"0\"><ul class=\"uagb-tabs__panel uagb-tabs__align-left\" role=\"tablist\"><li class=\"uagb-tab uagb-tabs__active\" role=\"none\"><a href=\"#uagb-tabs__tab0\" class=\"uagb-tabs-list uagb-tabs__icon-position-left\" data-tab=\"0\" role=\"tab\"><div>Description<\/div><\/a><\/li><li class=\"uagb-tab \" role=\"none\"><a href=\"#uagb-tabs__tab1\" class=\"uagb-tabs-list uagb-tabs__icon-position-left\" data-tab=\"1\" role=\"tab\"><div>User guide<\/div><\/a><\/li><\/ul><div class=\"uagb-tabs__body-wrap\">\n<div class=\"wp-block-uagb-tabs-child uagb-tabs__body-container uagb-inner-tab-0\" aria-labelledby=\"uagb-tabs__tab0\">\n<p>Besides the <a href=\"https:\/\/dea.digar.ee\/?l=en\" target=\"_blank\" rel=\"noreferrer noopener\">DEA user interface<\/a>, direct access to the texts is sometimes needed. For this purpose, Digilab has adopted a <a href=\"https:\/\/ondemand.hpc.ut.ee\/\" target=\"_blank\" rel=\"noreferrer noopener\">separate environment<\/a>, which allows direct access to the raw texts. It enables working with R code, downloading data, saving analysis results to your computer, and sharing them with others. 
When using and reusing texts, it is essential to observe the license conditions.<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-uagb-tabs-child uagb-tabs__body-container uagb-inner-tab-1\" aria-labelledby=\"uagb-tabs__tab1\">\n<h2 class=\"wp-block-heading\">User guide<\/h2>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/data.digar.ee\/samples\/plots\/ajalehtedekogu1.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Figure 1. Content of the DEA collection. Open-access issues are shown in green; issues in red are accessible at an authorized workstation in the National Library of Estonia or under a special arrangement.<\/figcaption><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The Digitized Estonian Articles can be searched through the web interface at <a href=\"https:\/\/dea.digar.ee\/?l=en\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/dea.digar.ee\/?l=en<\/a> and are also accessible as a dataset. An overview of the dataset is available on a <a href=\"https:\/\/data.digar.ee\/text\/dea_info.html\" target=\"_blank\" rel=\"noreferrer noopener\">separate page<\/a> (in Estonian).<\/li>\n\n\n\n<li>The data can be accessed through the <a href=\"https:\/\/ondemand.hpc.ut.ee\/\" target=\"_blank\" rel=\"noreferrer noopener\">OnDemand<\/a> environment, where you can execute code using R.<\/li>\n\n\n\n<li>The OnDemand environment gives access to full texts and metadata, lets you run your own analyses, and lets you download your results. 
The data is open for everyone to use.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"kuidas-alustada\">Before you begin<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>To use the environment, please contact <a href=\"mailto:digilab@rara.ee\">digilab@rara.ee<\/a>.<\/li>\n\n\n\n<li>For convenient access to the dataset, an R package called <a href=\"https:\/\/github.com\/peeter-t2\/digar.txts\" target=\"_blank\" rel=\"noreferrer noopener\">digar.txts<\/a> has been created. It allows you to extract subsets from the entire collection and perform full-text searches.<\/li>\n\n\n\n<li>When processing data, you have the option to use your own code, rely on sample analyses, or extract search results in tabular format. This provides flexibility in analyzing the data according to your needs and preferences.<\/li>\n\n\n\n<li>Explore the sample analyses: <a href=\"https:\/\/data.digar.ee\/samples\/elekter_aur_hobu.html\" target=\"_blank\" rel=\"noreferrer noopener\">hobu, elekter, aur 20. saj algul<\/a> (<a href=\"https:\/\/data.digar.ee\/samples\/elekter_aur_hobu.html\" target=\"_blank\" rel=\"noreferrer noopener\">.html<\/a>, <a href=\"https:\/\/data.digar.ee\/samples\/elekter_aur_hobu.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">.ipynb<\/a>, <a href=\"https:\/\/data.digar.ee\/samples\/elekter_aur_hobu.Rmd\" target=\"_blank\" rel=\"noreferrer noopener\">.Rmd<\/a>) (in Estonian).<\/li>\n\n\n\n<li>Workshops: <a href=\"https:\/\/peetertinits.github.io\/gitbooks\/tekstid_R_2020\/\" target=\"_blank\" rel=\"noreferrer noopener\">Kevad 2020 eestikeelne l\u00fchikursus tekstit\u00f6\u00f6tlusest R-is<\/a> (in Estonian).<\/li>\n\n\n\n<li>Open science principles: <a href=\"https:\/\/digilab.rara.ee\/en\/blog\/how-to-make-a-reproducible-rara-digilab-study\/\" target=\"_blank\" rel=\"noreferrer noopener\">How to make a reproducible RaRa Digilab study<\/a>.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 
class=\"wp-block-heading\" id=\"pakett\">R package<\/h2>\n\n\n\n<p>Access to files is supported by the R package <a href=\"https:\/\/github.com\/peeter-t2\/digar.txts\" target=\"_blank\" rel=\"noreferrer noopener\">digar.txts<\/a>. With a few simple commands, it allows you to:<br>1) Get an overview of the dataset with file associations<br>2) Create subsets of the dataset<br>3) Perform text searches<br>4) Extract immediate context from search results<br><br>You can also save search results as a table and continue working with a smaller subset of data elsewhere.<\/p>\n\n\n\n<p>Here are the commands:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>get_digar_overview() \u2013 Retrieves an overview of the entire collection (at a high level)<\/li>\n\n\n\n<li>get_subset_meta() \u2013 Retrieves metadata for a subset of the data (at the article level)<\/li>\n\n\n\n<li>do_subset_search() \u2013 Performs a search within a subset and saves the results to a file (article by article)<\/li>\n\n\n\n<li>get_concordances() \u2013 Finds concordances in the search results, displaying the search term and its immediate context.<\/li>\n<\/ul>\n\n\n\n<p>These commands allow you to explore the collection, obtain specific subsets of data, conduct searches, and extract relevant information for further analysis.<\/p>\n\n\n\n<p>For intermediate processing, various R packages and commands are suitable. 
To process the data in Python, collect the data first and then create a new Python notebook.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"koodi-kasutamine\">Use of code<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li>First, load the required packages.<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>suppressPackageStartupMessages(library(tidyverse,lib.loc=\"\/gpfs\/space\/projects\/digar_txt\/R\/4.3\/\"))\nsuppressPackageStartupMessages(library(tidytext,lib.loc=\"\/gpfs\/space\/projects\/digar_txt\/R\/4.3\/\"))<\/code><\/pre>\n\n\n\n<ol start=\"2\" class=\"wp-block-list\">\n<li>Then load the digar.txts package:<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>suppressPackageStartupMessages(library(digar.txts,lib.loc=\"\/gpfs\/space\/projects\/digar_txt\/R\/4.3\/\"))<\/code><\/pre>\n\n\n\n<ol start=\"3\" class=\"wp-block-list\">\n<li>Use get_digar_overview() to get an overview of the collection (issue level).<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>all_issues &lt;- get_digar_overview()<\/code><\/pre>\n\n\n\n<ol start=\"4\" class=\"wp-block-list\">\n<li>Build a custom subset with any R tools. Here is a tidyverse-style example.<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>library(tidyverse)\nsubset &lt;- all_issues %&gt;%\n    filter(DocumentType==\"NEWSPAPER\") %&gt;%\n    filter(year&gt;1880&amp;year&lt;1940) %&gt;%\n    filter(keyid==\"postimeesew\")\n<\/code><\/pre>\n\n\n\n<ol start=\"5\" class=\"wp-block-list\">\n<li>Get metadata for that subset with get_subset_meta(). 
If you plan to reuse this information, it can be useful to store it in a file using the commented lines.<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>subset_meta &lt;- get_subset_meta(subset)\n#potentially write to file, for easier access if returning to it\n#readr::write_tsv(subset_meta,\"subset_meta_postimeesew1.tsv\")\n#subset_meta &lt;- readr::read_tsv(\"subset_meta_postimeesew1.tsv\")<\/code><\/pre>\n\n\n\n<ol start=\"6\" class=\"wp-block-list\">\n<li>Run a search with do_subset_search(). It writes the search results to a file and ignores case.<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>do_subset_search(searchterm=\"lurich\", searchfile=\"lurich1.txt\",subset)<\/code><\/pre>\n\n\n\n<ol start=\"7\" class=\"wp-block-list\">\n<li>Read the search results. Use any R tools. It's useful to name the id and text columns id and txt.<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code># fread() comes from the data.table package\ntexts &lt;- fread(\"lurich1.txt\",header=FALSE)&#91;,.(id=V1,txt=V2)]<\/code><\/pre>\n\n\n\n<ol start=\"8\" class=\"wp-block-list\">\n<li>Get concordances using the get_concordances() command.<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>concs &lt;- get_concordances(searchterm=\"&#91;Ll]urich\",texts=texts,before=30,after=30,txt=\"txt\",id=\"id\")<\/code><\/pre>\n\n\n\n<ol start=\"9\" class=\"wp-block-list\">\n<li>Note that many sources were not segmented into articles during digitization. For these, both metadata and text must be accessed at the page level, where the files are located in a different folder. The sequence for pages would be:<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>subset2 &lt;- all_issues %&gt;%\n    filter(DocumentType==\"NEWSPAPER\") %&gt;%\n    filter(year&gt;1951&amp;year&lt;2002) %&gt;%\n    filter(keyid==\"stockholmstid\")\n\n# subset2 from stockholmstid has 0 issues with section-level data but 2178 issues with page-level data, so pages should be used here. 
When combining page-level and section-level sources, custom combinations can be made based on the question at hand. Note that the page data also includes the section data when available, so using both at the same time can bias the results.\n# subset2 %&gt;% filter(sections_exist==T) %&gt;% nrow()\n# subset2 %&gt;% filter(pages_exist==T) %&gt;% nrow()\n\nsubset_meta2 &lt;- get_subset_meta(subset2, source=\"pages\")\n\ndo_subset_search(searchterm=\"eesti\", searchfile=\"eesti1.txt\",subset2, source=\"pages\")<\/code><\/pre>\n\n\n\n<p>Convenience tip: to make Ctrl+Shift+M insert %&gt;% in JupyterLab, as it does in RStudio, add this code under Settings -&gt; Advanced Settings Editor\u2026 -&gt; Keyboard Shortcuts, in the User Preferences box on the left.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>{\n    \"shortcuts\": &#91;\n         {\n            \"command\": \"notebook:replace-selection\",\n            \"selector\": \".jp-Notebook\",\n            \"keys\": &#91;\"Ctrl Shift M\"],\n            \"args\": {\"text\": \"%&gt;% \"}\n        }\n    ]\n}<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"lihtsamad-r-i-k\u00e4sud\">Basic R commands<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>&lt;- - assign a value to a variable<\/li>\n\n\n\n<li>%&gt;% - a 'pipe' that passes the output of one function to the input of the next<\/li>\n\n\n\n<li>filter() - keep the rows that match a condition<\/li>\n\n\n\n<li>count() - count the rows for each value<\/li>\n\n\n\n<li>mutate() - create or modify a column<\/li>\n\n\n\n<li>head(n) - show the first n 
lines<\/li>\n<\/ul>\n<\/div>\n<\/div><\/div>\n","protected":false},"featured_media":1201,"template":"","tooriista_tyyp":[],"class_list":["post-1754","tooriistad","type-tooriistad","status-publish","has-post-thumbnail","hentry"],"acf":[],"uagb_featured_image_src":{"full":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/02\/pexels-digital-buggu-167538-1.jpg",1920,1280,false],"thumbnail":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/02\/pexels-digital-buggu-167538-1-150x150.jpg",150,150,true],"medium":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/02\/pexels-digital-buggu-167538-1-300x200.jpg",300,200,true],"medium_large":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/02\/pexels-digital-buggu-167538-1-768x512.jpg",768,512,true],"large":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/02\/pexels-digital-buggu-167538-1-1024x683.jpg",1024,683,true],"1536x1536":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/02\/pexels-digital-buggu-167538-1-1536x1024.jpg",1536,1024,true],"2048x2048":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/02\/pexels-digital-buggu-167538-1.jpg",1920,1280,false]},"uagb_author_info":{"display_name":"Laura 
Nemvalts","author_link":"https:\/\/digilab.rara.ee\/en\/author\/"},"uagb_comment_info":0,"uagb_excerpt":null,"_links":{"self":[{"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/tooriistad\/1754","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/tooriistad"}],"about":[{"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/types\/tooriistad"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/media\/1201"}],"wp:attachment":[{"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/media?parent=1754"}],"wp:term":[{"taxonomy":"tooriista_tyyp","embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/tooriista_tyyp?post=1754"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}