{"id":3797,"date":"2022-11-27T18:30:00","date_gmt":"2022-11-27T15:30:00","guid":{"rendered":"https:\/\/digilab.rara.ee\/blogi\/languages-in-dea-newspaper-collection-1850-1918\/"},"modified":"2025-03-16T13:41:37","modified_gmt":"2025-03-16T10:41:37","slug":"languages-in-dea-newspaper-collection-1850-1918","status":"publish","type":"blogi","link":"https:\/\/digilab.rara.ee\/en\/blog\/languages-in-dea-newspaper-collection-1850-1918\/","title":{"rendered":"Languages in DEA newspaper collection 1850-1918"},"content":{"rendered":"\n<p class=\"has-medium-font-size\"><strong>Summary<\/strong><\/p>\n\n\n\n<p>This blog post explores the languages inside the \u201cRussian\u201d part of the DIGAR Collection. The metadata for newspapers issued before 1918 shows a large number of Russian-language newspapers; however, both the page-level and the section-level data provide evidence that the newspapers were not monolingual and that part of the collection labeled as Russian is in fact in German, Estonian and Latvian. The analysis aims to show how the history of Russification in Estonia is reflected in digitized newspaper pages and how metadata issues may influence OCR quality.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>According to the metadata, a large part of the newspapers issued during the imperial period (before 1917) is marked as Russian-language. 
Counted in pages, more than 120 thousand pages are labeled as \u201cRussian\u201d (Language column), with the following distribution of pages over time:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-1024x731.png\" alt=\"\" class=\"wp-image-3114\" srcset=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-1024x731.png 1024w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-300x214.png 300w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-768x549.png 768w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>After retrieving the texts of these pages, it becomes obvious that the title-level language metadata is at least partly wrong: e.g.&nbsp;the earliest \u201cRussian\u201d pages in the collection, dated 1852, are mostly in German.<\/p>\n\n\n\n<p class=\"has-text-align-center\"><strong>text<\/strong><\/p>\n\n\n\n<div class=\"wp-block-group is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-ce155fab wp-block-group-is-layout-flex\">\n<p class=\"has-text-align-center\">1842 sich au\u00dfer der Stadtbefehlshaberschaft auch auf die Stadt Rostow (im Jekaterinoslawschen Gouvernement) erstreck. Der Procureur d\u2019er StadtbefeHlstzaberAchO\u00dfk. d) I n Taganrog. Im Ressort des Ministeriums des Innern: Die Taganrogsche Stadtpolizei, unter der Das Polizeiwesen nicht nur in der Stade Taganrog, sondern oud) m dem dazu geh\u00f6renden Kreise steht. Die Taganrogsche Stadt-Duma. Die Taganr<\/p>\n\n\n\n<p class=\"has-text-align-center\">0\u0442\u0434 \u0435\u043b\u044a \u0432\u0442\u043e\u0440\u043e\u0439. Zweite Abtheilung. \u0427\u0430\u0441\u0442\u044c \u043e\u0444\u0444\u0438\u0446\u0438\u0430\u043b\u044c\u043d\u0430\u044f. 
Officieller Theil. Durch Allerh\u00f6chsten Tagesbefehl im CimlRessort vom 19\u00bb October 1852, Nr. 269, sind best\u00e4tigt: Als Rigascher Ordnungsrichter Major von Tiesenhausen, der auch schon bei fr\u00fchem Wahlen dieses Amt bekleidete, und als D\u00f6rptscher der im Jahre 1836 aus der 11\u00bb ArtillerieBrigade als Stabs-Capitam entlassene von Oeningen. Als Adju<\/p>\n\n\n\n<p class=\"has-text-align-center\">Md. XIV. Ust. \u00fcber P\u00e4sse und L\u00e4uflinqe Art. 589. Nr. 82. \u2019 Diebstahl. Publication der Strafbestimmungen f\u00fcr denselben. Nr. 85 u. 86, lettisch und ehstnisch Nr. 91 u. 92. Dienstboten-B\u00fccher Nr. 54, lettisch und ehstnisch Nr. 55, deutsch, lett. und ehst. Nr. 56. Ausreichung derselben Nr. 61. Verbot aller R\u00fcgen und Bemerkungen \u00fcber die F\u00fchrung Nr. 78, 79, 81. Don au-F\u00fcrstent\u00fcmer, vide Manifest. Dorpa<\/p>\n<\/div>\n\n\n\n<div style=\"height:10px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p class=\"has-medium-font-size\"><strong>Language detection<\/strong><\/p>\n\n\n\n<p>For a new, rough language detection, the R package textcat was used. It takes a text as a string and computes a score for each candidate language, so that a new metadata column holds the single most probable language detected for each page.<\/p>\n\n\n\n<p>A better solution could be found, since many pages probably combine several languages on one page. However, this package is still suitable for a rough estimate of the proportions of the major languages in these pages.<\/p>\n\n\n\n<p>textcat language detection is not ideal: the full set of detected languages shows how low-quality OCR influences the results. A few pages were tagged with languages very unlikely to have been used in Estonian newspapers, such as Nepali or Indonesian. 
The language distribution for the languages appearing in at least 150 pages is the following:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>## \n##  estonian    german   latvian   russian ukrainian \n##       699     49328       128     74815      3102<\/code><\/pre>\n\n\n\n<p>However, even the \u201cUkrainian\u201d pages should be excluded from the analysis, since these pages are in fact badly OCR-ed pages in Cyrillic script. Randomly selected chunks show that these pages are mostly in Russian, but, being unable to check them all, I filtered these pages out as highly messy and unpredictable. This result is very disappointing: it shows not language diversity but issues inside tools trained on major languages, which carry an imperialistic assumption that Ukrainian is (on the character level) a messy Cyrillic which is not Russian (see an example below).<\/p>\n\n\n\n<p class=\"has-text-align-center\"><strong>text<\/strong><\/p>\n\n\n\n<div class=\"wp-block-group is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-ce155fab wp-block-group-is-layout-flex\">\n<p class=\"has-text-align-center\">4 \u0420 \u0435 \u0432 \u0435 \u043b \u044c \u043e x I \u043d fl I :i lotis. M 259 \u041e \u041a \u042a \u042f \u0412 \u041b \u041a \u041f I \u044f. \u043e\u0442\u044a \u044d\u0441\u0442\u043b\u044f\u043d\u0434\u0441\u043a\u043e\u0439 \u041a\u0410\u0417\u0415\u041d\u041d\u041e\u0419 \u041f\u0410\u041b\u0410\u0422\u042b \u041e\u0411\u042a\u042f\u0412\u041b\u0415\u041d\u0418\u0415. \u0423\u044a\u0437\u043b\u043d\u044b\u043d \u041a\u0430\u0437\u043d\u0430\u0447\u0435\u0439\u0441\u0442\u0432\u0430 \u0412\u0435\u0437\u0435\u043d\u0431\u0435\u0440\u0433\u0441\u043a\u043e\u0435. 
\u0412\u0435\u0439\u0441\u0435\u043d\u0448\u0442\u0435\u0439\u043d\u0441\u043a\u043e\u0435 \u0438 \u0413\u0430\u043f\u0441\u0430\u043b\u044c\u0441\u043a\u043e\u0435, \u042d\u0441\u0442\u043b\u044f\u043d\u0434\u0441\u043a\u043e\u0439 \u0433\u0443\u0431\u0435\u0440\u043d\u0448, \u0431\u0443\u0434\u0443\u0442\u044a \u043f\u0440\u043e\u0438\u0437\u0432\u043e\u0434\u0438\u0442\u044c \u0441\u044a 1-\u0433\u043e \u044f\u043d\u0432\u0430\u0440\u044f 1897 \u0433. \u0441\u043b\u042c\u0434\u0443\u044e\u0449\u0448 \u0431\u0430\u0439\u043a\u043e\u0432\u044b\u0439 oiiepuuiii: 1) \u0440\u0430\u0437\u043c\u044a\u043f\u044a \u0434\u0435\u043d\u0435\u0433\u044c \u043a\u0440\u0443\u043f\u043d\u044b\u0445\u044a \u043a\u0440\u0435\u0434\u0438\u0442\u043d\u044b\u0445!, \u0431\u0438\u043b\u0435\u0442\u043e\u0432\u044a \u043d\u0430 \u043c\u0435\u043b\u043a\u0430*, \u043c\u0435\u043b\u043a\u0438\u0445\u044c \u043d\u0430 \u043a\u0440\u0443\u043f\u043d\u044b\u0435 \u0438 \u0432\u0435\u0442\u0445\u043d\u0445\u044a \u043d\u0430 \u0433\u043e\u0434\u043d\u044b\u0435 \u043a\u044a \u043e\u0431\u0440\u0430\u0449\u0435\u0448\u044e; 2) \u043f\u043e\u043a\u0443\u043f\u043a\u0443 \u0431\u043d\u043b\u0435\u0442\u043e\u0432\u044a<\/p>\n\n\n\n<p class=\"has-text-align-center\">Ui urm paifUiitMB, t. \u0420\u043c\u0448, 87-y, urym 1900 ff.&nbsp;\u041f\u0438\u0432<em>11\u044f1\u0419\u0438|\u0446 \u25a0\u2014\u0443\u0443\u043f. \u0422\u0430\u043c\u0433\u0440\u0430\u0444\u042b (umt;.r\u00abaib\u00abaia Uua.-\u00abiB\u2019. 4 \u0420 \u0432 \u0432 \u0435 \u043b\u044c oxl \u0430 \u0418 tat \u043e<\/em>! \u0430<em>. \u00ab19: \u041e\u0411\u042a\u042f\u0412\u041b\u0415\u041d! \u042f. \u0423\u043f\u0440\u0430\u0438\u0435 ul\u2019 BajTlBcKol \u0438 \u0412\u043e\u043c\u043c-\u0422\u0438\u0436\u0433\u043a\u043e! 
\u0436\u043c\u042a\u0448\u044a\u0438\u044a \u0434\u043e\u0440\u043e\u0433\u044a &amp;gt;\u0442\u0440\u00bb \u0412\u0441\u043b\u0432 \u00bb\u00bb n n<\/em>p<em>un \u0442\u043e\u0440\u0433\u0438\u00bb nnrimiiM \u00ab \u00bbyxert \u0430\u0440.\u0445\u0445\u043e\u0436-\u00ab\u043e \u00abkau porno \u00bbkau. TO. \u00bba oceoaaaia jj 11 oaaaomuik MpaaaJV \u0431\u0443\u0434\u0435\u0442\u00bb \u0430\u0440\u043e\u0433\u0430\u0430\u0435\u0434\u0435\u00bb<\/em> ifrjiufuol \u00abtau \u0422\u041e\u00bb, pork - 14 \u0414\u0433\u0430\u0430\u0432\u0440\u0430 IMI r. i \u0431\u0430\u0433<\/p>\n\n\n\n<p class=\"has-text-align-center\">\u0420\u043c\u043c\u0438 F tr\u00ab \u0418 \u00bb\u00bbHlUi \u0411 \u042a \u041c R .1 \u0412 \u0412 I \u0413 \u041c 246. \u2022\u0430 j \u041e .\u2022\u0433\u0430 \u0418\u042f\u0418\u0418\u041d\u0414\u0419\u0419\u0418\u041a\u042f\u041a\u0419\u041a\u0438\u0446 \u0414\u0414\u0419\u0419\u0419\u0419\u0419\u0419 \u0414 \u0438 \u043c<em>\u00a3\u0414 \u042f KjlH \u0418 \u041d \u0414 \u0414 \u041d \u041d \u041d \u041d \u0418 \u0418 \u041d \u0426 U \u041a|\u0414 \u0426|\u0426 \u0418\u0414\u0418 \u041d.\u0419\u0414\u0419 \u0414\u0419 \u0414\u0419\u0419\u0419\u0419 \u044f \u041e\u0422\u041a\u0420\u042b\u0422\u0410 \u041f\u041e\u0414\u041f\u0418\u0421\u041a\u0410 \u0417\u0429\u0413 \u043d\u0430 1894 \u0433\u043e\u0434\u044a \u201c\u201cW&amp;gt;\u2019 \u043d\u0430 \u0435\u0433\u0435\u0434\u0438\u0435\u0432\u0432\u0443\u044e \u0433\u0430\u0437\u0435\u0442\u0443 \u043c\u042a\u0441\u0442\u043d\u044b\u0445\u044a \u0438\u043d\u0442\u0435\u0440\u0435\u0441\u043e\u0432\u044a, \u043b\u0438\u0442\u0435\u0440\u0430\u0442\u0443\u0440\u043d\u0443\u044e \u0438 \u043f\u043e\u043b\u0438\u0442\u0438\u0447\u0435\u0441\u043a\u0443\u044e \u201e\u0420\u0435\u0432\u0435jbCKifl l\u00fcitem\u201c \u041f\u043e\u0434\u043b. \u0446<\/em>\u043d\u0430: \u0441\u044a \u041f\u0435\u0440\u0435\u0441, \u0438 \u0434\u043e\u0441\u0442. 
\u043d\u0430 1 \u0433\u043e\u0434\u044a \u2026 &amp; \u0440\u0443\u0431. \u2014 \u043a\u043e\u043f. \u00bb \u00bb \u00bb \u0442 6 \u043c*\u0441\u044f\u0446\u0435\u0432\u044a . Q \u00bb \u2014 \u00bb \u00bb \u00bb \u00bb \u043d\u0430 3 \u00bb .<\/p>\n<\/div>\n\n\n\n<p>If we exclude the \u201cUkrainian\u201d pages from consideration, this is the language distribution obtained with textcat language detection:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-5-1024x731.png\" alt=\"\" class=\"wp-image-3124\" srcset=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-5-1024x731.png 1024w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-5-300x214.png 300w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-5-768x549.png 768w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-5.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>I believe this result is important, since it clearly shows that some of the \u201cRussian\u201d language labels in the metadata of 19th-century newspapers should be reassessed: these are newspapers in German.<\/p>\n\n\n\n<p>At the same time, I find the language switch from German to Russian in the late 1880s very convincing, since the 1880s are known as a period of Russification. 
Curiously enough, it happened very abruptly, so it might be interesting for future research to see whether the content of the newspapers that switched from German to Russian changed significantly.<\/p>\n\n\n\n<p>To add more detail, the language distribution does not depend on the place of publication, which in the overwhelming majority of cases is Riga or Tallinn (Revel):<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-2-1024x731.png\" alt=\"\" class=\"wp-image-3118\" srcset=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-2-1024x731.png 1024w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-2-300x214.png 300w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-2-768x549.png 768w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-2.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-4-1024x731.png\" alt=\"\" class=\"wp-image-3122\" srcset=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-4-1024x731.png 1024w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-4-300x214.png 300w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-4-768x549.png 768w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-4.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<div style=\"height:10px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p class=\"has-medium-font-size\"><strong>OCR accuracy &amp; languages<\/strong><\/p>\n\n\n\n<p>From a technical point of view, the languages (or scripts) hidden behind the \u201cRussian\u201d tag in metadata seem to be important for the OCR 
quality. If we plot OCR accuracy values by decade, it may seem that there is a decrease in the 1880s caused, as one might suppose, by Russification. This is, however, misleading, because the reason behind the decrease is the different OCR accuracy for German and Russian (or for Latin and Cyrillic scripts in general). The plot below displays how OCR quality is distributed according to the new language metadata; it is also visible that pages detected as \u201cUkrainian\u201d have very low OCR quality.<\/p>\n\n\n\n<p>To sum up, language reassessment on a more granular level (i.e.&nbsp;section) seems important and feasible: the detected languages could be added to the respective metadata entries and then grouped into multilingual labels for entire newspaper titles.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-7-1024x731.png\" alt=\"\" class=\"wp-image-3127\" srcset=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-7-1024x731.png 1024w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-7-300x214.png 300w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-7-768x549.png 768w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-7.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-8-1024x731.png\" alt=\"\" class=\"wp-image-3130\" srcset=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-8-1024x731.png 1024w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-8-300x214.png 300w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-8-768x549.png 768w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-8.png 1344w\" 
sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<div style=\"height:25px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h1 class=\"wp-block-heading has-large-font-size\">Language detection based on sections data<\/h1>\n\n\n\n<p><em>Report update: 10\/11\/2022<\/em><\/p>\n\n\n\n<p>The part above showed the difference between the title-level language metadata and the actual languages used in the governmental newspapers in the collection: while the newspapers\u2019 language is defined as \u201cRussian\u201d, the digitized materials allow us to conclude that the newspapers used predominantly German as a language of communication until the mid-1880s. The analysis was done on the page level, determining the proportion of languages used and labeling each page with its most probable language. The results might be further improved by using as input data not whole pages but newspaper segments, or <strong>sections<\/strong>, where each section will most probably be written in only one language.<\/p>\n\n\n\n<p>This update of the report presents the results of language detection on the <strong>section<\/strong> level. 
However, as the division into sections is currently available for only two newspapers, the results below cannot be directly compared to the analysis done on the full corpus (namely, sections are available only for the official newspapers of the Estonian and Livonian governorates, abbreviated in the dataset as ekmteataja and livzeitung respectively).<\/p>\n\n\n\n<p>The languages of the sections were detected with the textcat package, as in the page-level analysis.<br>The ten most common languages for sections are mostly the same as the ones detected on the page level; the probability of error is higher for small sections (in particular, those with fewer than 20 words).<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><thead><tr><th>language<\/th><th>n_sections<\/th><th>percentage<\/th><\/tr><\/thead><tbody><tr><td>russian<\/td><td>126937<\/td><td>56.84<\/td><\/tr><tr><td>german<\/td><td>83578<\/td><td>37.42<\/td><\/tr><tr><td>ukrainian<\/td><td>6257<\/td><td>2.80<\/td><\/tr><tr><td>english<\/td><td>1495<\/td><td>0.67<\/td><\/tr><tr><td>estonian<\/td><td>1299<\/td><td>0.58<\/td><\/tr><tr><td>bulgarian<\/td><td>1078<\/td><td>0.48<\/td><\/tr><tr><td>middle_frisian<\/td><td>886<\/td><td>0.40<\/td><\/tr><tr><td>rumantsch<\/td><td>305<\/td><td>0.14<\/td><\/tr><tr><td>latvian<\/td><td>290<\/td><td>0.13<\/td><\/tr><tr><td>belarus<\/td><td>276<\/td><td>0.12<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Languages unexpected for Baltic newspapers, such as English, Bulgarian and Rumantsch, turned out to be detection errors for short sections with low OCR quality. Unfortunately, as with page-level input, the Ukrainian and Belarusian labels are detection errors, most probably caused by poor OCR and\/or a lack of training data for detecting these languages. 
Randomly selected examples for each of the falsely detected languages are shown below.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><thead><tr><th>lang<\/th><th>text<\/th><\/tr><\/thead><tbody><tr><td>belarus-windows1251<\/td><td>Zi\u00fcMv-ZlonnW \u041d\u0430\u0440. \u0438\u0437\u0434. \u042f. \u0412\u0435\u0440\u043c \u0430\u043d\u0430. <br>\u0414\u043b\u044f \u043e\u0437\u043d\u0430\u043a\u043e\u043c\u043b\u0435\u0448\u044f \u043e\u0434\u043d\u0430 \u043a\u043c. \u0432\u044b\u0441\u044b\u044f. \u0431\u0435\u0430\u0430\u043b\u0430\u0442\u043d\u0432. \u0421\u044f\u0431.. \u0424\u043e\u043d\u0442\u0430\u043d\u043d\u0430 ?j.<\/td><\/tr><tr><td>bulgarian-iso8859_5<\/td><td>\u0421\u0442\u0430\u0432 \u041e\u0442\u043f\u0440\u0430\u0432\u043b\u0435\u0448\u044f. \u0446 i \u0432: \u041d\u0430\u0430\u043d\u0430\u0447\u0435\u043d1\u044f. \u043d\u0430\u043a\u043b\u0430\u0434\u043d\u044b\u0445^ <br>\u0412\u0440\u0435\u043c\u044f \u043e\u0442\u043f\u0440\u0430\u0432\u043b\u0435\u043d\u0438\u044f. \u041e\u0442\u043f\u0440\u0430\u0432\u0438\u0442\u0435\u043b\u044c. \u041f\u043e\u043b\u0443\u0447\u0430\u0442\u0435\u043b\u044c. \u0420\u043e\u0434\u044a \u0442\u043e\u0432\u0430\u0440\u0430. \u0412\u042a\u0441\u044a. II. \u0424. <br>\u0426\u0432\u0438\u043d\u0441\u043a\u044a \u041c\u043e\u0434\u043e\u043d\u044a 5043 15 \u0441\u0435\u043d\u0442\u044f\u0431\u0440\u044f \u0425\u0430\u0439\u043c\u044a \u0412\u0430\u043b\u041f\u0440\u0435\u0434\u044a\u044f\u0432\u0438\u0442. <br>\u0412\u0435\u0440\u0435\u0432\u043a\u0438 . . . 5 05 1903 \u0433. \u043b\u0435\u0440\u043d!\u0442\u0435\u0439\u043d\u044a \u0434\u0443\u0431\u043b\u0438\u043a\u0430\u0442\u0430 \u0428\u043f\u0430\u0433\u0430\u0442\u044c \u043f\u0435\u043d\u044c\u043a\u043e\u0432\u044b\u0439 \u2026. 2 20 <br>\u0420\u0438\u0433\u0430 \u0442\u043e\u0432. 
\u041c\u043e\u0434\u043e\u043d\u044a 92243 13 \u0441\u0435\u043d\u0442\u044f\u0431\u0440\u044f \u041e\u0442\u043e\u043a\u0430\u0440\u044a \u041f\u0440\u0435\u0434\u044a\u044f\u0432\u0438\u0442. <br>\u041a\u043e\u043d\u0434\u044f\u0442\u043e\u0440\u0435\u043c\u044f \u0442\u043e1903 \u0433. \u0412\u043e\u0439\u0442\u0430 \u0434\u0443\u0431\u043b\u0438\u043a\u0430\u0442\u0430 \u0432\u0430\u0440\u044a \u2026. 8 01 \u0420\u0438\u0433\u0430 \u0442\u043e\u0432. <br>\u041c\u043e\u0434\u043e\u043a\u044a 92617 13 \u0441\u0435\u043d\u0442\u044f\u0431\u0440\u044f \u0428\u0430\u0430\u0440\u044a \u0438\u041a\u0430\u041f\u0440\u0435\u0434\u044a\u044f\u0432\u0438\u0442. <br>\u0412\u043f\u043d\u043e \u0432\u0438\u043d\u043e\u0433\u0440\u0430\u0434\u043d\u043e\u0435 7 \u2014 1903 \u0433. \u0432\u043d\u0446\u0435\u043b\u044c \u0434\u0443\u0431\u043b\u0438\u043a\u0430\u0442\u0430 \u041f\u0440\u0430\u0432\u043b\u0435\u043d1\u0435 .<\/td><\/tr><tr><td>english<\/td><td>\u0427\u0410\u0421\u0422\u042c \u041e\u0424\u0424\u0418\u0426\u0418\u0410\u041b\u042c\u041d\u0410\u042f. \u041e\u0442\u0434\u0435\u043b\u044a \u043c\u0435\u0441\u0442\u043d\u044b\u0439. <br>Officieller Theil. Locale Abtheilung.<\/td><\/tr><tr><td>ukrainian-koi8_r<\/td><td>\u041f \u041c\u041c\u0426\u0428 \u0413\u041b\u0412\u042f\u041e\u041f* \u0422\u0428\u042e\u0413\u0420\u0410\u0424\u041c ( \u0417\u044f\u044c\u043c\u043e\u0438\u0433\u0445\u00bb) \u043f\u043e\u0441\u0442\u0443 \u043b\u0438\u0434\u0438 \u0432\u044a \u043f\u0440\u043e\u0434\u0430\u0436\u0443 \u043e\u0442\u0434\u042a\u0430\u044b\u0448\u043c\u0438 \u0431\u0440\u043e\u0448\u044e\u0440\u0430\u043c\u0438; <br>\u041e\u0411\u042f\u0417\u0410\u0422\u0415\u041b\u042c\u041d\u042b\u0419 \u041f\u041e\u0421\u0422\u0410\u041d\u041e\u0412\u041b\u0415\u041d\u042b \u0420\u0438\u0436\u0441\u043a\u043e\u0439 \u0413\u043e\u0440\u043e\u0434\u0441\u043a\u043e\u0439 \u0414\u0443\u043f\u044b. X. <br>\u041e \u0440\u0430\u0441\u043f\u043e\u0440\u044f\u0434\u043a! 
\u043d\u0430 \u0420\u0438\u0436\u0441\u043a\u0438\u0445\u044a \u0440\u044b\u043d\u043a\u0430\u0445\u0433. XX. <br>\u0414\u043b\u044f \u0437\u0430\u0432\u0435\u0434\u0435\u043d\u0438\u0439 \u0442\u0440\u0430\u043a\u0442\u0438\u0440\u043d\u0430\u0433\u043e \u043f\u0440\u043e\u043c\u044b\u0441\u043b\u0430 \u0432\u044a \u0433\u043e\u0440\u043e\u0434! \u0420\u0438\u0433!. \u0442\u0433\u0433. <br>\u041e \u0441\u043e\u0434\u0435\u0440\u0436\u0430\u043b\u0438 \u043f\u0438\u0435\u043d\u044b\u0445\u0433, \u043f\u043e\u0440\u0442\u0435\u0440\u043d\u044b\u0445\u044a \u0438 \u043f\u0438\u0442\u0435\u0439\u043d\u044b\u0445\u044a \u043b\u0430\u0432\u043e\u043d\u044a, <br>\u0430 \u0440\u0430\u0432\u043d\u043e \u0440\u0435\u043d\u0441\u043a\u043e\u0432\u044b\u0445\u044a \u043f\u043e\u0433\u0440\u0435\u0431\u043e\u0432\u044a \u0441\u044a \u0440\u0430\u0441\u043f\u0438\u0432\u043e\u0447\u043d\u043e\u044e <br>\u043f\u0440\u043e\u0434\u0430\u0436\u0435\u044e \u043a\u0440!\u043f\u043d\u0438\u0445\u0433 \u0438\u0430\u043f\u0438\u0442\u043a\u043e\u0432\u0433. l\u00fcim OO \u043a\u043e\u043f.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Nevertheless, textcat package output seems to be true for the sections labeled as Latvian or Estonian even for cases when several languages are mixed inside a section, e.g.:<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><thead><tr><th>lang<\/th><th>text<\/th><\/tr><\/thead><tbody><tr><td>estonian<\/td><td>Kulutus. Liiwlandi Kubbernemango Wattitsus on <br>Patenti l\u00e4bbi sest 16. Oktobrist 1862, Nr. 71 ja Liiwlandi Kubbernemango <br>Tseitungi l\u00e4bbi sest 11. Aprilist 1855, Nr. 
42 teada aimud, et <br>\u00fcksp\u00e4inis Priwati- rahwas sedda makso Liiwlandi Kubbernemango <br>Tseitungi eest Posti Kontori jures woiwad sissemaksta; <br>agga et keik kohtud ning seadusse ammetid, kes tahtwad <br>Kubbernemango Tseitungid piddcida, muudkui mitte m\u00f5isa- <br>egga watta-wallitsussed ja kihhelkonnakohtud peawad <br>maksorahhad taieste Kubbernemango Wallitsusse k\u00e4tte saatma, <br>ja et ja watta-wallitsussed ning kihhelkonna - <br>kohtud peawad sedda makso omma kohhalisse <br>Sittakohto jures sissemaksma. Et n\u00fc\u00fcd k\u00fcl se k\u00e4sk sai antud, siiski on <br>mitmel korral Liiwlandi Kubbernemango Wallitsussele teada <br>antud, et paljo m\u00f5isad, selle assemel, et nad peaksid <br>omma makso Sillakohta jures sissemaksma, saggedaste <br>agga iittestumnstanud, et nemmad tahtwad sedda makso <br>Liivlandi Kubbernemango Tseitungi w\u00e4ljaandja jures stssemaksta: <br>agga kui j\u00e4rrele n\u00f5utakse, kas on ka<\/td><\/tr><tr><td>latvian<\/td><td>!!Walmeer\u00e2!! Preeksch wifadeem godeem peedahwa grun, <br>tiguwihnu, konjaku. araku, ruviu. likeeruS un wisadus fchehlHinuS <br>no teem wisslaweliateem wihna pagrabcem un <br>spirtus destilaturahm par mehreenu zenu pee laipnigaS un <br>labas apdeenefchanas. Pee pllnigahm eepirkfchanahm ari <br>apfofu traukus bes kahdas atlihdsinafchanaS aisdot. <br>R. W. M\u00fcller, materialu- un perwju bode, wihna- ua <br>fpirtus kantoris. Bijusch\u00e4 meesneeka Zack k. nam\u00e4 Rr. 90. <br>P. Schilling\u2019\u00ab un jit\u00e4s grahm. bodes dabujama fchahda <br>eewehrojama grahmata: Kreew\u00ab - Tnrku kara-krouika 1877 <br>un 1878. Pehz kcna-prateju un wehsturigahm fi\u00abahm fastahdijis <br>LapaS Mahrti\u00abfch. Schis pilnigais pehdeja kara apraW tfnahks <br>10 lihds 12 burt\u00bb ntzkS ar flawenu generalu un diplomatu <br>bildehm (katra burtnize ir dtwi bildes). 8 burtnizeS Ir jau <br>zita\u00bb ifnahkS drihf\u00e4 laik\u00e4. 
Kurfch wisaS burtnizeS buhs pirziS, <br>dabuhs par prehmiju teetu j auku bildi \u201eKautinfch pee <br>Schipkas grawaS\u201d\u201d partikai 25 kap. (Tchi bilde roifu raafaff <br>1 rbl. 25 kap.) Tas no wiseem labp<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>When only the Russian, German, Estonian and Latvian sections are selected, the distribution of languages in sections is the following:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-1-1024x731.png\" alt=\"\" class=\"wp-image-3116\" srcset=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-1-1024x731.png 1024w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-1-300x214.png 300w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-1-768x549.png 768w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-1.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-3-1024x731.png\" alt=\"\" class=\"wp-image-3120\" srcset=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-3-1024x731.png 1024w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-3-300x214.png 300w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-3-768x549.png 768w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-3.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Number of sections and words in each language:<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><thead><tr><th>Language<\/th><th>Newspaper<\/th><th>N sections<\/th><th>N 
words<\/th><\/tr><\/thead><tbody><tr><td>estonian<\/td><td>ekmteataja<\/td><td>405<\/td><td>143706<\/td><\/tr><tr><td>estonian<\/td><td>livzeitung<\/td><td>894<\/td><td>780308<\/td><\/tr><tr><td>german<\/td><td>ekmteataja<\/td><td>28797<\/td><td>12644769<\/td><\/tr><tr><td>german<\/td><td>livzeitung<\/td><td>54781<\/td><td>30316848<\/td><\/tr><tr><td>latvian<\/td><td>livzeitung<\/td><td>290<\/td><td>180542<\/td><\/tr><tr><td>russian<\/td><td>ekmteataja<\/td><td>35204<\/td><td>23334143<\/td><\/tr><tr><td>russian<\/td><td>livzeitung<\/td><td>91733<\/td><td>36699193<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>As sections differ in length, counting sections (previous plot) might not reflect the actual proportions of the languages in the texts. This can be tested using the number of words in each section (LogicalSectionTextWordCount in the metadata). The main assumption is that each section was in most cases written in one language (thus the same method seems less applicable to pages). The resulting distribution is very similar to the section counts shown above:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-6-1024x731.png\" alt=\"\" class=\"wp-image-3126\" srcset=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-6-1024x731.png 1024w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-6-300x214.png 300w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-6-768x549.png 768w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-6.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>To compare the results with the page-level language detection, it is possible to select only ekmteataja and livzeitung from the results gathered previously (stored in data\/01_pages_meta_rus_lang.csv). 
The main shift from German to Russian and the overall proportions seem roughly the same as in the case of sections.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-9-1024x731.png\" alt=\"\" class=\"wp-image-3132\" srcset=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-9-1024x731.png 1024w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-9-300x214.png 300w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-9-768x549.png 768w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-9.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>However, the number of pages in German looks very different from the distribution of words in this language gathered from sections. This may be connected with the change in the newspapers\u2019 physical format, from smaller to bigger pages and from bigger to smaller font sizes, towards the end of the 19th century. 
The distribution of the number of words per page by decade supports this idea (there is no physical description of the pages in the metadata).<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-10-1024x731.png\" alt=\"\" class=\"wp-image-3134\" srcset=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-10-1024x731.png 1024w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-10-300x214.png 300w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-10-768x549.png 768w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-10.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The section-based analysis thus gave more reliable results, reflecting not only the language use in the newspapers, but also some physical properties of the newspapers, such as page or font size changes over time. 
The plot shows how the number of words per page changed in the second half of the 1860s and then remained more or less stable, so that the higher number of pages in German in the 1850s and 1860s, counterintuitively, reflects a smaller amount of text per page.<\/p>\n\n\n\n<p>Additionally, the dependence of OCR accuracy on language can be checked at the level of sections.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-22-1024x731.png\" alt=\"\" class=\"wp-image-3161\" srcset=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-22-1024x731.png 1024w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-22-300x214.png 300w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-22-768x549.png 768w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-22.png 1344w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<div style=\"height:25px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Cite the blog post:<br><em>Martynenko, Antonina 2022. Languages in DEA newspaper collection 1850-1918. Eesti Rahvusraamatukogu digilabori juhtumiuuringud. 
DOI 10.17605\/OSF.IO\/CPQ2W.<\/em><\/p>\n\n\n\n<p><em>Data and code are available here: https:\/\/doi.org\/10.17605\/OSF.IO\/<em><em>CPQ2W<\/em><\/em>.<\/em><\/p>\n\n\n\n<p><em><em>The study has been made as part of the EKKD72 project \"The usage possibilities of textual data in digital humanities case studies on the example of newspaper collections in Estonia (1850-2020)\".<\/em><\/em><\/p>\n","protected":false},"featured_media":3132,"template":"","blogipostituse_kategooria":[],"class_list":["post-3797","blogi","type-blogi","status-publish","has-post-thumbnail","hentry"],"acf":[],"uagb_featured_image_src":{"full":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-9.png",1344,960,false],"thumbnail":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-9-150x150.png",150,150,true],"medium":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-9-300x214.png",300,214,true],"medium_large":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-9-768x549.png",768,549,true],"large":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-9-1024x731.png",1024,731,true],"1536x1536":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-9.png",1344,960,false],"2048x2048":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2023\/11\/image-9.png",1344,960,false]},"uagb_author_info":{"display_name":"Laura Nemvalts","author_link":"https:\/\/digilab.rara.ee\/en\/author\/"},"uagb_comment_info":0,"uagb_excerpt":"Summary This blog post explores languages inside the \u201cRussian\u201d part of DIGAR Collection. 
The metadata for newspapers issued before 1918 shows large number of Russian-language newspapers, however both the page and section-level data provide an evidence that the newspapers were not monolingual and some part of the collection labeled as Russian is in fact in&hellip;","_links":{"self":[{"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/blogi\/3797","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/blogi"}],"about":[{"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/types\/blogi"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/media\/3132"}],"wp:attachment":[{"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/media?parent=3797"}],"wp:term":[{"taxonomy":"blogipostituse_kategooria","embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/blogipostituse_kategooria?post=3797"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}