{"id":3802,"date":"2024-08-16T11:32:13","date_gmt":"2024-08-16T08:32:13","guid":{"rendered":"https:\/\/digilab.rara.ee\/andmestikud\/ocr-tekstiparandused\/"},"modified":"2025-01-21T21:10:01","modified_gmt":"2025-01-21T18:10:01","slug":"ocr-tekstiparandused","status":"publish","type":"andmestikud","link":"https:\/\/digilab.rara.ee\/en\/datasets\/ocr-tekstiparandused\/","title":{"rendered":"OCR Text Corrections"},"content":{"rendered":"\n<p>Over the years, nearly 500 users have contributed to correcting the texts in DIGAR, and as a result of their efforts, more than <a rel=\"noreferrer noopener\" href=\"https:\/\/dea.digar.ee\/?a=p&amp;l=en&amp;p=textcorrecthome&amp;e=-------et-25--1--txt-txIN%7ctxTI%7ctxAU%7ctxTA-------------\" target=\"_blank\">800,000 lines<\/a> have been corrected. This dataset contains a selection of pairs of original texts and their corrections.<\/p>\n\n\n\n<p><strong>The dataset, along with its documentation, can be found here<\/strong>: <a href=\"https:\/\/zenodo.org\/records\/13325713\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/zenodo.org\/records\/13325713<\/a><\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"496\" src=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2024\/08\/pilt14_est.png\" alt=\"\" class=\"wp-image-3580\" srcset=\"https:\/\/digilab.rara.ee\/wp-content\/uploads\/2024\/08\/pilt14_est.png 1280w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2024\/08\/pilt14_est-300x116.png 300w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2024\/08\/pilt14_est-1024x397.png 1024w, https:\/\/digilab.rara.ee\/wp-content\/uploads\/2024\/08\/pilt14_est-768x298.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><figcaption>The collection of text corrections in the DIGAR environment has been carried out through collaborative creation.<\/figcaption><\/figure><\/div>\n\n\n<div class=\"wp-block-group\"><div 
class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\n<h2 class=\"wp-block-heading\">Preprocessing<\/h2>\n\n\n\n<p>The text corrections in the DIGAR archive are saved as change logs, so the original texts were reconstructed by reverting the logged changes, that is, by replacing the corrected parts with the original content. The texts are heavily filtered. Specifically, only text correction pairs that meet the following criteria are included:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>The corrected text contains at least 80% alphabetical characters.<\/li><li>The difference in length between the original text and the corrected text does not exceed 5%.<\/li><li>The relative Levenshtein distance between the two texts is at least 0.1.<\/li><\/ul>\n\n\n\n<p>These criteria are used to exclude texts that are only partially edited, that contain too many numbers, lists, or other non-alphabetical symbols, or in which significant parts have been deleted or added (often to correct segmentation errors).<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Quality Assessment<\/h2>\n\n\n\n<p>Since the corrections are the result of collaborative creation, they may contain errors and should not be treated as ground truth. To provide a rough overview of the quality of the corrected texts, both the original and corrected texts have been processed through GPT-4o mini, which assigned each of them a readability score from 1 to 5. The following prompt, which defines the rating scale, was used for this assessment:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>The following is the OCR output from a digitized historical Estonian newspaper from {year}. Analyze the text placed after \"TEXT\" and decide if it is reasonably free of OCR errors. Return a rating on the scale of 1 to 5.\n\n5 - The text is clear and readable. It may contain unusual spellings and use of punctuation throughout, but there are no distorted words.\n4 - The text is readable, but contains some distortions of alphabetical characters. 
These distortions do not impede understanding the text at any given point.\n3 - The text is readable with minor difficulties. Words and phrases may be noticeably distorted.\n2 - The text is only readable with great difficulties. All or almost all sentences contain severe errors that make it very hard to understand.\n1 - The text is unreadable. It contains mostly gibberish and random symbols, almost no words are recognizable.\n\nIf you are hesitating between 4 and 5, it is probably a 5. If you are hesitating between 2 and 3, it is probably a 2.\n\nNote: the use of \"w\" instead of \"v\" and \"=\" instead of \"-\" are elements of historical orthography an do not count as errors.\n\nDo not reply anything else than a number from 1 to 5, unless explicitly asked to do so.\n\nTEXT:\n{ocr_transcription}<\/code><\/pre>\n","protected":false},"featured_media":3583,"template":"","format":[],"meta":{"_acf_changed":true,"_uag_custom_page_level_css":"","_members_access_role":[],"_members_access_error":""},"postcategory":[],"copyright":[],"tags":[],"paritolu":[],"andmestiku_tyyp":[],"class_list":["post-3802","andmestikud","type-andmestikud","status-publish","has-post-thumbnail","hentry"],"acf":[],"uagb_featured_image_src":{"full":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2024\/08\/pilt14_est_small.png",358,160,false],"thumbnail":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2024\/08\/pilt14_est_small-150x150.png",150,150,true],"medium":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2024\/08\/pilt14_est_small-300x134.png",300,134,true],"medium_large":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2024\/08\/pilt14_est_small.png",358,160,false],"large":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2024\/08\/pilt14_est_small.png",358,160,false],"1536x1536":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2024\/08\/pilt14_est_small.png",358,160,false],"2048x2048":["https:\/\/digilab.rara.ee\/wp-content\/uploads\/2024\/08\/pilt14_est_small.png",358,160,false]},"uag
b_author_info":{"display_name":"Laura Nemvalts","author_link":"https:\/\/digilab.rara.ee\/en\/author\/"},"uagb_comment_info":0,"uagb_excerpt":"Over the years, nearly 500 users have contributed to correcting the texts in DIGAR, and as a result of their efforts, more than 800,000 lines have been corrected. This dataset contains a selection of pairs of original texts and their corrections. The dataset, along with its documentation, can be found here: https:\/\/zenodo.org\/records\/13325713 Preprocessing The text&hellip;","_links":{"self":[{"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/andmestikud\/3802","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/andmestikud"}],"about":[{"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/types\/andmestikud"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/media\/3583"}],"wp:attachment":[{"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/media?parent=3802"}],"wp:term":[{"taxonomy":"format","embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/format?post=3802"},{"taxonomy":"postcategory","embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/postcategory?post=3802"},{"taxonomy":"copyright","embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/copyright?post=3802"},{"taxonomy":"tags","embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/tags?post=3802"},{"taxonomy":"paritolu","embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/paritolu?post=3802"},{"taxonomy":"andmestiku_tyyp","embeddable":true,"href":"https:\/\/digilab.rara.ee\/en\/wp-json\/wp\/v2\/andmestiku_tyyp?post=3802"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}