Estonian culture can thrive in the age of artificial intelligence

Krister Kruusmaa

The National Library’s data scientist Krister Kruusmaa wrote an article in the newspaper Postimees highlighting the importance of data quality in the era of artificial intelligence and of investing in local capabilities for evaluating artificial intelligence. The article was inspired by the National Library’s Digital Memory seminar held in March this year.

Let's start with a quote. “Pearu is a neighbor of Andres Tõnisson, described as an honest but somewhat stubborn peasant boy. Andres and Pearu are good friends, but they often have disagreements and competition between them. For example, they both fight for the same piece of land, and Pearu also tries to win the affections of Andres Tõnisson's son Indrek's girlfriend. Pearu doesn't hide his jealousy and hatred towards Indrek because his own daughter Mari doesn't understand Indrek.”

This is how the sensational chatbot ChatGPT answers the question "Who is Pearu in Anton Hansen Tammsaare’s novel 'Truth and Justice'?" (both the prompt and the response were in Estonian). When asked about one of the most iconic characters in the most important novel of Estonian literature, the chatbot that has sparked fear about the future of humanity comes up with absolute gibberish and misinformation. Because artificial intelligence is seen as a mysterious and futuristic field, the public debate about it has revolved around dystopian visions familiar from the movie "Terminator," where future humans fight against killer robots for survival.

However, such a framing of the topic is not very reasonable. Although the development of artificial intelligence raises important philosophical questions that undoubtedly require answers, we must not forget more immediate problems that have an impact long before robots take over the world. The dissemination of AI solutions directly affects our language, culture, and identity.

First and foremost, it's about quantity and quality.

Artificial intelligence is not actually something new and, according to most definitions, has already influenced our lives for years. The current breakthrough does not lie in the "invention" of artificial intelligence but in the improvement and increased accessibility of certain types of solutions, which have unexpectedly entered people’s everyday lives.

ChatGPT and the language models of the GPT family behind it have become particularly sensational. Although the world of artificial intelligence is vast, language models hold a special place within it. Language has been considered the most important and inimitable part of human intelligence for centuries, so the rapid development of language models in the past couple of years has surprised even the brightest minds in the field.

Surprisingly, the current progress is not driven by any new scientific discoveries, as the architecture of language models based on neural networks has remained largely unchanged for nearly five years. The success has come in terms of quantity rather than quality – the training of models has involved significantly larger amounts of data and computational power.

Currently, only a few American companies, such as OpenAI, Microsoft, Google, and Meta, can muster the necessary concentration of resources. As expected, they also have the privilege of choosing the data on which the models are trained. According to language technology experts at the Institute of the Estonian Language, the fact that ChatGPT "knows" the Estonian language is simply a coincidental byproduct – the training data collected from the internet happened to contain enough Estonian material for the mammoth model to learn it to a satisfactory level.

Although ChatGPT's proficiency in the Estonian language is a fortunate gift, there is also cause for concern. The model's ability to generate correct and fluent Estonian is one thing; the knowledge embedded in the language is another. The above paragraph about Pearu is just an anecdotal example (and the model may produce better answers on other occasions), but it serves well to illustrate the problem.

ChatGPT's absurd understanding of Estonia’s most influential literary work indicates that the model fails to fulfill its most important function in the context of Estonian culture and way of life – presenting structured knowledge about the world. The training data simply lacks sufficient Estonian language and topic-specific data to generate reasonable outputs. This general observation holds true both for ChatGPT and the newer GPT-4.

How to Live with a Genie

But why should a chatbot like ChatGPT be familiar with Tammsaare? ChatGPT is currently the fastest-growing software application in history, with 100 million active users in the first two months, despite the fact that the official mobile version has not even been released yet.

The explosive spread of artificial intelligence is not surprising, considering that it gives informed users unprecedented access to knowledge and a boost in productivity. Calls to stop development and Italy's recent attempt to ban ChatGPT may receive media attention, but the genie cannot be put back in the bottle.

Instead of denying it, we must learn to adapt. Whether we like it or not, banning the use of generative language models in education, for example, will be akin to instructing people to use printed encyclopedias instead of Googling.

The concern should not be that students allow AI to write their essays, but that AI produces poor-quality essays. Silicon Valley developers, while dealing with global issues, fail to notice that their models literally generate misinformation about the essence of Estonian culture. In our current situation, we face several significant problems.

Firstly, Estonian users cannot fully benefit from AI – including in the job market. Secondly, AI will not help us solve local problems in the near future because existing models lack the necessary knowledge for that. Thirdly, the cultural bias of models like ChatGPT can be highly detrimental in the long run, steering people's interests away from topics on which the models give inaccurate answers.

For instance, diligent students find it easier to analyze Jane Austen or Ernest Hemingway than Tammsaare, about whom the information in the new “encyclopedia” is simply incorrect.

Quality data is crucial.

What should we do to address these problems, and can we make a difference? I propose three steps that could help Estonia and Estonians thrive in the age of AI.

Firstly, we need high-quality Estonian data. As representatives of a small nation and language, we cannot rely on others to discover our data; instead, we must take a proactive approach. As a small culture, we create a limited amount of data every day, so we need to think outside the box.

For example, we can leverage our existing Estonian cultural heritage, which provides necessary depth alongside modern, online information. While Estonia has already been exemplary in digitizing intellectual heritage, we need to increase the pace and treat the extraction of machine-readable text from digitized objects as a priority.

Additionally, we can generate textual data from other media, such as automatically transcribing radio broadcasts. Moreover, many large datasets are currently inaccessible due to passwords or paywalls but could be valuable as training data. Therefore, donating data could become a way of supporting our culture.

Secondly, we must enhance local expertise and development capabilities. In this regard, we are not doing poorly – Estonian researchers and companies working with AI and language technology have nothing to be ashamed of. We can soon expect initiatives similar to the one in Iceland, where GPT-4 was "taught" about Icelandic language and culture.

However, having a potential "Estonian GPT" in the future would not guarantee that people will embrace the domestic solution. As is customary in the digital world, users tend to gravitate toward major corporations that have an unbeatable advantage in terms of resources and marketing. Since we do not control these corporate applications, we must invest in local AI evaluation capabilities.

AI should hypothetically pass an exam

Large corporations are already extensively involved in testing the ethical biases of artificial intelligence, but they inevitably do so from their own subjective perspectives.

Only we ourselves can assess the suitability of models for our local cultural context and norms. To do this, a standardized methodology, a kind of exam, should be developed in collaboration between humanities scholars and computer scientists to evaluate the usability of a particular model, for example, in the Estonian education system or in governance.
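
To make the idea of such an exam concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration rather than an existing methodology: the names EXAM, grade and run_exam are invented for this example, the two sample questions are placeholders for items that experts would write, and simple keyword matching stands in for the much richer judgment a real exam would need.

```python
from typing import Callable

# Hypothetical exam items. Each question lists key facts that a
# reasonable answer should mention. A real exam would be written by
# humanities scholars and would be far larger.
EXAM = [
    {
        "question": "Who is Pearu in Tammsaare's novel 'Truth and Justice'?",
        "required": ["neighbor", "Vargamäe"],
    },
    {
        "question": "Who wrote the novel 'Truth and Justice'?",
        "required": ["Tammsaare"],
    },
]


def grade(answer: str, required: list[str]) -> float:
    """Return the share of required key facts found in the answer.

    Keyword matching is a crude assumption made for this sketch.
    """
    text = answer.lower()
    hits = sum(1 for fact in required if fact.lower() in text)
    return hits / len(required)


def run_exam(model: Callable[[str], str], exam: list[dict] = EXAM) -> float:
    """Ask the model every question and return its average score (0 to 1)."""
    scores = [grade(model(item["question"]), item["required"]) for item in exam]
    return sum(scores) / len(scores)
```

Any chatbot can be plugged in as the model argument, as long as it is wrapped in a function that takes a question and returns an answer as text. The same set of questions can then be put to several models and their scores compared.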

Thirdly, we need to recognize the importance of the issue and have the courage to make bold decisions. One does not need to be an oracle to see that the advent of artificial intelligence may become one of the greatest technological revolutions in human history. As a culture, we will soon face uncomfortable choices.

For example, we must decide how much we are willing to offer our data to third parties in order to maximize the benefits of available models for Estonian people. Here, we need to consider both data protection and copyright, but on a larger scale, this is an enduring dilemma for a small nation: whether to fiercely protect our culture and risk being isolated and forgotten or to be as open as possible, despite the resulting dangers.

In seeking a balance, we must not limit ourselves to merely retelling the global debate but dare to articulate our own perspectives. Estonians have previously taken risky steps whose fruits we now consider self-evident. Bringing our language and culture into the world of artificial intelligence and big data is comparable to fundamental tasks such as creating a written Estonian language or establishing national higher education.

In addition to the Estonian language and culture, we must also start thinking about the preservation of Estonian data throughout the ages – or rather, language, culture, and data will increasingly become one and the same in the future. Thanks to our advanced digital society, we can afford agile decisions to nurture and preserve them. Just as the spread of the internet and information technology propelled us forward 25 years ago, we must now seize this new turning point.

By acting boldly, we have the opportunity to use artificial intelligence to enhance our culture and society beyond what our small population of Estonian speakers would otherwise allow.


