How to make a reproducible RaRa Digilab study

Peeter Tinits

24.02.2023

The Open Science Movement has recently grown to greatly influence the way research in the humanities and social sciences is being done. When dealing with computational studies – such as those done in digital humanities or computational social science – an ideal of reproducible research has been proposed: any study should strive to be as easily repeatable and transparent as possible. For computational studies, this means publishing the code and data alongside the research manuscript.

This can be hard work if done as an afterthought, but fairly easy if kept in mind during the project. Research articles often include the phrase "data is available upon [reasonable] request", written with every intention of honouring it. However, even with good intentions, software changes, hard drives crash, and other engagements stop people from doing this work. A recent study found that only 6% of researchers were actually able to provide the data they had promised to share on request (https://www.nature.com/articles/d41586-022-01692-1 & https://royalsocietypublishing.org/doi/10.1098/rsos.210450). In practice, it is fair to expect that if data is not published with the paper, it will be very difficult or impossible to gain access to the data later on – and even in the best case, it means a lot of additional work for the authors.

[Figure: graph from Michener et al. 1997]

We have compiled a small guide on how to structure your data so that it can be easily reused later, by you or anyone else, to reproduce the results and build on them in future research. In particular, we offer a way to make studies that use digitized materials at the National Library of Estonia reproducible and easy to share.

1 Collect your code and data

The first step in making the study easy to share and reproduce is simply to gather all the files in one place. In collaborations and longer projects, the conversation tends to be spread across different channels. Not everything needs to be preserved, but when you later need to reconstruct what was done for a colleague, the materials may prove difficult to find (or may have become inaccessible). To avoid this, gather and keep all the relevant files in one place so that they can be easily shared if needed. You may want to keep some files separate from the shared set, but then keep those in one place too, for in-group use. For online documents – such as Google Docs, spreadsheets, etc. – make sure you keep a link around, and make a safe copy if you are concluding the project for a while.

1.1 Structure the files

Commonly, when doing data-intensive research on digital collections, you will have some relevant data files and some relevant processing files. These files should be clearly structured and clearly named. For data files, we recommend any common file format, e.g. .csv, .tsv, .json, with a name that clearly describes its contents. Don't use 'data.csv', use 'sports_articles_metadata.csv'. For script files, comment your code and add explanations within the code that describe what you do and why. This is both for you and for your future readers – once the project is completed you are unlikely to remember well what you did, and with good comments you won't need to reconstruct these steps again. .Rmd or .ipynb files offer nice ways to organize a script into text and code blocks, but any format can work. The code files should also use clear names that indicate their contents. Don't use script1.R, script2.R, but use preprocess_data.R, analyse.R. If the code is run in a particular sequence, it may help to number the files so that they are always in order – e.g. 1_preprocess.R, 2_analyse.R.

1.2 Use a simple directory structure

It is helpful to use a simple standard directory structure for the files. One common way to structure data science projects is to use separate folders for 1) code files, 2) data files, 3) documentation and reports, 4) plots and visualizations. For larger projects, a more fine-grained structure can help (e.g. separate raw data and test data into separate directories). An example structure is given below, but details will depend on the particular project.

📦project
┣ 📂 code
┃ ┣ 📜 1_preprocess.Rmd
┃ ┗ 📜 2_analyse.Rmd
┣ 📂 data
┃ ┣ 📂 raw
┃ ┃ ┣ 📜 raw_corpus.txt
┃ ┃ ┗ 📜 raw_corpus_metadata.tsv
┃ ┣ 📂 processed
┃ ┃ ┣ 📜 testset_simple.tsv
┃ ┃ ┗ 📜 testset_detailed.tsv
┃ ┗ 📂 annotated
┃ ┃ ┣ 📜 testset_detailed_annots_UT.tsv
┃ ┃ ┗ 📜 testset_detailed_annots_CD.tsv
┣ 📂 docs
┃ ┣ 📜 logs.md
┃ ┣ 📜 blogpost.md
┃ ┗ 📜 article1.md
┣ 📂 plots
┃ ┣ 📜 results1.png
┃ ┗ 📜 results2.png
┣ 📜 LICENSE
┗ 📜 readme.md
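As a minimal sketch, a folder skeleton like the one above could be created with a few lines of Python. The folder names simply follow the example tree, and the function name `create_skeleton` is made up for this illustration; adjust both to your project.

```python
from pathlib import Path

# Folder names follow the example tree above; adjust them to your project.
SUBDIRS = ["code", "data/raw", "data/processed", "data/annotated", "docs", "plots"]


def create_skeleton(root):
    """Create the project folders and empty LICENSE / readme.md files."""
    root = Path(root)
    for sub in SUBDIRS:
        # parents=True also creates the project folder itself if it is missing
        (root / sub).mkdir(parents=True, exist_ok=True)
    for name in ["LICENSE", "readme.md"]:
        # touch() creates an empty file and leaves an existing one unchanged
        (root / name).touch(exist_ok=True)
    return root
```

Calling `create_skeleton("project")` is safe to repeat: existing folders and files are left as they are.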

Changing the directory structure will change your file paths, so make sure you keep them up to date. A good practice is to keep the working directory at the project root and refer to all files relative to it. In R, the package 'here' helps with this.
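For Python scripts, a similar effect can be approximated with the standard library. The sketch below is not a port of any R package: it assumes that the project root is the nearest parent folder containing a marker file (here `readme.md`, as in the example tree), and the function name `find_project_root` is hypothetical.

```python
from pathlib import Path


def find_project_root(start, marker="readme.md"):
    """Walk upwards from `start` until a folder containing `marker` is found."""
    start = Path(start).resolve()
    for folder in [start, *start.parents]:
        if (folder / marker).exists():
            return folder
    raise FileNotFoundError(f"No {marker} found in or above {start}")
```

A script in the code folder could then build its paths as `find_project_root(Path.cwd()) / "data" / "raw" / "raw_corpus_metadata.tsv"`, which keeps working no matter which subfolder the script is started from.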

1.3 Store and download the data

When working with the DIGAR text data, the same query will give the next user the same dataset only if the underlying data on the server has not changed. If the dataset is defined by general features (e.g. all Estonian newspapers in the 1920s) and new sources are later added to the collections, the same query will no longer reproduce your original result exactly.

For this reason, it is necessary to store the data returned by your query as a separate file alongside the code and other data. If you use only open data, you can simply include the raw data in the set. If you use data accessed under a contract, including the article ids or publication names can do part of the job.
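One possible way to do this in Python is sketched below. It assumes the query result is already in memory as a non-empty list of dictionaries. The column names (`article_id` and so on), the file names and both function names are invented for the illustration and are not part of any DIGAR interface.

```python
import csv
import json
from datetime import date
from pathlib import Path


def save_query_snapshot(rows, query, out_dir, name="query_result"):
    """Write query results to a TSV file plus a JSON note describing the query.

    `rows` must be a non-empty list of dicts that all share the same keys.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    data_path = out_dir / f"{name}.tsv"
    with open(data_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()), delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

    # Record what was asked and when, so the snapshot can be interpreted later.
    note = {"query": query, "retrieved_on": date.today().isoformat(), "n_rows": len(rows)}
    note_path = out_dir / f"{name}_query.json"
    note_path.write_text(json.dumps(note, indent=2, ensure_ascii=False), encoding="utf-8")
    return data_path, note_path


def save_ids_only(rows, out_path, id_field="article_id"):
    """For restricted data: store only the identifiers, one per line."""
    ids = [str(row[id_field]) for row in rows]
    Path(out_path).write_text("\n".join(ids) + "\n", encoding="utf-8")
```

Storing the retrieval date next to the data makes it possible to tell later whether a differing result is due to additions to the collection.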

1.4 Zip what you can

If your study uses large data files – e.g. queries across the large corpus – it may become unwieldy to publish them in raw format. However, they are easy to compress: raw text files, for example, can be zipped down to roughly 40% of their original size. Most data readers can decompress them on the fly without a noticeable delay in processing. See some options for R here – https://www.roelpeters.be/how-to-read-zip-files-in-r/ (e.g. the package vroom for tidyverse). An example code that also zips up the raw text files in an R notebook is available here.
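In Python, the same idea can be sketched with the standard library alone. Note that this example uses gzip (.gz) rather than .zip, since gzip files can be opened like ordinary text files; the function names are made up for the illustration.

```python
import csv
import gzip
import shutil
from pathlib import Path


def compress_file(path):
    """Write a gzip-compressed copy of `path` next to it and return its path."""
    path = Path(path)
    gz_path = path.with_name(path.name + ".gz")
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


def read_tsv_gz(gz_path):
    """Read a gzip-compressed TSV file directly, without unpacking it on disk."""
    with gzip.open(gz_path, "rt", encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))
```

If you work with pandas, `pandas.read_csv` can likewise read compressed files such as .gz or .zip directly.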

1.5 Save images as separate files

Usually when analysing data, the results are also presented as a visual overview. While there are many ways to get a visualization out of the code and into the manuscript (e.g. just exporting and pasting), the best way is to save it as an image file – e.g. .png, .svg, or .pdf, depending on which format you need. The file can then be linked into your manuscript, copied into a Word file, or, as is often the case, added directly by the editors. This lets the image keep the same parameters and size across all platforms, and preserve its shape when you make minor adjustments. If you use vector graphics (e.g. .svg, .pdf) you will also be able to zoom into the graph without losing quality.
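A small sketch of this in Python, assuming the third-party plotting library matplotlib is installed. The helper name `save_plot`, the file name and the plotted numbers are all invented for the illustration.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # render straight to files, no display window needed
import matplotlib.pyplot as plt


def save_plot(fig, out_dir, name, formats=("png", "svg")):
    """Save one figure in several formats with fixed size and resolution."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for ext in formats:
        path = out_dir / f"{name}.{ext}"
        # The output format is taken from the file extension.
        fig.savefig(path, dpi=300, bbox_inches="tight")
        paths.append(path)
    return paths


def make_example_figure():
    """Build a toy line plot; the numbers are invented placeholders."""
    fig, ax = plt.subplots(figsize=(6, 4))  # fixed size in inches
    ax.plot([1920, 1921, 1922], [10, 14, 9])
    ax.set_xlabel("Year")
    ax.set_ylabel("Number of articles")
    return fig
```

With the directory layout suggested earlier, `save_plot(make_example_figure(), "plots", "results1")` would write `results1.png` and `results1.svg` into the plots folder.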

1.6 Store intermediary steps, including everything you checked manually

Often your data analysis workflow will include steps where the results are manually checked and annotated, e.g. when you create a clean test set of bicycle advertisements or annotate sentences for their grammatical forms. These are important creative steps and decisions, and it is good to keep a record of them in case you need to retrace your steps or a colleague wants to follow up with another study. For transparent research, these should all be preserved along with the raw data and code. They may also serve as derived datasets that can be useful for future studies.

1.7 Rerun study

Once you have gathered all the files in one place and adjusted their locations and contents, it is important to check that your code still runs. It is much easier for you to troubleshoot any errors now than for a reader to do so later (and a reader is likely to give up quickly). Rerun the code files once more to make sure they work, even if you don't update any data files at this point.
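If the analysis scripts are numbered as suggested in section 1.1, the rerun can itself be scripted. The sketch below assumes Python scripts (e.g. `1_preprocess.py`, `2_analyse.py`, names chosen for the illustration) in a `code` folder, and runs them from the project root so that relative paths resolve as intended; `rerun_all` is a hypothetical helper name.

```python
import subprocess
import sys
from pathlib import Path


def rerun_all(project_root, code_dir="code", pattern="[0-9]*_*.py"):
    """Run every numbered script in sorted order and stop at the first failure."""
    project_root = Path(project_root).resolve()
    # Sorting is alphabetical, so with ten or more scripts use 01_, 02_, ...
    scripts = sorted((project_root / code_dir).glob(pattern))
    for script in scripts:
        print(f"Running {script.name} ...")
        # check=True raises CalledProcessError if the script exits with an error
        subprocess.run([sys.executable, str(script)], cwd=project_root, check=True)
    return [script.name for script in scripts]
```

Running this in a fresh copy of the project folder is a quick way to notice missing files or hard-coded paths before anyone else does.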

2 Store and document the contents

2.1 Store your materials in a safe storage space for preservation

There are a number of ways to share the files online, some more permanent than others. It is best to choose a storage location that is built to store scientific data – these usually have frameworks in place that ensure the long-term preservation and accessibility of the contents. Personal websites and file storage services like Dropbox or OneDrive tend to move, get updated, or change in structure, so that links that worked at the time of publishing will later be broken.

Some good ones used for cultural heritage are OSF and Zenodo, as well as the Estonian scientific infrastructure DataDOI. We highly recommend OSF, which has a framework and funding in place to preserve the data for 50 years. These services also give the repository a DOI – a permanent Digital Object Identifier that will continue to link to the repository even if its hosting website changes.

Some example repositories on OSF built on the DIGAR collections can be seen here: https://osf.io/hbfmy/.

2.2 Store the information on software versions used

The code used in scientific computing is subject to frequent updates and changes. Thus it may be important to know exactly which package versions were used to run your study. The latest version may well work fine and give identical results, but in some cases the commands used may become obsolete, or errors fixed within a package may change how something is computed.

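In R, it is possible to run sessionInfo() in the .Rmd file and copy the results, or to use the command packageVersion("packagename") for a single package. In Python, the session_info package offers an equivalent:

# session_info is a third-party package (pip install session-info)
import session_info
# prints the versions of Python and of the modules loaded in this session
session_info.show()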

For more advanced ways of recording package versions in R, have a look at the packrat package https://rstudio.github.io/packrat/. Some more ideas on storing this information, and on citing the packages used, are given here https://ropensci.org/blog/2021/11/16/how-to-cite-r-and-r-packages/.
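For Python projects, version information can also be written to a file with the standard library alone, so that it is stored next to the code rather than copied by hand. This is only a sketch: the function names and the output file name are assumptions, and you would pass in the packages your own study actually imports.

```python
import sys
from importlib import metadata
from pathlib import Path


def collect_versions(packages):
    """Return the Python version and the installed version of each named package."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def write_versions(packages, out_path):
    """Write one 'name version' line per entry, e.g. to docs/versions.txt."""
    lines = [f"{name} {version}" for name, version in collect_versions(packages).items()]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

For example, `write_versions(["pandas", "matplotlib"], "docs/versions.txt")` would record the versions of those two packages together with the Python version.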

2.3 License

To be clear about what future users can do with your code and data, you need to include a license file in your scientific repository. This license file usually contains the text of a standard software license (e.g. MIT or GPL), and/or a license for textual materials (e.g. CC-BY). Get to know the different licenses here: https://help.osf.io/article/148-licensing or here: https://help.figshare.com/article/what-is-the-most-appropriate-licence-for-my-data or here: https://www.kent.ac.uk/guides/open-access-at-kent/choose-a-licence-for-your-research-works. If you do not specify a license, a diligent user must conclude that the materials are not open for use and that all rights are reserved by the author. In scientific computing, this may mean that the potential user will move on and not use your data, just in case.

3 Don't wait until the end of the project

These steps can be very easy to follow if done during the project, but quite hard work if left until afterwards – which is often the reason they never get done at all: there is rarely time for this once the project has concluded. This in turn means that your code and data will likely not be "available upon [reasonable] request", even if you meant to allow this.

3.1 Keep your code ready at all intermediate steps

Most first drafts never get updated and cleaned up, because the work needed to do so grows as time passes from when the code was written. Sometimes cleaning up your code needs outside input that becomes impossible to get, or the project member who wrote a part of it is not available for comment at the time you need them. So clean up and document your code at every intermediate step of the work, whenever something worth preserving is completed. Do not wait until the end of the project.

Quotes from Minocher et al. 2021 https://royalsocietypublishing.org/doi/10.1098/rsos.210450#d1e1219

3.2 Develop the materials with an end goal in mind

When working on scientific projects, a clear and shared understanding of authorship and of what the project aims to achieve greatly reduces the tensions that can arise from differing ideas. Does the project aim to have open data? Then open data practices are very helpful to use from the beginning. Does the project involve a computational aspect and a theoretical aspect done by different people? It may be useful to clarify the order of authors and the intended publication venues, as authors with different backgrounds may have different ideas here (e.g. some disciplines use alphabetical authorship, some strongly value first authorship, some allow for "equal authorship", and some even suggest more complex frameworks, e.g. CRediT https://credit.niso.org/). As tasks emerge during the project, it will then be clearer who should do them and to what end.

3.3 Keep contact with the intended audience

You may be doing this research for a particular discipline or societal group, and they may be interested in particular aspects of your data. Computational researchers may want to know about statistical distributions in your data, social scientists may want to know whether your data is representative for the question at hand, and humanities scholars may want to know if this is really new information. Keep these expectations in mind when developing the project and make sure these bases are covered. Also think about whether and how you may want to share your work with the more general public. E.g. if you use visual materials, then it may help to create specific ones for popular science purposes, and the easiest time to do that is during the project.