The first task of the project is to prepare the data, namely all the original research articles published in the journal Global Environmental Change (GEC) between 1991 and 2010. As pre-2001 texts are available only as PDF scans, we first have to OCR the documents and strip the metadata from the resulting texts. For example, after scanning, footnotes broke up the main text and so affected our data. Unfortunately, the only way to eliminate these kinds of issues is to check all the texts manually and ensure the transcription is accurate. Furthermore, we are interested mainly in the main body of research articles rather than abstracts, footnotes, figures or tables, so we will mark up these elements with XML tags. This lets us analyse the main text on its own while still keeping that potentially important information.
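To give an idea of how such markup can be used, here is a minimal sketch in Python (not the project's own scripts, which are written in R). It assumes hypothetical tag names (`abstract`, `footnote`, `figure`, `table`) and a hypothetical helper `main_text`; the actual tag set we use may differ. The sketch skips the content of the marked-up elements but keeps the text that follows them, so a sentence interrupted by a footnote is stitched back together.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names for the elements we want to set aside.
EXCLUDED = {"abstract", "footnote", "figure", "table"}


def main_text(xml_string):
    """Return the main body text, skipping the content of excluded elements."""
    root = ET.fromstring(xml_string)
    parts = []

    def walk(elem):
        # Collect the element's own text and descend only if it is not excluded.
        if elem.tag not in EXCLUDED:
            if elem.text:
                parts.append(elem.text)
            for child in elem:
                walk(child)
        # Text after the closing tag belongs to the parent, so always keep it.
        if elem.tail:
            parts.append(elem.tail)

    walk(root)
    # Collapse runs of whitespace left behind by the removed elements.
    return " ".join("".join(parts).split())


# Made-up sample text, for illustration only.
sample = (
    "<article>"
    "<abstract>Short summary.</abstract>"
    "<p>Vulnerability is contested"
    "<footnote>A comment from a footnote.</footnote>"
    " in the literature.</p>"
    "</article>"
)

print(main_text(sample))
```

Because the tagged elements stay in the file, the same document can later be queried for footnotes or abstracts alone, without redoing the markup.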
Akira has developed R scripts to automatically adapt the XML files provided by Elsevier. For the PDF files, however, we are marking up the texts manually. This has allowed us to observe some interesting features of the GEC journal in the 1990s. For example, footnotes are used both for references and for comments, which we found quite peculiar, especially because the practice changes in later years. If you have noticed a similar practice in another journal, feel free to drop us a comment below!