IDRD Corpus details

We have completed the full research corpus by adding another 10 journals to the already compiled corpus of Global Environmental Change (GEC) research articles. As with GEC, only the main body of each paper has been included in the corpus. Thus, the contents of abstracts, footnotes, boxes, references and appendices have all been excluded from the data set.

In total, the corpus consists of 11 journals and includes 11,462 journal articles, which amount to more than 53 million word tokens, making it one of the largest specialised corpora. In comparison, general language corpora such as the British National Corpus (BNC) and the Bank of English (BoE) consist of 100 and 450 million word tokens respectively. Thus, our corpus represents a significant body of both monodisciplinary and interdisciplinary discourse.


IDRD project corpus complete!

This week we have completed the compilation of the full corpus (data set), which we will use in our study of interdisciplinary discourse. The corpus consists of the aforementioned GEC journal as well as 5 other interdisciplinary and 5 monodisciplinary journals.

Our partners Elsevier have provided us with the journals and contacted their editors to ensure they are happy to cooperate with us on the project. The journals included are: Agriculture, Ecosystems & Environment (AEE), Biosystems (B), Computers, Environment and Urban Systems (CEUS), Environmental Pollution (EP), Global Environmental Change (GEC), Journal of Rural Studies (JRS), Advances in Water Resources (AWR), Journal of Strategic Information Systems (JSIS), Plant Science (PS), Resource and Energy Economics (REE), and Transportation Research Part D: Transport and Environment (TRTE).

We would like to express our gratitude to Sarah Huggett and everyone else at Elsevier who has been involved in this process, as well as to all the journal editors who agreed to participate in our research.

The details about the corpus are published in the following blog post.


Linked with the Sketch Engine

Today we have established an official partnership with the Sketch Engine. Thanks to the efforts of Adam Kilgarriff and Dr Paul Thompson, the IDRD project is officially linked with the Sketch Engine, which will allow us to undertake a variety of different corpus investigations. The results of our preliminary analyses will be posted soon!


GEC corpus

The GEC corpus is finally complete! The corpus consists of 569 original research articles published in Global Environmental Change from 1990 to 2010. This amounts to 3.7 million word tokens and, although we are very happy with this achievement, this is only 1 of the 11 journals we are comparing in our analysis. The other 10 will be 5 discipline-specific and 5 other interdisciplinary journals. At the moment, the team at Elsevier are working hard on identifying these 10 journals, from which we’ll compile the full corpus. We expect our final corpus to be between 30 and 40 million word tokens, which for a corpus linguistic analysis is a massive amount of data.


First steps…

The first task of the project is to prepare the data, namely all the original research articles from the journal Global Environmental Change (GEC) published between 1991 and 2010. As pre-2001 texts are available only as PDF scans, the first step is to OCR the documents and clean the texts of metadata. For example, after scanning the documents, we found that footnotes broke up the main text, thus affecting our data. Unfortunately, the only way to eliminate these kinds of issues is to check all the texts manually and ensure the transcription is accurate. Furthermore, we are interested mainly in the main body of research articles, rather than abstracts, footnotes, figures or tables, so we will mark up these elements with XML tags in order to observe the main text while keeping that potentially important information.
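To give a rough idea of how this kind of markup can be used later on, here is a minimal Python sketch that reads a marked-up article and returns only the main text. It is an illustration only and not one of the project's own scripts. The tag names (abstract, footnote, figure, table) and the sample article are hypothetical, since the actual markup scheme is not described in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names: the project's actual markup scheme is not given here.
EXCLUDED_TAGS = {"abstract", "footnote", "figure", "table"}


def main_text(element):
    """Collect an element's text, skipping excluded children but keeping their tails."""
    parts = []
    if element.text:
        parts.append(element.text)
    for child in element:
        if child.tag not in EXCLUDED_TAGS:
            parts.append(main_text(child))
        # Text after a child's closing tag belongs to the surrounding main text,
        # so it is kept even when the child itself is skipped.
        if child.tail:
            parts.append(child.tail)
    return "".join(parts)


def extract_body(xml_string):
    """Parse a marked-up article and return its main text with whitespace collapsed."""
    root = ET.fromstring(xml_string)
    return " ".join(main_text(root).split())


# Invented sample article, for illustration only.
sample = """<article>
<abstract>Short summary.</abstract>
<body><p>Climate policy<footnote>See Smith 1992.</footnote> has changed.</p>
<table>1 2 3</table>
<p>Second paragraph.</p></body>
</article>"""

body = extract_body(sample)
print(body)  # Climate policy has changed. Second paragraph.
```

Because the footnote is only skipped at extraction time, it stays in the XML file and can still be studied separately, while the sentence it interrupted is read as one continuous piece of text.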

Akira has developed R scripts to automatically adapt the XML files provided by Elsevier. For the PDF files, however, we are marking up the texts manually. This has allowed us to observe some interesting features of the GEC journal in the 1990s. For example, footnotes are used both for references and for comments, which we found quite peculiar, especially because this changes in later years. If you have noticed a similar practice in some other journal, feel free to drop us a comment below!