Investigating Global Environmental Change: topic modelling (part one)


We have conducted a topic modelling analysis on our corpus of 11 academic journals and created a model with 100 potential ‘topics’. Topics in this sense are collections of words and do not necessarily represent content topics in the traditional sense, such as ‘environment protection’. Rather, these topics are groups of words that statistically tend to co-occur in the same paragraphs.
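
To make the idea of words co-occurring in the same paragraphs more concrete, here is a minimal Python sketch that counts how often pairs of words appear together in a paragraph. It is not the topic model used in the project: a topic model goes further and infers groups of words from patterns like these. The function name `cooccurrence` and the three sample paragraphs are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(paragraphs):
    """Count, for each pair of words, the number of paragraphs containing both."""
    counts = Counter()
    for paragraph in paragraphs:
        # Use each distinct word once per paragraph; sorting gives every pair
        # a single canonical order, e.g. ("adaptation", "climate").
        words = sorted(set(paragraph.lower().split()))
        counts.update(combinations(words, 2))
    return counts


# Invented sample paragraphs, not taken from the corpus.
paragraphs = [
    "climate policy adaptation",
    "climate adaptation risk",
    "water policy",
]

pairs = cooccurrence(paragraphs)
for pair, n in pairs.most_common(3):
    print(pair, n)
```

In this toy input, ‘adaptation’ and ‘climate’ share two paragraphs, so a model of this kind would be more likely to place them in the same group than ‘climate’ and ‘water’, which never appear together.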


IDRD Corpus details

We have completed the full research corpus by adding another 10 journals to the already compiled corpus of Global Environmental Change (GEC) research articles. As with GEC, only the main body of each paper has been included in the corpus. Thus, the contents of abstracts, footnotes, boxes, references and appendices have all been excluded from the data set.

In total, the corpus consists of 11 journals and includes 11,462 journal articles, which amount to more than 53 million word tokens, making it one of the largest specialised corpora. In comparison, general language corpora such as the British National Corpus (BNC) and the Bank of English (BoE) consist of 100 and 450 million word tokens respectively. Thus, our corpus represents a significant body of both monodisciplinary and interdisciplinary discourse.


IDRD project corpus complete!

This week we have completed the compilation of the full corpus (data set), which we will use in our study of interdisciplinary discourse. The corpus consists of the aforementioned GEC journal as well as 5 other interdisciplinary and 5 monodisciplinary journals.

Our partners Elsevier have provided us with the journals and contacted their editors to ensure they are happy to cooperate with us on the project. The journals included are: Agriculture, Ecosystems & Environment (AEE), Biosystems (B), Computers, Environment and Urban Systems (CEUS), Environmental Pollution (EP), Global Environmental Change (GEC), Journal of Rural Studies (JRS), Advances in Water Resources (AWR), Journal of Strategic Information Systems (JSIS), Plant Science (PS), Resource and Energy Economics (REE), and Transportation Research Part D: Transport and Environment (TRTE).

We would like to express our gratitude to Sarah Huggett and everyone else at Elsevier who has been involved in this process, as well as all the journal editors who agreed to participate in our research.

The details about the corpus are published in the following blog post.


First steps…

The first task of the project is to prepare the data, namely all the original research articles from the journal Global Environmental Change (GEC) published between 1991 and 2010. As pre-2001 texts are available only as PDF scans, the first step is to OCR the documents and strip metadata from the texts. For example, in the scanned documents footnotes break up the main text, which affects our data. Unfortunately, the only way to eliminate these kinds of issues is to check all the texts manually and ensure the transcription is accurate. Furthermore, we are interested mainly in the main body of research articles, rather than abstracts, footnotes, figures or tables, so we will mark up these elements with XML tags. This lets us analyse the main text on its own while keeping that potentially important information.
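
As a rough illustration of why marking up elements helps, the Python sketch below reads a marked-up article and returns only the main text, skipping the tagged elements without deleting them from the file. The tag names (`abstract`, `footnote`, `figure`, `table`), the function names and the sample article are all assumptions made for this example. The post does not specify the project's actual markup scheme, and this is not the R code used in the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names; the real markup scheme is not described in the post.
EXCLUDED = {"abstract", "footnote", "figure", "table"}


def main_text(element, excluded=EXCLUDED):
    """Collect the text of an element, skipping excluded child elements.

    The text that follows an excluded element (its "tail") is kept, so a
    footnote in the middle of a sentence does not break the sentence up.
    """
    parts = []
    if element.text:
        parts.append(element.text)
    for child in element:
        if child.tag not in excluded:
            parts.append(main_text(child, excluded))
        if child.tail:
            parts.append(child.tail)
    return "".join(parts)


def normalise_spaces(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


# Invented sample article for illustration only.
SAMPLE = """<article>
<abstract>Short summary.</abstract>
<body><p>Climate policy <footnote>An invented note.</footnote>has changed.</p></body>
</article>"""

root = ET.fromstring(SAMPLE)
body_only = normalise_spaces(main_text(root))
print(body_only)
```

Because the footnote stays in the XML and is only skipped when the main text is read, the same file can still be used later to study the footnotes themselves.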

Akira has developed R scripts to automatically adapt the XML files provided by Elsevier. For the PDF files, however, we are marking up the texts manually. This has allowed us to observe some interesting features of the GEC journal in the 1990s. For example, footnotes are used both for references and for comments, which we found quite peculiar, especially because this changes in later years. If you have noticed a similar practice in another journal, feel free to drop us a comment below!