IDRD Corpus details

We have completed the full research corpus by adding another 10 journals to the already compiled corpus of Global Environmental Change (GEC) research articles. As with GEC, only the main body of each paper has been included in the corpus. All abstracts, footnotes, boxes, references and appendices have therefore been excluded from the data set.

In total, the corpus consists of 11 journals and includes 11,462 journal articles, which amount to more than 53 million word tokens, making it one of the largest specialised corpora. In comparison, general language corpora such as the British National Corpus (BNC) and the Bank of English (BoE) consist of 100 and 450 million word tokens respectively. Thus, our corpus represents a significant body of both monodisciplinary and interdisciplinary discourse.
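
To put these figures in proportion, the short Python sketch below works out the average article length and the size of our corpus relative to the two general corpora. It uses only the numbers quoted above; since the token count is given as "more than 53 million", the results are rough lower-bound estimates, and the variable names are ours, chosen for illustration.

```python
# Figures quoted in the post. The token count is rounded down
# ("more than 53 million"), so everything derived from it is approximate.
corpus_tokens = 53_000_000
corpus_articles = 11_462
bnc_tokens = 100_000_000   # British National Corpus
boe_tokens = 450_000_000   # Bank of English

# Average length of the main body of an article, in word tokens.
tokens_per_article = corpus_tokens / corpus_articles

# Size of our corpus as a fraction of each general language corpus.
share_of_bnc = corpus_tokens / bnc_tokens
share_of_boe = corpus_tokens / boe_tokens

print(f"{tokens_per_article:.0f} word tokens per article on average")
print(f"{share_of_bnc:.0%} of the BNC, {share_of_boe:.0%} of the BoE")
```

In other words, a corpus drawn from just 11 journals is roughly half the size of the BNC.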


GEC corpus

The GEC corpus is finally complete! The corpus consists of 569 original research articles published in Global Environmental Change from 1990 to 2010. This amounts to 3.7 million word tokens and, although we are very happy with this achievement, it is only 1 of the 11 journals we are comparing in our analysis. The other 10 will comprise 5 discipline-specific and 5 other interdisciplinary journals. At the moment, the team at Elsevier are working hard on identifying these 10 journals, from which we'll compile the full corpus. We expect our final corpus to contain between 30 and 40 million word tokens, which for a corpus linguistic analysis is a massive amount of data.