IDRD Corpus details

We have completed the full research corpus by adding another 10 journals to the already compiled corpus of Global Environmental Change (GEC) research articles. As with GEC, only the main body of each paper has been included in the corpus. Thus, all abstracts, footnotes, boxes, references and appendices have been excluded from the data set.

In total, the corpus consists of 11 journals and includes 11,462 journal articles, which amount to more than 53 million word tokens, making it one of the largest specialised corpora. By comparison, general language corpora such as the British National Corpus (BNC) and the Bank of English (BoE) consist of 100 and 450 million word tokens respectively. Thus, our corpus represents a significant body of both monodisciplinary and interdisciplinary discourse.

Although we are focused on comparing monodisciplinary (MDR) and interdisciplinary research (IDR) discourse, we are also including a diachronic dimension in our analysis, i.e. how IDR discourse changes over time. The figure below shows the number of papers published in all the monodisciplinary and interdisciplinary journals per year, as well as the total values.


The number of papers does not tell us much about their word length, i.e. the actual amount of textual data which we are going to analyse. As we can observe from the figure above, the number of papers published per year in each of these journals varies. Consequently, the amount of textual data representing each of these journals and volumes changes over time too. The figure below shows the total number of words published in each of the journals per year.


From the figure above, we can observe that the total size of our corpus exceeds 53 million words, but also that interdisciplinary discourse represents a larger proportion of the corpus. Nevertheless, the size of the MDR sub-corpus is large enough to represent monodisciplinary discourse, making the two sets comparable.

However, for the purpose of multidimensional analysis (MDA), the size of individual texts is somewhat more important than the size of the corpus. Namely, for the identification of dimensions, texts have to be at least 2,000 words long in order to ensure there is enough data to represent each particular text type. Although journal articles are usually longer than 2,000 words, that is not always the case for the papers included in our data set. The figure below shows the average length of papers across journals, as well as the number of papers that fall below the 2,000 word mark.


The solid line in the above figures represents the 2,000 word mark and shows the number of papers in each journal that fall below that mark. For the purpose of MDA, all papers below that mark have been excluded from this particular analysis. Luckily, in most of the journals more than 90% of the papers are above that mark.
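For readers who want to try a similar exclusion step on their own data, here is a minimal sketch in Python. It is not the code we used: the function names (`count_words`, `summarise`), the `(journal, text)` input format and the whitespace-based word count are all assumptions made for illustration, and a different tokeniser would give somewhat different counts.

```python
from collections import defaultdict

MIN_WORDS = 2000  # the 2,000 word mark used for MDA


def count_words(text):
    # Assumption: words are whitespace-separated tokens.
    # A proper tokeniser may count differently.
    return len(text.split())


def summarise(papers, min_words=MIN_WORDS):
    """Filter papers by length and summarise each journal.

    papers: iterable of (journal, text) pairs (hypothetical format).
    Returns (kept, stats), where kept holds the papers with at least
    min_words words and stats maps each journal to its paper count,
    mean paper length and the share of papers that were kept.
    """
    kept = []
    lengths = defaultdict(list)
    for journal, text in papers:
        n = count_words(text)
        lengths[journal].append(n)
        if n >= min_words:
            kept.append((journal, text))

    stats = {}
    for journal, ns in lengths.items():
        stats[journal] = {
            "papers": len(ns),
            "mean_length": sum(ns) / len(ns),
            "share_kept": sum(n >= min_words for n in ns) / len(ns),
        }
    return kept, stats
```

Calling `summarise` on a list of `(journal, text)` pairs returns the papers that pass the threshold together with per-journal figures of the kind plotted above (average length and the proportion of papers at or above the mark).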

The dotted line in the above figures represents the average length of papers in each of the journals. This value shows that in certain disciplines, papers are generally shorter or longer than in others. For example, papers published in the Journal of Rural Studies (JRS) and the Journal of Strategic Information Systems (JSIS) are on average much longer than papers in Plant Science (PS) or Environmental Pollution (EP).

To conclude, our corpus comprises a large amount of data representing research discourse. We have already started our analyses and will keep you updated on our progress on this blog. If you have any questions regarding our data or methods of analysis, please feel free to leave a comment or contact us directly.
