Paul Thompson is an invited plenary speaker at the launch event of the Corpus Linguistics in Scotland (CLiS) Network, which will take place on Friday, January 23rd 2015 at Moray House School of Education, University of Edinburgh, from 2.30 to 5.30pm. Paul’s talk is entitled ‘Writing between disciplines: corpus approaches to interdisciplinary research discourse’, and he will provide an overview of the project and our latest analyses.
In the previous post we presented one part of our topic modelling exercise, in which we investigated which of the 100 identified topics were represented in the journal Global Environmental Change (GEC) more than in any other journal. These topics represent the distinctive aspects of the journal, and analysing their distribution over the years will allow us to investigate whether and how the journal has changed. By correlating the results of our labelling exercise and multi-dimensional analysis, we can observe in which types of papers and contexts these topics are usually discussed. More importantly, as these topics are computationally derived groups of words, identifying the contexts in which they occur will allow us to interpret the topics themselves.
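The selection of topics that are "represented in GEC more than any other journal" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `distinctive_topics`, the journal names other than GEC, and all the numbers are invented for the example, and the mean share of a topic per journal is just one plausible way of measuring how strongly a topic is represented.

```python
def distinctive_topics(mean_share, journal):
    """Return the topics whose mean share is highest in `journal`.

    `mean_share` maps topic -> {journal name -> mean topic share}.
    """
    selected = []
    for topic, shares in mean_share.items():
        # Journal in which this topic has its largest mean share.
        top_journal = max(shares, key=shares.get)
        if top_journal == journal:
            selected.append(topic)
    return sorted(selected)


# Invented toy values, NOT results from the project.
mean_share = {
    "topic_03": {"GEC": 0.031, "Journal B": 0.012, "Journal C": 0.009},
    "topic_17": {"GEC": 0.004, "Journal B": 0.027, "Journal C": 0.010},
    "topic_42": {"GEC": 0.019, "Journal B": 0.018, "Journal C": 0.002},
}

gec_topics = distinctive_topics(mean_share, "GEC")
print(gec_topics)
```

Tracking the yearly share of the selected topics would then be a matter of repeating the same averaging per publication year.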
We have conducted a topic modelling analysis on our corpus of 11 academic journals and created a model with 100 potential ‘topics’. Topics in this sense are collections of words and do not necessarily represent content topics in the traditional sense, such as ‘environment protection’. Rather, these topics are groups of words that statistically tend to co-occur in the same paragraphs.
Paper labelling was an exercise in which we categorised papers published in the journal Global Environmental Change (GEC) in order to gain further insights into the types of papers published in the journal and to aid the interpretation of the other analyses we employed. The papers were labelled, or categorised, using a bottom-up approach in which the categories were established by reading a number of papers from the journal.
Initially we developed a system with seven categories, but these proved either too specific or too closely related, resulting in poor agreement when we categorised papers independently. The number of categories was therefore reduced and their descriptions broadened. The final set included four categories: 1) Empirical, 2) Policy discussion, 3) Research agenda and Research framework and 4) Other papers. Using this framework, two researchers achieved a reasonable agreement rate (76.6%).
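As a rough illustration of how an agreement rate like this can be computed, the sketch below takes two coders' labels for the same papers and returns the share of papers on which they agree. The labels are invented toy data, not the project's, and the function names are hypothetical. Cohen's kappa is included only as a common chance-corrected companion measure; the post does not say that it was used.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Share of items on which two coders assigned the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each coder's label proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented toy labels for ten papers (short names for the four categories).
coder_a = ["Empirical", "Empirical", "Policy", "Agenda", "Other",
           "Empirical", "Policy", "Empirical", "Agenda", "Policy"]
coder_b = ["Empirical", "Policy", "Policy", "Agenda", "Other",
           "Empirical", "Policy", "Empirical", "Empirical", "Policy"]

agreement = percent_agreement(coder_a, coder_b)
kappa = cohens_kappa(coder_a, coder_b)
print(f"agreement: {agreement:.1%}, kappa: {kappa:.2f}")
```

Raw agreement is easy to read, but it tends to rise when one category dominates, which is why a chance-corrected figure is often reported alongside it.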
Topic modelling is a machine learning technique that identifies topics in a given corpus. We assume that a document consists of multiple topics with varying probabilities, and topic modelling estimates the probability distribution over topics for each document. From a topic model, we can extract the keywords of each topic, as well as the distribution of topics in each document.
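To make the two outputs concrete (keywords per topic and a topic distribution per document), here is a minimal pure-Python sketch. It assumes latent Dirichlet allocation fitted by collapsed Gibbs sampling, which is one common way of building a topic model; the post does not state which algorithm or software was used. The function `fit_lda`, the hyperparameters `alpha` and `beta`, and the toy paragraphs are all hypothetical.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01,
            top_n=3, seed=0):
    """Toy LDA via collapsed Gibbs sampling.

    Returns (theta, keywords): theta[d] is the topic distribution of
    document d, keywords[k] lists the most frequent words of topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    keywords = [
        [w for w, c in topic_word[k].most_common(top_n) if c > 0]
        for k in range(num_topics)
    ]
    return theta, keywords


# Invented toy "paragraphs", already tokenised.
paragraphs = [
    "climate policy emissions policy climate".split(),
    "emissions climate policy targets".split(),
    "survey households data interviews survey".split(),
    "interviews data households survey".split(),
]

theta, keywords = fit_lda(paragraphs, num_topics=2)
for k, words in enumerate(keywords):
    print(f"topic {k}: {words}")
for d, dist in enumerate(theta):
    print(f"paragraph {d}: {[round(p, 2) for p in dist]}")
```

Each row of `theta` sums to 1, which is the sense in which a document "consists of multiple topics with varying probability". On a real corpus one would use an established toolkit rather than a hand-written sampler like this.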
We ran topic modelling on our corpus consisting of 11 journals. The basic unit of analysis was the paragraph, and the topic models were built from texts made up of multiple paragraphs. We targeted paragraphs rather than whole papers because each paper can contain multiple topics, and it would be interesting to investigate the transition of topics within papers.
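One simple way to look at topic transitions within a paper, given a topic distribution for each paragraph, is to take the most probable topic of every paragraph and count how often one topic follows another. The sketch below assumes exactly that; the function names and the numbers are invented for illustration and do not come from the project.

```python
from collections import Counter


def dominant_topics(paragraph_topics):
    """Index of the most probable topic for each paragraph."""
    return [
        max(range(len(dist)), key=dist.__getitem__)
        for dist in paragraph_topics
    ]


def topic_transitions(sequence):
    """Count (topic, next_topic) pairs between consecutive paragraphs."""
    return Counter(zip(sequence, sequence[1:]))


# Invented topic distributions for a four-paragraph paper and three topics.
paper = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]

sequence = dominant_topics(paper)
transitions = topic_transitions(sequence)
print(sequence)
print(dict(transitions))
```

Reducing each paragraph to a single dominant topic discards information, so this is a coarse view; comparing the full distributions of neighbouring paragraphs would be a finer-grained alternative.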