Interim seminar: topic modelling


Topic modelling is a machine learning technique that identifies topics in a given corpus. We assume that each document is a mixture of several topics, each present with some probability, and topic modelling estimates this topic distribution for every document. From a fitted topic model, we can extract the keywords that characterise each topic, as well as the distribution of topics in each document.
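
To make the two outputs concrete, here is a minimal sketch in Python. The seminar summary does not say which algorithm or software was used, so this assumes latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling; the function name `fit_lda`, the hyperparameters, and the four toy "documents" are all illustrative and are not taken from our corpus.

```python
import random
from collections import Counter


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (assumed algorithm, see lead-in).

    docs is a list of tokenised documents (lists of words).
    Returns (theta, keywords): theta[d][k] is the estimated probability of
    topic k in document d; keywords[k] lists the most frequent words
    currently assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]  # count of word w in topic k
    topic_total = [0] * n_topics                       # all tokens in topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic distribution per document; each row sums to 1.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    keywords = [
        [w for w, c in topic_word[k].most_common(5) if c > 0]
        for k in range(n_topics)
    ]
    return theta, keywords


# Toy documents (made up for illustration only).
docs = [
    "sample solution concentration method analysis sample collect".split(),
    "method sample analysis solution collect concentration".split(),
    "limitation future research conclusion future study".split(),
    "conclusion limitation research future study research".split(),
]
theta, keywords = fit_lda(docs, n_topics=2)
for k, words in enumerate(keywords):
    print("topic", k, words)
for d, row in enumerate(theta):
    print("document", d, [round(p, 2) for p in row])
```

The first loop prints the keywords of each topic and the second prints the topic distribution of each document, which are the two outputs described above. A real analysis would more likely rely on an established topic modelling library than on a hand-written sampler.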

We ran topic modelling on our corpus of 11 journals. The basic unit of analysis was the paragraph, and the texts from which the topic models were built each consisted of multiple paragraphs. We targeted paragraphs rather than whole papers because a single paper can cover multiple topics, and we wanted to investigate how topics shift within papers.
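
As a rough illustration of working at the paragraph level, the sketch below splits a paper into paragraphs and records where each one sits within the paper. It assumes that paragraphs are separated by blank lines and that position is measured as a fraction from 0 (first paragraph) to 1 (last paragraph); the function name, the field names, and the sample text are hypothetical and do not describe our actual preprocessing.

```python
def paragraphs_with_position(paper_id, paper_text):
    """Split a paper into paragraphs and tag each with its relative position.

    Assumes blank lines separate paragraphs. Position runs from 0.0 for the
    first paragraph to 1.0 for the last; a one-paragraph paper gets 0.0.
    """
    paragraphs = [p.strip() for p in paper_text.split("\n\n") if p.strip()]
    n = len(paragraphs)
    return [
        {
            "paper_id": paper_id,
            "position": i / (n - 1) if n > 1 else 0.0,
            "text": text,
        }
        for i, text in enumerate(paragraphs)
    ]


# Hypothetical three-paragraph paper.
paper = (
    "Background to the study.\n\n"
    "Samples were collected and analysed.\n\n"
    "Future research should address these limitations."
)
units = paragraphs_with_position("paper-001", paper)
for unit in units:
    print(unit["position"], unit["text"])
```

Keeping the relative position alongside each paragraph is what later makes it possible to ask how a topic's probability changes from the start to the end of a paper.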

Based on the topic model with 100 topics, we identified the topics that show prominent patterns of within-paper transition, and looked into their keywords to interpret the patterns. For instance, there were four topics whose probability is high roughly 30% of the way from the beginning to the end of a paper. When we inspected the keywords of these topics, it was clear that they are closely associated with the method section of typical empirical papers, with keywords such as solution, sample, method, analysis, concentration, and collect. We also looked at the topics that are prominent at the end of the paper (i.e., topics typically associated with limitations, future research, and conclusions), those prominent at the beginning of the paper (i.e., topics associated with setting the scene for the study), and the topics whose probability follows a U-shape (i.e., topics that address broader issues that authors mention at both the beginning and the end of the paper).
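
One simple way to look for such patterns is to average each topic's probability over paragraphs that fall in the same part of a paper, and then inspect the shape of the resulting profile. The sketch below assumes each paragraph comes with a relative position between 0 and 1 and a list of topic probabilities; the function names, the number of bins, the U-shape rule with its `margin` threshold, and the two-topic toy records are all illustrative assumptions, not the procedure or data from the seminar.

```python
def position_profile(records, n_topics, n_bins=5):
    """Average topic probabilities within equal-width bins of relative position.

    records is a list of (position, probs) pairs, with position in [0, 1].
    Returns profile[b][k]: mean probability of topic k in bin b
    (0.0 for bins that contain no paragraphs).
    """
    sums = [[0.0] * n_topics for _ in range(n_bins)]
    counts = [0] * n_bins
    for position, probs in records:
        b = min(int(position * n_bins), n_bins - 1)  # position 1.0 goes in the last bin
        counts[b] += 1
        for k, p in enumerate(probs):
            sums[b][k] += p
    return [
        [s / counts[b] if counts[b] else 0.0 for s in sums[b]]
        for b in range(n_bins)
    ]


def topic_column(profile, topic):
    """Probability of one topic across the bins, from start to end of paper."""
    return [row[topic] for row in profile]


def peak_bin(column):
    """Index of the bin where the topic's mean probability is highest."""
    return max(range(len(column)), key=column.__getitem__)


def is_u_shaped(column, margin=0.1):
    """Illustrative rule: both ends exceed the lowest interior bin by margin."""
    interior_min = min(column[1:-1])
    return (column[0] - interior_min > margin
            and column[-1] - interior_min > margin)


# Made-up records: topic 0 peaks about 30% into the paper, topic 1 is U-shaped.
records = [
    (0.0, [0.1, 0.9]),
    (0.3, [0.8, 0.2]),
    (0.5, [0.6, 0.4]),
    (0.7, [0.3, 0.7]),
    (1.0, [0.1, 0.9]),
]
profile = position_profile(records, n_topics=2)
print(peak_bin(topic_column(profile, 0)))     # bin where topic 0 is highest
print(is_u_shaped(topic_column(profile, 1)))  # whether topic 1 looks U-shaped
```

With real data, each bin would pool many paragraphs from many papers, and plotting the columns would usually be more informative than a fixed threshold rule.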

