xAIM - Text Mining: Topic Modelling

Created by Galina MISHEVA

Updated 21 January 2025

The Text Mining course is an elective course within the eXplainable Artificial Intelligence in healthcare Management (xAIM) master’s programme, an EU-funded, specialised higher education programme, which seeks to address the lack of digital skills training in the healthcare sector and build a talent pool of qualified healthcare experts in AI and computer science.

This matters because Artificial Intelligence (AI) is being deployed more widely than ever, and because the healthcare sector is so significant for European society as a whole.

Text Mining: Learning outcomes

In this elective course on Text Mining, students will learn how core machine learning algorithms for text mining are constructed and applied in practice. They will also get a thorough overview of key concepts such as natural language processing, text mining, and text analysis. Finally, students will learn the basics of undertaking various text-related data mining tasks through visual programming, using Orange as their tool of choice.

Upon successful completion of the course, students will be able to pre-process textual data, understand its specifics and what to look for in it, transform raw text into an attribute-value representation, and evaluate language-based models.

Lecture 5: Topic Modelling

Topic modelling is a technique that can help you organise unlabelled documents by discovering latent topics in texts. This technique looks at word distributions and infers topics from them - the more often words appear together, the more likely it is they form a topic.
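
The intuition that words appearing together hint at a shared topic can be illustrated with a minimal sketch. It only counts how many documents each word pair shares; it is not a topic model. The toy corpus and the function name `cooccurrence_counts` are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; each document is already tokenised.
docs = [
    ["patient", "doctor", "hospital"],
    ["doctor", "hospital", "nurse"],
    ["model", "training", "data"],
    ["data", "model", "algorithm"],
]

def cooccurrence_counts(documents):
    """Count the number of documents in which each unordered word pair co-occurs."""
    counts = Counter()
    for tokens in documents:
        # Use the set of tokens so a pair is counted once per document.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts

pair_counts = cooccurrence_counts(docs)
for pair, n in pair_counts.most_common(3):
    print(pair, n)
```

In this toy corpus the pairs ("doctor", "hospital") and ("data", "model") each co-occur in two documents, which loosely suggests a "healthcare" group and a "machine learning" group of words. Real topic models infer such groups probabilistically rather than from raw pair counts.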

The 5th lecture in the Text Mining elective course, 'Topic Modelling', is split into 4 chapters, prepared by Ajda Pretnar Žagar and Blaž Zupan with the support of members of the Bioinformatics Lab at the University of Ljubljana in Slovenia.

Latent Dirichlet Allocation (LDA): an overview of one of the most popular topic modelling techniques, a generative model that begins with randomly assigned topics, which are then iteratively updated based on the probabilities of words in a topic.
BERTopic: this chapter introduces a modern transformer-based topic modelling technique that uses text embeddings to determine topics.
Topic Modelling Comparison: this chapter offers a comparative analysis of the 2 techniques introduced in the previous 2 chapters (LDA as a traditional, statistical method best suited for tasks where context is not needed, and BERTopic as a modern approach that captures deeper semantic meaning).
Other approaches: in the final chapter of the lecture, students are introduced to further popular topic modelling methods, such as Latent Semantic Analysis (LSA) and Top2Vec.
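
The LDA description above (random initial topic assignments, then iterative updates based on word probabilities) can be sketched as a toy collapsed Gibbs sampler. This is one common way to fit LDA and is not necessarily what the lecture or Orange uses. The function name `lda_gibbs`, the toy corpus, and the hyperparameter values `alpha` and `beta` are assumptions made for illustration.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimised)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]              # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                            # tokens per topic
    assignments = []

    # Step 1: assign every token a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Step 2: repeatedly resample each token's topic given all other tokens.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Weight = (topic share in this doc) * (word probability in topic).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return doc_topic, topic_word, vocab

# Hypothetical toy corpus of tokenised documents.
docs = [
    ["patient", "doctor", "hospital", "nurse"],
    ["doctor", "hospital", "patient", "treatment"],
    ["model", "training", "data", "algorithm"],
    ["data", "model", "algorithm", "training"],
]

doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda v: counts[v], reverse=True)[:3]
    print(f"topic {k}:", [vocab[v] for v in top])
```

On a corpus this small the resulting topics depend heavily on the random seed and the hyperparameters, so the printed words should be read as a demonstration of the mechanics rather than a meaningful result.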

All 4 chapters in this lecture contain theoretical notes, schemas and practical examples, helping learners build a gradual understanding of the most commonly used topic modelling techniques.