xAIM - Text Mining: Document Clustering

Created by Juliette Chalant Devlesaver

Posted 13 November 2024

The Text Mining course is an elective within the eXplainable Artificial Intelligence in healthcare Management (xAIM) master’s programme. As Artificial Intelligence (AI) grows in importance, especially in healthcare, the lack of digital skills training in the sector is becoming a pressing problem. The master’s programme addresses this gap by training qualified healthcare professionals in AI and computer scientists in healthcare.

Text Mining: Learning outcomes

In the Text Mining course, students learn to use the core machine learning algorithms for text mining. The course introduces natural language processing, text mining, and text analysis, and teaches students to accomplish various text-related data mining tasks through visual programming. After completing the course, students will be able to preprocess textual data, understand the specifics of text, transform raw text into an attribute-value representation, and evaluate language-based models.

Lecture 4: Document clustering

The fourth lecture in the Text Mining course focuses on document clustering and is split into three chapters:

Document clustering: addresses a common requirement in text mining, the identification of similar documents.
Comparison of clustering methods: examines the clustering methods one can use (such as hierarchical clustering, k-Means, and DBSCAN) and compares the specifics of each approach.
Word enrichment: compares a subset of documents against the entire corpus in order to find words that are statistically significant for the selected subset.
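
To make the idea of word enrichment more concrete, here is a minimal Python sketch. It assumes a hypergeometric test, which is one common way to ask whether a word appears in the selected subset more often than chance would predict; whether the course material uses exactly this test is an assumption. The function names (hypergeom_pvalue, enriched_words) and the toy documents are hypothetical and chosen only for illustration.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance that at least k of n sampled documents contain
    the word, when K of the N corpus documents contain it."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def enriched_words(subset, corpus, alpha=0.05):
    """Return (word, p-value) pairs for words over-represented in the subset.

    Both arguments are lists of token sets; the subset documents are
    assumed to be part of the corpus as well.
    """
    N, n = len(corpus), len(subset)
    results = []
    for word in set().union(*subset):
        K = sum(word in doc for doc in corpus)  # corpus documents with the word
        k = sum(word in doc for doc in subset)  # subset documents with the word
        p = hypergeom_pvalue(k, n, K, N)
        if p <= alpha:
            results.append((word, p))
    return sorted(results, key=lambda item: (item[1], item[0]))

# Toy data (hypothetical): three selected documents inside a corpus of ten.
subset = [{"the", "wolf", "forest"}, {"the", "wolf", "pigs"}, {"the", "wolf", "girl"}]
corpus = subset + [{"the", "princess"} for _ in range(7)]

enriched = enriched_words(subset, corpus)
print(enriched)
```

In this toy corpus, "wolf" occurs in all three selected documents and nowhere else, so it is flagged, while "the" occurs in every document and is not. The sketch applies no multiple-testing correction, which a real analysis over a large vocabulary would normally need.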

Each of these chapters includes theoretical information and is accompanied by practical examples that help students better grasp the learning material. The chapters were prepared by Ajda Pretnar Žagar and Blaž Zupan with the support of members of the Bioinformatics Lab at the University of Ljubljana in Slovenia.

The video below illustrates how to cluster a collection of text documents using a dataset of Grimm's fairy tales. The process involves pre-processing, TF-IDF vectorisation, cosine distance, and hierarchical clustering to identify groups of similar stories. Word enrichment is then applied to understand why certain tales appear in unexpected groups. These techniques offer a practical approach to organising and analysing text data.
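
The course itself uses visual programming, but the same steps can be sketched in plain Python for readers who prefer code. The following is a minimal illustration, not the lecture's actual workflow: the four short documents are made-up stand-ins for the fairy tales, pre-processing is reduced to lowercasing and tokenising, and average linkage is an assumed choice for the hierarchical clustering step.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Minimal pre-processing: lowercase and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def hierarchical_clustering(vectors, n_clusters):
    """Agglomerative clustering with average linkage; returns lists of document indices."""
    dist = [[cosine_distance(a, b) for b in vectors] for a in vectors]
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        _, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical toy documents standing in for the fairy tales.
docs = [
    "The wolf chased the little pigs through the forest",
    "A hungry wolf waited in the forest for the girl",
    "The princess kissed the frog by the castle well",
    "The prince met the princess at the castle ball",
]
clusters = hierarchical_clustering(tfidf_vectors(docs), n_clusters=2)
print(clusters)
```

Because "the" appears in every document, its TF-IDF weight is zero, so the grouping is driven by the words the stories actually share, such as "wolf" and "forest" or "princess" and "castle". For larger collections, an established library would be a more practical choice than this hand-written loop.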