xAIM - Text Mining: Document Classification

Created by Galina MISHEVA

Updated 21 January 2025

The Text Mining course is an elective course within the eXplainable Artificial Intelligence in healthcare Management (xAIM) master's programme, an EU-funded, specialised master's that seeks to address the lack of digital skills training in the healthcare sector and build a talent pool of qualified healthcare experts in AI and computer science. This matters because Artificial Intelligence (AI) is reaching new heights in both deployment and implementation, and because the healthcare sector is significant for European society as a whole.

Text Mining: Learning outcomes

With this elective course on Text Mining, students will learn how core machine learning algorithms for text mining are constructed and applied in practice. They will also get a thorough overview of key concepts such as natural language processing, text mining, and text analysis. Finally, students will learn the basics of undertaking various text-related data mining tasks through visual programming, using Orange as their tool of choice.

Upon successful completion of the course, students will be able to pre-process textual data, understand its specifics and what to look for in it, transform raw text into an attribute-value representation, and evaluate language-based models.

Lecture 3: Document Classification

The 3rd lecture in the Text Mining elective course, 'Document Classification', is split into 4 chapters, developed with the support of members of the Bioinformatics Lab at the University of Ljubljana, Slovenia:

Document classification: learn how to predict a tale's type under the Aarne-Thompson-Uther (ATU) index of folk tales, using the corpus of the Grimm brothers' tales as an example.
Logistic regression: an overview of one of the most popular machine learning methods in text mining, valued for its speed and predictive performance, and of how it works in practice (i.e. letting the words "vote").
Model evaluation: this chapter takes learners through the ways to assess and quantify the performance of the model and the extent to which it follows previously-defined logical schemas.
Predictions: this chapter focuses on predicting tale types using 3 stories by Hans Christian Andersen (the author of The Little Mermaid) as examples.

All 4 chapters in this lecture contain theoretical notes, data schemas, and visual graphs, together with practical examples and screenshots of the main steps and processes explained. The chapters were prepared by Ajda Pretnar Žagar and Blaž Zupan with the support of members of the Bioinformatics Lab at the University of Ljubljana in Slovenia.

The video demonstrates the creation of an explainable classifier to categorize BBC3 news articles using logistic regression and TF-IDF vectorisation. It explains how to interpret the model's predictions with a nomogram, test it on new articles, and explore methods to refine text classification techniques.
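The course itself builds this workflow with Orange's visual widgets. For readers who prefer to see the idea in code, the following is a minimal plain-Python sketch of TF-IDF weighting followed by logistic regression. It is an illustration under stated assumptions, not the workflow shown in the video: the six-sentence corpus, the two labels, the helper names (tokenize, fit_idf, tfidf, train, predict_proba) and the training settings are all hypothetical, and the data is not the BBC3 collection.

```python
import math
from collections import Counter

# Hypothetical toy corpus (NOT the BBC3 data): label 1 = sport, 0 = business.
docs = [
    ("the team won the match with a late goal", 1),
    ("the striker scored a goal in the final match", 1),
    ("the coach praised the team after the game", 1),
    ("shares fell as the bank reported a loss", 0),
    ("the market rallied after the bank raised profit forecasts", 0),
    ("investors sold shares as profit fell", 0),
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def fit_idf(token_lists):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(tokens, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(vec, weights, bias):
    """Each word 'votes': its weight times its TF-IDF value, summed with the bias."""
    return bias + sum(weights.get(t, 0.0) * v for t, v in vec.items())

def train(vectors, labels, epochs=500, lr=0.5):
    """Logistic regression fitted with plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            err = y - sigmoid(score(vec, weights, bias))
            bias += lr * err
            for t, v in vec.items():
                weights[t] = weights.get(t, 0.0) + lr * err * v
    return weights, bias

def predict_proba(text, idf, weights, bias):
    """Probability that the text belongs to class 1 (sport)."""
    return sigmoid(score(tfidf(tokenize(text), idf), weights, bias))

token_lists = [tokenize(text) for text, _ in docs]
labels = [label for _, label in docs]
idf = fit_idf(token_lists)
weights, bias = train([tfidf(tokens, idf) for tokens in token_lists], labels)

# Words with the largest positive and negative weights cast the strongest votes.
ranked = sorted(weights.items(), key=lambda item: item[1])
print("strongest votes for business:", [t for t, _ in ranked[:3]])
print("strongest votes for sport:", [t for t, _ in ranked[-3:]])

p_sport = predict_proba("the team scored a late goal", idf, weights, bias)
p_business = predict_proba("investors sold bank shares as profit fell", idf, weights, bias)
print(f"P(sport) for the sport sentence: {p_sport:.2f}")
print(f"P(sport) for the business sentence: {p_business:.2f}")
```

The learned weights play the role that the nomogram plays in the video: a positive weight pushes a document towards one class, a negative weight towards the other, and the size of the weight shows how strongly that word counts.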