xAIM - Text Mining: Document vectorisation

Created by Laia Güell Paule

Updated 19 December 2024

The Text Mining course is an elective course within the eXplainable Artificial Intelligence in healthcare Management (xAIM) master’s programme. As Artificial Intelligence (AI) grows in importance, especially in healthcare, the sector's lack of digital skills training needs to be addressed. This master’s programme does so by training qualified healthcare professionals in the field of AI and computer scientists in the field of healthcare.

Text Mining: Learning outcomes

In the Text Mining course, students learn how to use the core machine learning algorithms for text mining. The course introduces natural language processing, text mining, and text analysis, and students learn to accomplish various text-related data mining tasks through visual programming. On completing the course, students will be able to pre-process textual data, understand the specifics of text, transform raw text into an attribute-value representation, and evaluate language-based models.

Lecture 2: Document vectorisation

The Document vectorisation lecture is divided into three chapters:

Why vectorise text? An introduction to the need for converting text into numerical vectors for computational analysis.
Common vectorisation methods: key techniques, including:
- Bag-of-Words (BoW) for frequency-based text representation.
- TF-IDF for emphasising terms that are frequent in a document but rare across the collection.
- Word embeddings (e.g., Word2Vec) for capturing semantic relationships.
Applications of vectorisation: the use of vectorised text in tasks like clustering, classification, and sentiment analysis.
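
To make the TF-IDF idea from the list above concrete, here is a minimal Python sketch. It is not part of the course materials, which use visual programming. It assumes whitespace tokenisation and the plain weighting tf × log(N / df); libraries often use smoothed variants. The function names and the three example sentences are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real pipelines
    # usually also strip punctuation and may remove stop words.
    return text.lower().split()


def tf_idf(documents):
    """Return one {token: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Term frequency times inverse document frequency.
            token: (count / total) * math.log(n_docs / df[token])
            for token, count in counts.items()
        })
    return weights


# Invented example documents.
docs = [
    "the patient reports chest pain",
    "the patient reports mild headache",
    "the scan shows no fracture",
]
scores = tf_idf(docs)
```

In this sketch, "the" appears in every document, so its weight is zero, while "chest" appears in only one document and receives the highest weight among the terms of the first sentence.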

Each chapter includes theoretical explanations and practical examples to reinforce understanding. The chapters were prepared by Ajda Pretnar Žagar and Blaž Zupan with the support of members of the Bioinformatics Lab at the University of Ljubljana in Slovenia.

The first video below focuses on the bag-of-words technique, which represents text by token frequencies and is therefore easier to interpret than embedding methods.
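
As a rough illustration of why bag-of-words is easy to interpret, the following Python sketch builds count vectors by hand. It is a sketch under simple assumptions (whitespace tokenisation, invented sentences, a hypothetical helper named bag_of_words), not the workflow shown in the video.

```python
from collections import Counter


def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    # Assumption: lowercase and split on whitespace only.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One column per vocabulary token, holding its frequency.
        vectors.append([counts[token] for token in vocabulary])
    return vocabulary, vectors


# Invented example documents.
docs = ["the cat sat on the mat", "the dog sat"]
vocabulary, vectors = bag_of_words(docs)

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 1, 1, 1, 2]
print(vectors[1])  # [0, 1, 0, 0, 1, 1]
```

Each column corresponds to a readable token, so a large value can be traced directly back to a word in the text. The dimensions of an embedding vector have no such direct meaning.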

The second video looks at document embeddings in text analysis. It shows how to load datasets, embed text, classify documents, and interpret results using techniques such as t-SNE visualisation.
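
One simple way to picture a document embedding is as the average of the word vectors in the document. The Python sketch below assumes this averaging approach and uses a tiny, made-up table of 3-dimensional word vectors; real embedders use pretrained models with hundreds of dimensions, and the video may use a different method. All names and numbers here are hypothetical.

```python
import math

# Hypothetical word vectors, invented purely for illustration.
WORD_VECTORS = {
    "doctor": [0.9, 0.1, 0.0],
    "nurse": [0.8, 0.2, 0.0],
    "hospital": [0.7, 0.3, 0.1],
    "goal": [0.0, 0.1, 0.9],
    "match": [0.1, 0.0, 0.8],
    "team": [0.1, 0.2, 0.7],
}


def embed(text):
    """Average the vectors of all known words in the text."""
    vectors = [WORD_VECTORS[t] for t in text.lower().split() if t in WORD_VECTORS]
    if not vectors:
        # No known words: fall back to a zero vector.
        return [0.0, 0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


health_a = embed("the doctor and the nurse")
health_b = embed("a hospital doctor")
sport = embed("the team won the match")

same_topic = cosine(health_a, health_b)
different_topic = cosine(health_a, sport)
```

With these toy vectors, the two health-related sentences end up much closer to each other than to the sports sentence. A classifier or a t-SNE plot built on embeddings relies on the same kind of closeness.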