Explainable Artificial Intelligence in Medicine (xAIM) Learning Path - Text Mining
The xAIM project provides a comprehensive learning path designed to equip individuals with the knowledge and skills needed to leverage Explainable Artificial Intelligence (xAI) in healthcare. In collaboration with Goethe University (Germany), Keele University (UK), Leibniz University Hannover (Germany), and the University of Ljubljana (Slovenia), the learning path offers a selected course from the xAIM Master’s program to introduce students to Explainable AI.
The xAIM Master’s program covers core principles across three main areas: healthcare management, artificial intelligence, and ethical and legal considerations. Key topics include the role and applications of AI techniques in the healthcare sector, opportunities and challenges of data-driven approaches in medical environments, methods for analyzing and interpreting complex healthcare datasets and communicating insights to stakeholders, as well as addressing ethical and social implications of AI and new technologies. Additionally, the students develop advanced programming skills, including deep learning, text mining, and computer vision.
Text mining seeks to extract insights, patterns, and knowledge from large sets of textual data, transforming unstructured text into structured information for analysis and decision-making. This curriculum presents a structured learning path that begins with the various techniques for text pre-processing and visualisation, introduces document vectorisation, applies natural language processing approaches to document clustering and classification, and explains topic modelling. It is based on the elective course offered at the xAIM Master of the University of Pavia, developed with co-funding from the xAIM European project.
The course is organised into nine topics, categorised into Introductory (five units) and Advanced (four units). The introductory material covers core concepts of text mining, while the advanced units offer further insight into Latent Dirichlet Allocation, sentiment analysis, semantic search, and co-occurrence networks.
Introductory materials
- Introduction to Text Mining
- Document Vectorisation
- Document Classification
- Document Clustering
- Topic Modelling
Advanced materials
- Explaining LDA
- Sentiment Analysis
- Semantic Search
- Document Networks
xAIM - Text Mining: Introduction to Text Mining
Let's dive into text mining with this first introductory unit. Introduction to Text Mining details the initial steps for preparing text data for analysis. It covers essential concepts like tokenisation, lemmatisation, stemming, and filtering. The process includes filtering out punctuation and stopwords, or applying custom filters. Techniques such as n-grams and POS tagging are also discussed. These steps are crucial for converting raw text into a format suitable for downstream analysis using tools like Orange.
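To give a rough idea of what these steps look like in code, here is a minimal sketch using Python and NLTK (an assumption: the unit itself demonstrates the workflow in Orange, and the example sentence is invented for illustration). The NLTK data packages for tokenisation, stopwords, lemmatisation, and POS tagging must be downloaded beforehand.

```python
# Minimal pre-processing sketch with NLTK (illustrative only).
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams

text = "The patients were admitted with acute respiratory infections."

# Tokenisation: split raw text into lower-cased word tokens.
tokens = nltk.word_tokenize(text.lower())

# Filtering: drop punctuation tokens and common English stopwords.
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t not in string.punctuation and t not in stop_words]

# Stemming vs. lemmatisation: crude suffix stripping vs. dictionary-based normalisation.
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in filtered])
print([lemmatizer.lemmatize(t) for t in filtered])

# POS tagging and bigrams (n-grams with n = 2) on the filtered tokens.
print(nltk.pos_tag(filtered))
print(list(ngrams(filtered, 2)))
```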
xAIM - Text Mining: Document Vectorisation
We take a step further by diving into Document Vectorisation, which discusses techniques for converting text documents into numerical vectors suitable for machine learning tasks. It covers two main techniques: Bag of Words (BOW) with the TF-IDF transform and Document Embedding. BOW involves counting word occurrences, while Document Embedding uses pre-trained models to create vectors capturing semantic relationships between words. The chapter highlights the advantages and limitations of both methods, including their application contexts and pre-processing requirements. The focus is on transforming text data into formats that algorithms can process effectively.
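The sketch below shows the Bag of Words and TF-IDF steps with scikit-learn (an assumption: the unit demonstrates the same ideas in Orange, and the three toy documents are illustrative only).

```python
# Bag of Words counts followed by a TF-IDF re-weighting.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "the patient reports chest pain",
    "chest x-ray shows no abnormality",
    "the patient was discharged home",
]

# Bag of Words: each document becomes a vector of raw word counts.
counts = CountVectorizer().fit_transform(docs)

# TF-IDF transform: words frequent in one document but rare across the
# corpus receive higher weights.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)  # (number of documents, vocabulary size)
```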
xAIM - Text Mining: Document Classification
Document Classification explores techniques for categorising text documents into predefined classes or categories using machine learning algorithms. It discusses logistic regression, a simple machine learning classifier, and its application in automated document sorting and classification tasks. It covers converting text into numerical representations, training predictive models, evaluating performance using metrics like classification accuracy and AUC, and making predictions on new data. The chapter includes practical examples, such as classifying Grimm's tales based on their content and predicting tale types for Andersen's stories. The emphasis is on explaining the results of a logistic regression classifier and predicting on new data.
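A minimal version of this workflow in scikit-learn might look as follows (an assumption: the toy texts and tale-type labels are invented stand-ins for the Grimm and Andersen corpora used in the unit).

```python
# TF-IDF vectorisation followed by a logistic regression classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "once upon a time a wolf met a fox",
    "the princess found an enchanted frog",
    "a wolf chased a girl through the woods",
    "the prince searched for the lost princess",
]
labels = ["animal tale", "magic tale", "animal tale", "magic tale"]

# Vectorise the documents and train the classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Predict the tale type of a new, unseen story; the per-word coefficients
# of the logistic regression explain which terms pull a document towards each class.
print(model.predict(["a fox and a wolf went hunting"]))
```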
xAIM - Text Mining: Document Clustering
Document Clustering explains methods for grouping similar text documents into clusters without predefined categories. The material discusses different distance metrics like Euclidean and cosine distance. It explains hierarchical clustering with a dendrogram and visualises document similarities using Multidimensional Scaling (MDS). The chapter also compares clustering methods such as k-Means, DBSCAN, and Gaussian Mixture Models, and introduces the Word Enrichment tool to identify characteristic words for document clusters. The focus is on organising unstructured text data into meaningful groups and explaining the found clusters.
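The following sketch reproduces the core steps with SciPy and scikit-learn (an assumption: the unit performs hierarchical clustering through Orange widgets, and the four toy documents are illustrative only).

```python
# Cosine distances between TF-IDF vectors, then hierarchical clustering.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the wolf chased the sheep",
    "a wolf and a fox hunted together",
    "the ship sailed across the sea",
    "sailors watched the stormy sea",
]

# Vectorise the documents and compute pairwise cosine distances.
vectors = TfidfVectorizer().fit_transform(docs).toarray()
distances = pdist(vectors, metric="cosine")

# Average-linkage hierarchical clustering; the dendrogram shows how
# documents merge into larger and larger groups.
tree = linkage(distances, method="average")
dendrogram(tree, labels=["wolf 1", "wolf 2", "sea 1", "sea 2"])
plt.show()
```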
xAIM - Text Mining: Topic Modelling
Topic Modelling discusses methods for discovering latent topics within a collection of text documents. It covers Latent Dirichlet Allocation (LDA), a popular technique used to extract and identify themes or topics from unstructured textual data. We contrast LDA with BERTopic, and touch upon Non-Negative Matrix Factorization (NMF) and Latent Semantic Analysis (LSA). Special techniques such as Dynamic Topic Models (DTM) and Structural Topic Models (STM) are also presented. The chapter explains the processes, advantages, and limitations of each method, and includes examples and comparisons. The focus is on providing insights into the underlying themes present in large text corpora, naming topics, and visually exploring topic properties.
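As a small illustration, here is an LDA sketch with scikit-learn (an assumption: the unit also covers BERTopic, NMF, and LSA, which follow a similar fit-and-inspect pattern; the toy corpus is invented).

```python
# Fit a two-topic LDA model and inspect the top words of each topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "heart attack patients received aspirin",
    "aspirin reduces the risk of heart disease",
    "the new vaccine protects against influenza",
    "influenza vaccine trials reported mild side effects",
]

# LDA works on raw word counts rather than TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[::-1][:5]]
    print(f"topic {i}: {top}")  # naming topics starts from their top words
```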
xAIM - Text Mining: Explaining Latent Dirichlet Allocation (LDA)
We start the more advanced part of this learning path by taking a closer look at Latent Dirichlet Allocation (LDA). This unit explains the assumptions behind the LDA method. It provides a detailed step-by-step explanation of the algorithm (based on Gibbs sampling) with an example from medicine, and compares Gibbs sampling with variational inference. Finally, it gives an overview of tools that implement LDA with additional options beyond the described method.
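To make the Gibbs-sampling view concrete, the sketch below implements the per-token resampling step of collapsed Gibbs sampling for LDA on a toy corpus (an assumption: the word ids, hyperparameters, and corpus are invented for illustration and do not come from the unit's medical example).

```python
# Collapsed Gibbs sampling for LDA on a toy corpus of word-id documents.
import numpy as np

docs = [[0, 1, 2], [0, 0, 3], [4, 5, 4], [5, 4, 2]]   # documents as word ids
V, K, alpha, beta = 6, 2, 0.1, 0.01                   # vocab size, topics, priors
rng = np.random.default_rng(0)

# Count matrices: topics per document, words per topic, totals per topic.
ndk = np.zeros((len(docs), K))
nkw = np.zeros((K, V))
nk = np.zeros(K)
z = []  # current topic assignment of every word token

# Random initialisation of topic assignments.
for d, doc in enumerate(docs):
    z.append([])
    for w in doc:
        k = rng.integers(K)
        z[d].append(k)
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

# Gibbs sweeps: resample each token's topic conditioned on all other assignments.
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1    # remove current assignment
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())              # sample a new topic
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

print(nkw)  # word-topic counts approximate the learned topics
```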
xAIM - Text Mining: Sentiment Analysis
Sentiment Analysis explores techniques for automatically determining the sentiment expressed in text data. It gives an overview of lexicon-based and machine-learning-based approaches, with an emphasis on the former. The focus is on analysing the emotional valence in text, visualising the results, and extracting relevant documents. Sentiment analysis (or opinion mining) is the task of extracting sentiment from text data. A sentiment statement comprises the opinion holder (e.g. a reviewer), optionally the time of the event, the sentiment target (a product, movie, service, ...), and the sentiment itself (positive, negative, or neutral). Furthermore, we are interested in polarity (+/-/0), intensity (high, medium, low), and/or a specific emotion (fear, anger, joy, surprise, disgust).
This unit dives into the three approaches to sentiment extraction: lexicon-based, machine-learning-based, and hybrid.
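As a small lexicon-based example, the sketch below uses NLTK's VADER analyser (an assumption: the unit works with lexicon-based sentiment in Orange, and the two review sentences are invented; the vader_lexicon data package must be downloaded beforehand).

```python
# Lexicon-based sentiment scoring with NLTK's VADER analyser.
from nltk.sentiment import SentimentIntensityAnalyzer

reviews = [
    "The staff were wonderful and the treatment worked perfectly.",
    "The waiting time was awful and nobody answered my questions.",
]

analyser = SentimentIntensityAnalyzer()
for text in reviews:
    scores = analyser.polarity_scores(text)
    # 'compound' summarises polarity in [-1, 1]; pos/neu/neg give the proportions.
    print(scores["compound"], text)
```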
xAIM - Text Mining: Semantic Analysis
We now move on to Semantic Analysis, which discusses techniques to uncover meanings in unstructured text and organise documents based on conceptual similarity. It covers constructing annotated document maps using tools like t-SNE and Gaussian mixture models. An example with patient notes from PubMed Central demonstrates creating a 2D document map to identify clusters related to different medical conditions.
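A minimal document-map sketch is shown below (an assumption: TF-IDF vectors and four invented notes stand in for the document embeddings and PubMed Central patient notes used in the unit).

```python
# Project documents to 2D with t-SNE and group them with a Gaussian mixture model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

notes = [
    "patient presents with chest pain and shortness of breath",
    "ecg shows signs of myocardial infarction",
    "persistent cough and fever for three days",
    "chest x-ray consistent with pneumonia",
]

# 2D coordinates for the document map (perplexity kept small for a tiny corpus).
vectors = TfidfVectorizer().fit_transform(notes).toarray()
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)

# A Gaussian mixture model groups nearby points into annotated clusters.
clusters = GaussianMixture(n_components=2, random_state=0).fit_predict(coords)
for (x, y), c, note in zip(coords, clusters, notes):
    print(f"cluster {c}  ({x:+.1f}, {y:+.1f})  {note}")
```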
xAIM - Text Mining: Document Networks
We conclude this learning path with a unit on Document Networks, which discusses co-occurrence networks in computational linguistics, where words (nodes) are connected (edges) if they appear together in documents. The example involves constructing a word network from Grimm's tales, setting parameters for word co-occurrence, and visualising the network to identify central words and their connections. Additional analyses like degree centrality and network clustering are also suggested.
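The sketch below builds such a co-occurrence network with NetworkX (an assumption: the unit constructs the word network from Grimm's tales in Orange, and the tokenised toy documents here are invented for illustration).

```python
# Build a word co-occurrence network and rank words by degree centrality.
from itertools import combinations

import networkx as nx

docs = [
    ["wolf", "girl", "forest"],
    ["wolf", "grandmother", "forest"],
    ["prince", "princess", "castle"],
    ["princess", "frog", "castle"],
]

# Connect two words whenever they appear together in the same document;
# edge weights count how often each pair co-occurs.
graph = nx.Graph()
for words in docs:
    for a, b in combinations(sorted(set(words)), 2):
        weight = graph[a][b]["weight"] + 1 if graph.has_edge(a, b) else 1
        graph.add_edge(a, b, weight=weight)

# Degree centrality highlights the most connected (central) words.
print(sorted(nx.degree_centrality(graph).items(), key=lambda kv: -kv[1])[:3])
```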