Explainable Artificial Intelligence in Medicine (xAIM) Learning Path - Text Mining
The xAIM project provides a comprehensive learning path designed to equip individuals with the knowledge and skills needed to leverage Explainable Artificial Intelligence (xAI) in healthcare. In collaboration with Goethe University (Germany), Keele University (UK), Leibniz University Hannover (Germany), and the University of Ljubljana (Slovenia), the learning path offers a selected course from the xAIM Master’s program to introduce students to Explainable AI.
The xAIM Master’s program covers core principles across three main areas: healthcare management, artificial intelligence, and ethical and legal considerations. Key topics include the role and applications of AI techniques in the healthcare sector, opportunities and challenges of data-driven approaches in medical environments, methods for analyzing and interpreting complex healthcare datasets and communicating insights to stakeholders, as well as addressing ethical and social implications of AI and new technologies. Additionally, the students develop advanced programming skills, including deep learning, text mining, and computer vision.
Text mining seeks to extract insights, patterns, and knowledge from large sets of textual data, transforming unstructured text into structured information for analysis and decision-making. This curriculum presents a structured learning path that begins with the various techniques for text pre-processing and visualisation, introduces document vectorisation, applies natural language processing approaches to document clustering and classification, and explains topic modelling. It is based on the elective course offered at the xAIM Master of the University of Pavia, developed with co-funding from the xAIM European project.
The course is organised into nine topics, categorised into Introductory (five units) and Advanced (four units). The introductory material covers core concepts of text mining, while the advanced units offer further insight into Latent Dirichlet Allocation, sentiment analysis, semantic search, and co-occurrence networks.
Introductory materials
- Introduction to Text Mining
- Document Vectorisation
- Document Classification
- Document Clustering
- Topic Modelling
Advanced materials
- Explaining LDA
- Sentiment Analysis
- Semantic Search
- Document Networks
xAIM - Text Mining: Introduction to Text Mining
Let's dive into text mining with this first introductory unit. Introduction to Text Mining details the initial steps for preparing text data for analysis. It covers essential concepts like tokenisation, lemmatisation, stemming, and filtering. The process includes filtering out punctuation and stopwords, or applying custom filters. Techniques such as n-grams and POS tagging are also discussed. These steps are crucial for converting raw text into a format suitable for downstream analysis using tools like Orange.
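To give a rough idea of what these steps look like in code, here is a minimal sketch using Python and NLTK (an assumption: the unit itself demonstrates the workflow in Orange, and the example sentence is invented for illustration). The NLTK data packages for tokenisation, stopwords, lemmatisation, and POS tagging must be downloaded beforehand.

```python
# Minimal pre-processing sketch with NLTK (illustrative only).
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams

text = "The patients were admitted with acute respiratory infections."

# Tokenisation: split raw text into lower-cased word tokens.
tokens = nltk.word_tokenize(text.lower())

# Filtering: drop punctuation tokens and common English stopwords.
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t not in string.punctuation and t not in stop_words]

# Stemming vs. lemmatisation: crude suffix stripping vs. dictionary-based normalisation.
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in filtered])
print([lemmatizer.lemmatize(t) for t in filtered])

# POS tagging and bigrams (n-grams with n = 2) on the filtered tokens.
print(nltk.pos_tag(filtered))
print(list(ngrams(filtered, 2)))
```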
xAIM - Text Mining: Document Vectorisation
We take a step further by diving into Document Vectorisation, which discusses techniques for converting text documents into numerical vectors suitable for machine learning tasks. It covers two main techniques: Bag of Words (BOW) with the TF-IDF transform and Document Embedding. BOW involves counting word occurrences, while Document Embedding uses pre-trained models to create vectors capturing semantic relationships between words. The chapter highlights the advantages and limitations of both methods, including their application contexts and pre-processing requirements. The focus is on transforming text data into formats that algorithms can process effectively.
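The sketch below shows the Bag of Words and TF-IDF steps with scikit-learn (an assumption: the unit demonstrates the same ideas in Orange, and the three toy documents are illustrative only).

```python
# Bag of Words counts followed by a TF-IDF re-weighting.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "the patient reports chest pain",
    "chest x-ray shows no abnormality",
    "the patient was discharged home",
]

# Bag of Words: each document becomes a vector of raw word counts.
counts = CountVectorizer().fit_transform(docs)

# TF-IDF transform: words frequent in one document but rare across the
# corpus receive higher weights.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)  # (number of documents, vocabulary size)
```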
xAIM - Text Mining: Document Classification
Document Classification explores techniques for categorising text documents into predefined classes or categories using machine learning algorithms. It discusses logistic regression, a simple machine learning classifier, and its application in automated document sorting and classification tasks. It covers converting text into numerical representations, training predictive models, evaluating performance using metrics like classification accuracy and AUC, and making predictions on new data. The chapter includes practical examples, such as classifying Grimm's tales based on their content and predicting tale types for Andersen's stories. The emphasis is on explaining the results of a logistic regression classifier and predicting on new data.
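A minimal version of this workflow in scikit-learn might look as follows (an assumption: the toy texts and tale-type labels are invented stand-ins for the Grimm and Andersen corpora used in the unit).

```python
# TF-IDF vectorisation followed by a logistic regression classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "once upon a time a wolf met a fox",
    "the princess found an enchanted frog",
    "a wolf chased a girl through the woods",
    "the prince searched for the lost princess",
]
labels = ["animal tale", "magic tale", "animal tale", "magic tale"]

# Vectorise the documents and train the classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Predict the tale type of a new, unseen story; the per-word coefficients
# of the logistic regression explain which terms pull a document towards each class.
print(model.predict(["a fox and a wolf went hunting"]))
```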
xAIM - Text Mining: Document Clustering
Document Clustering explains methods for grouping similar text documents into clusters without predefined categories. The material discusses different distance metrics like Euclidean and cosine distance. It explains hierarchical clustering with a dendrogram and visualises document similarities using Multidimensional Scaling (MDS). The chapter also compares clustering methods such as k-Means, DBSCAN, and Gaussian Mixture Models, and introduces the Word Enrichment tool to identify characteristic words for document clusters. The focus is on organising unstructured text data into meaningful groups and explaining the found clusters.
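The following sketch reproduces the core steps with SciPy and scikit-learn (an assumption: the unit performs hierarchical clustering through Orange widgets, and the four toy documents are illustrative only).

```python
# Cosine distances between TF-IDF vectors, then hierarchical clustering.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the wolf chased the sheep",
    "a wolf and a fox hunted together",
    "the ship sailed across the sea",
    "sailors watched the stormy sea",
]

# Vectorise the documents and compute pairwise cosine distances.
vectors = TfidfVectorizer().fit_transform(docs).toarray()
distances = pdist(vectors, metric="cosine")

# Average-linkage hierarchical clustering; the dendrogram shows how
# documents merge into larger and larger groups.
tree = linkage(distances, method="average")
dendrogram(tree, labels=["wolf 1", "wolf 2", "sea 1", "sea 2"])
plt.show()
```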
xAIM - Text Mining: Topic Modelling
Topic Modelling discusses methods for discovering latent topics within a collection of text documents. It covers Latent Dirichlet Allocation (LDA), a popular technique used to extract and identify themes or topics from unstructured textual data. We contrast LDA with BERTopic, and touch upon Non-Negative Matrix Factorization (NMF) and Latent Semantic Analysis (LSA). Special techniques such as Dynamic Topic Models (DTM) and Structural Topic Models (STM) are also presented. The chapter explains the processes, advantages, and limitations of each method, and includes examples and comparisons. The focus is on providing insights into the underlying themes present in large text corpora, naming topics, and visually exploring topic properties.
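As a small illustration, here is an LDA sketch with scikit-learn (an assumption: the unit also covers BERTopic, NMF, and LSA, which follow a similar fit-and-inspect pattern; the toy corpus is invented).

```python
# Fit a two-topic LDA model and inspect the top words of each topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "heart attack patients received aspirin",
    "aspirin reduces the risk of heart disease",
    "the new vaccine protects against influenza",
    "influenza vaccine trials reported mild side effects",
]

# LDA works on raw word counts rather than TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[::-1][:5]]
    print(f"topic {i}: {top}")  # naming topics starts from their top words
```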
xAIM - Text Mining: Explaining Latent Dirichlet Allocation (LDA)
We start the more advanced part of this learning path by taking a closer look at Latent Dirichlet Allocation (LDA). This unit explains the assumptions behind the LDA method. It provides a detailed step-by-step explanation of the algorithm (based on Gibbs sampling) with an example from medicine, and compares Gibbs sampling with variational inference. Finally, it gives an overview of tools that implement LDA with additional options beyond the described method.
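To make the Gibbs-sampling view concrete, the sketch below implements the per-token resampling step of collapsed Gibbs sampling for LDA on a toy corpus (an assumption: the word ids, hyperparameters, and corpus are invented for illustration and do not come from the unit's medical example).

```python
# Collapsed Gibbs sampling for LDA on a toy corpus of word-id documents.
import numpy as np

docs = [[0, 1, 2], [0, 0, 3], [4, 5, 4], [5, 4, 2]]   # documents as word ids
V, K, alpha, beta = 6, 2, 0.1, 0.01                   # vocab size, topics, priors
rng = np.random.default_rng(0)

# Count matrices: topics per document, words per topic, totals per topic.
ndk = np.zeros((len(docs), K))
nkw = np.zeros((K, V))
nk = np.zeros(K)
z = []  # current topic assignment of every word token

# Random initialisation of topic assignments.
for d, doc in enumerate(docs):
    z.append([])
    for w in doc:
        k = rng.integers(K)
        z[d].append(k)
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

# Gibbs sweeps: resample each token's topic conditioned on all other assignments.
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1    # remove current assignment
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())              # sample a new topic
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

print(nkw)  # word-topic counts approximate the learned topics
```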
xAIM - Text Mining: Sentiment Analysis
Sentiment Analysis explores techniques for automatically determining the sentiment expressed in text data. It gives an overview of lexicon-based and machine-learning-based approaches, with an emphasis on the former. The focus is on analysing the emotional valence in text, visualising the results, and extracting relevant documents. Sentiment analysis (or opinion mining) is the task of extracting sentiment from text data. A sentiment statement comprises the opinion holder (e.g. a reviewer), optionally the time of the event, the sentiment target (a product, movie, service, ...), and the sentiment itself (positive, negative, or neutral). Furthermore, we are interested in polarity (+/-/0), intensity (high, medium, low), and/or a specific emotion (fear, anger, joy, surprise, disgust).
This unit dives into the three approaches to sentiment extraction: lexicon-based, machine-learning-based, and hybrid.
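As a small lexicon-based example, the sketch below uses NLTK's VADER analyser (an assumption: the unit works with lexicon-based sentiment in Orange, and the two review sentences are invented; the vader_lexicon data package must be downloaded beforehand).

```python
# Lexicon-based sentiment scoring with NLTK's VADER analyser.
from nltk.sentiment import SentimentIntensityAnalyzer

reviews = [
    "The staff were wonderful and the treatment worked perfectly.",
    "The waiting time was awful and nobody answered my questions.",
]

analyser = SentimentIntensityAnalyzer()
for text in reviews:
    scores = analyser.polarity_scores(text)
    # 'compound' summarises polarity in [-1, 1]; pos/neu/neg give the proportions.
    print(scores["compound"], text)
```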
xAIM - Text Mining: Semantic Analysis
We now move on to Semantic Analysis, which discusses techniques to uncover meanings in unstructured text and organise documents based on conceptual similarity. It covers constructing annotated document maps using tools like t-SNE and Gaussian mixture models. An example with patient notes from PubMed Central demonstrates creating a 2D document map to identify clusters related to different medical conditions.
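A minimal document-map sketch is shown below (an assumption: TF-IDF vectors and four invented notes stand in for the document embeddings and PubMed Central patient notes used in the unit).

```python
# Project documents to 2D with t-SNE and group them with a Gaussian mixture model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

notes = [
    "patient presents with chest pain and shortness of breath",
    "ecg shows signs of myocardial infarction",
    "persistent cough and fever for three days",
    "chest x-ray consistent with pneumonia",
]

# 2D coordinates for the document map (perplexity kept small for a tiny corpus).
vectors = TfidfVectorizer().fit_transform(notes).toarray()
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)

# A Gaussian mixture model groups nearby points into annotated clusters.
clusters = GaussianMixture(n_components=2, random_state=0).fit_predict(coords)
for (x, y), c, note in zip(coords, clusters, notes):
    print(f"cluster {c}  ({x:+.1f}, {y:+.1f})  {note}")
```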
xAIM - Text Mining: Document Networks
We conclude this learning path with a unit on Document Networks, which discusses co-occurrence networks in computational linguistics, where words (nodes) are connected (edges) if they appear together in documents. The example involves constructing a word network from Grimm's tales, setting parameters for word co-occurrence, and visualising the network to identify central words and their connections. Additional analyses like degree centrality and network clustering are also suggested.
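The sketch below builds such a co-occurrence network with NetworkX (an assumption: the unit constructs the word network from Grimm's tales in Orange, and the tokenised toy documents here are invented for illustration).

```python
# Build a word co-occurrence network and rank words by degree centrality.
from itertools import combinations

import networkx as nx

docs = [
    ["wolf", "girl", "forest"],
    ["wolf", "grandmother", "forest"],
    ["prince", "princess", "castle"],
    ["princess", "frog", "castle"],
]

# Connect two words whenever they appear together in the same document;
# edge weights count how often each pair co-occurs.
graph = nx.Graph()
for words in docs:
    for a, b in combinations(sorted(set(words)), 2):
        weight = graph[a][b]["weight"] + 1 if graph.has_edge(a, b) else 1
        graph.add_edge(a, b, weight=weight)

# Degree centrality highlights the most connected (central) words.
print(sorted(nx.degree_centrality(graph).items(), key=lambda kv: -kv[1])[:3])
```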