Natural Language Processing Advanced Learning Path - MAI4CAREU Master in AI
Natural language processing (NLP) seeks to provide computers with the ability to intelligently process human language, extracting meaning, information, and structure from text, speech, web pages, and social networks. This curriculum presents a structured learning path that begins with the fundamental elements of NLP systems, advances through the evolving techniques for text representation, introduces the latest deep learning advancements in NLP, and explores their applications in addressing current and relevant issues. It is based on the elective course offered in the Master in Artificial Intelligence programme at the University of Cyprus, which was developed with co-funding from the MAI4CAREU European project. The course is organised in four parts, which are further categorised into Introductory (four units) and Advanced (seven units) according to their difficulty. The recommended order for studying all materials is the one shown below (from 1 to 11).
Part I: Introduction
- Introduction to Natural Language Processing
- Fundamental Text Pre-Processing
Part II: Language Modeling and Classification
- Language Modelling
- Text Classification
Part III: Vector Semantics and Word Embeddings
- Vector Semantics
- Word Vector Semantics
- Distributed Contextual Embeddings
Part IV: NLP Applications and Advancements
- Using Hybrid Models to Detect Online Hate-Speech
- Linguistic Features to Identify Fake News and Misinformation
- Modelling Polarisation in News Media using NLP
- Understanding Large Language Models
MAI4CAREU - Natural Language Processing - Introduction to Natural Language Processing
This lecture serves as an entry point into Natural Language Processing (NLP), providing a panoramic view of the field. It introduces the task of automated extraction of meaning from language across various platforms, the application of NLP in commercial and industrial domains, and its role in conversational systems. The lecture also touches on the challenges of language ambiguity and the use of machine learning models to build NLP tools, setting the stage for a deep dive into the course.
MAI4CAREU - Natural Language Processing - Fundamental Text Pre-Processing
This lecture focuses on the initial steps required to prepare text data for deeper analytical processes. It dives into the use of Regular Expressions (RegEx) to identify and manipulate textual data efficiently, covering practical applications such as searching and modifying text strings in various formats. The lecture also introduces more advanced text processing techniques, such as tokenization methods including Byte-Pair Encoding (BPE) and WordPiece, which are crucial for handling languages that do not use spaces to separate words. Additionally, it addresses the complexities of word normalisation, covering methods such as case folding, stop word removal, lemmatization, and stemming. These techniques adjust text data to a standard form, enhancing both the efficiency and accuracy of subsequent NLP tasks.
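As a concrete illustration, the pre-processing steps above can be sketched in a few lines of Python; the stop-word list and suffix-stripping rules below are toy stand-ins for the fuller methods (e.g. Porter stemming) covered in the lecture.

```python
import re

def normalise(text):
    """Toy pre-processing pipeline: case folding, regex tokenisation,
    stop-word removal, and a crude suffix-stripping 'stemmer'."""
    stop_words = {"the", "a", "an", "is", "are", "and", "of"}  # tiny illustrative list
    text = text.lower()                    # case folding
    tokens = re.findall(r"[a-z]+", text)   # regex-based tokenisation
    tokens = [t for t in tokens if t not in stop_words]
    stemmed = []
    for t in tokens:
        # naive stemming: strip a few common English suffixes
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(normalise("The cats are sleeping and the dog barked"))
# ['cat', 'sleep', 'dog', 'bark']
```

Real pipelines would of course use a proper stemmer or lemmatizer and a curated stop-word list rather than these hard-coded rules.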
MAI4CAREU - Natural Language Processing - Language Modelling
This lecture focuses on the development and application of language models that compute the probabilities of sequences of words. The lecture dives into various types of language models, such as unigram, bigram, and trigram models, which are fundamental for tasks like machine translation, spell correction, and speech recognition. It covers the mathematical foundations of language modelling, including the computation of joint probabilities and conditional probabilities using the chain rule. The lecture also explores practical challenges like handling sparse data through techniques such as smoothing and discusses the limitations and generalisation of N-gram models, highlighting their application across diverse NLP tasks.
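The bigram probabilities and add-one (Laplace) smoothing described above can be sketched as follows; the tiny two-sentence corpus is purely illustrative.

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate add-one (Laplace) smoothed bigram probabilities
    P(w_i | w_{i-1}) from a list of tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]  # sentence boundary markers
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)

    def prob(prev, word):
        # add-one smoothing: every bigram gets at least one pseudo-count
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

    return prob

corpus = [["i", "like", "nlp"], ["i", "like", "cats"]]
p = train_bigram(corpus)
print(p("i", "like"))  # high: "like" follows "i" in both sentences
print(p("like", "i"))  # low: this bigram was never observed
```

Without smoothing, the unseen bigram would receive probability zero and make any sentence containing it impossible, which is exactly the sparse-data problem the lecture addresses.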
MAI4CAREU - Natural Language Processing - Text Classification
This lecture covers text classification, focusing on the methodologies and applications of assigning text to predefined categories. It surveys a variety of classification problems and introduces students to fundamental concepts such as feature extraction, vectorization techniques like Bag-of-Words (BoW), and classification algorithms including Naive Bayes. The lecture also delves into practical applications, demonstrating how text classification can be applied to areas such as sentiment analysis, spam detection, and topic categorization. Additionally, it explores advanced topics like handling imbalanced data, the importance of feature engineering, and evaluating classifier performance using metrics like precision, recall, and F-measure.
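A minimal multinomial Naive Bayes classifier over Bag-of-Words counts, with add-one smoothing, might look like the sketch below; the labelled training documents are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Multinomial Naive Bayes with add-one smoothing over
    bag-of-words counts; `docs` is a list of (tokens, label) pairs."""
    class_docs = Counter()               # documents per class
    word_counts = defaultdict(Counter)   # word counts per class
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_docs.values())

    def classify(tokens):
        best_label, best_lp = None, -math.inf
        for label in class_docs:
            lp = math.log(class_docs[label] / total_docs)          # log prior
            denom = sum(word_counts[label].values()) + len(vocab)  # smoothed denominator
            for t in tokens:
                lp += math.log((word_counts[label][t] + 1) / denom)
            if lp > best_lp:
                best_lp, best_label = lp, label
        return best_label

    return classify

# hypothetical labelled training data
train = [(["free", "prize", "now"], "spam"),
         (["meeting", "at", "noon"], "ham"),
         (["free", "offer"], "spam"),
         (["lunch", "at", "noon"], "ham")]
clf = train_nb(train)
print(clf(["free", "prize"]))    # spam
print(clf(["meeting", "noon"]))  # ham
```

Working in log space avoids numerical underflow when multiplying many small per-word probabilities, a standard trick in practice.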
MAI4CAREU - Natural Language Processing - Vector Semantics
This lecture marks the first part of our exploration of word embeddings in the Natural Language Processing course, examining the representation of word meanings in multi-dimensional space. It starts by questioning the traditional views of words as mere sequences of characters or indices, instead introducing the concept of lexical semantics, which delves into understanding words, their lemmas, and various senses. The lecture further explains the importance of capturing semantic relationships such as synonymy, antonymy, and hierarchical relationships between words using vector spaces. Techniques like one-hot encoding, Bag of Words (BoW), and more sophisticated methods such as word embeddings that capture nuanced semantic relationships in a dense vector form are covered. This approach allows for a deeper understanding of language that transcends simple word forms, facilitating advanced applications such as machine translation and sentiment analysis.
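To make the vector-space view concrete, here is a minimal sketch of cosine similarity computed over simple word-count (BoW) vectors; real systems would typically use dense embeddings rather than raw counts.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors
    represented as {word: count} mappings."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = lambda x: math.sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v))

doc1 = Counter("the cat sat on the mat".split())
doc2 = Counter("the cat lay on the rug".split())
doc3 = Counter("stock markets fell sharply today".split())
print(cosine(doc1, doc2))  # 0.75: substantial word overlap
print(cosine(doc1, doc3))  # 0.0: no shared words at all
```

The zero similarity for the third document illustrates a key limitation of sparse count vectors that dense embeddings overcome: two texts about related topics score zero unless they share surface word forms.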
MAI4CAREU - Natural Language Processing - Word Vector Semantics
The second part of the lecture continues to delve into the practical and theoretical aspects of word embeddings within Natural Language Processing. In this session the focus shifts to the distributional hypothesis, which posits that words which appear in similar contexts possess similar meanings. This part explores various methods of constructing word embeddings, including detailed explanations of techniques like Skip-gram and Continuous Bag of Words (CBOW) from the Word2Vec framework. Additionally, the lecture discusses how embeddings capture semantic and syntactic word relationships, and their application in tasks such as sentiment analysis and machine translation, illustrating the pivotal role of embeddings in modern computational linguistics.
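The skip-gram objective trains each word to predict its neighbours; the sketch below generates the (target, context) training pairs for a given window size, the first step of the Word2Vec pipeline (the neural training itself is omitted).

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs as used by Word2Vec's
    skip-gram objective: each word predicts its neighbours within
    `window` positions on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["king", "rules", "the", "land"], window=1))
# [('king', 'rules'), ('rules', 'king'), ('rules', 'the'),
#  ('the', 'rules'), ('the', 'land'), ('land', 'the')]
```

CBOW simply reverses the direction of prediction: the context words jointly predict the target instead.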
MAI4CAREU - Natural Language Processing - Distributed Contextual Embeddings
This lecture moves beyond static word representations to explore the dynamic nature of language through distributed contextual embeddings. It delves into how contextual embeddings generated by models like ELMo, GPT, and BERT provide a deeper understanding of word meanings by considering the entire sentence context, which enhances their application in complex NLP tasks. The lecture explains the architecture of these models, such as the transformer mechanism in BERT, which allows for bidirectional understanding, a crucial development over previous unidirectional models. It also covers the practical applications of these models in tasks like text classification, sentiment analysis, and language generation, demonstrating their superiority in handling nuances of language compared to traditional embeddings.
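At the heart of the transformer mechanism is scaled dot-product attention; the minimal sketch below (plain Python, a single head, no learned projections) shows how softmax-weighted scores mix value vectors, the operation that lets models like BERT condition each token's representation on its full sentence context.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over plain Python lists of vectors:
    each query scores every key, the scores are softmax-normalised,
    and the resulting weights mix the value vectors."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        m = max(scores)                          # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        weights = [e / sum(exps) for e in exps]  # softmax
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# one query attending over three key/value pairs (toy numbers)
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(attention(q, k, v))
```

Production transformers add learned query/key/value projections, multiple heads, and stacked layers on top of exactly this core operation.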
MAI4CAREU - Natural Language Processing - Using Hybrid Models to Detect Online Hate-Speech
This lecture, marking the first in a series on NLP applications, delves into the crucial task of hate-speech detection, a complex and timely issue in online communication. It begins by defining hate speech and highlighting the challenges inherent in doing so, since definitions vary significantly across different legal and cultural contexts, complicating detection efforts. The lecture then transitions to a technical exploration of the models used for hate-speech detection, introducing students to the basics of Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs). The integration of RNNs with CNNs is discussed, illustrating how this combination effectively harnesses the spatial pattern recognition capabilities of CNNs and the sequential data processing strength of RNNs for hate-speech detection. Additionally, it discusses the combination of text-, character-, and metadata-level models, and how these can be integrated and complement each other to provide more effective hate-speech identification.
MAI4CAREU - Natural Language Processing - Linguistic Features to Identify Fake News and Misinformation
The second lecture in the NLP applications series addresses the critical issue of misinformation and disinformation. It begins by defining both terms and discussing the current strategies and challenges in mitigating the spread of fake news, emphasising the sophisticated nature of misinformation and the role of compelling language and source credibility in its dissemination. The core of the lecture explores an advanced system designed to identify misinformation. Among others, the system includes a deep learning model that leverages an extensive set of linguistic features. These features include the number of words, punctuation, sentiment, and vocabulary richness, each contributing to the model's ability to discern legitimate news from fake news effectively. Furthermore, the lecture emphasises the importance of feature selection in optimising the model’s size and efficiency, ensuring it remains potent yet lightweight enough for practical use.
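A few of the hand-crafted linguistic features mentioned above can be sketched as follows; the sentiment lexicons are tiny hypothetical word lists standing in for a real sentiment resource, and a deployed system would feed such features into a trained model rather than inspect them directly.

```python
import re

def linguistic_features(text):
    """Extract a handful of surface linguistic features: word count,
    punctuation count, vocabulary richness (type-token ratio), and a
    crude lexicon-based sentiment score."""
    positive = {"great", "good", "amazing"}    # illustrative lexicon only
    negative = {"bad", "terrible", "shocking"}
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "num_words": len(words),
        "num_punct": sum(1 for ch in text if ch in "!?.,;:"),
        "vocab_richness": len(set(words)) / len(words) if words else 0.0,
        "sentiment": sum((w in positive) - (w in negative) for w in words),
    }

feats = linguistic_features("SHOCKING! You won't believe this amazing, terrible story.")
print(feats)
```

Keeping each feature cheap to compute is what makes the resulting model lightweight enough for practical use, which is why feature selection matters for the system's size and efficiency.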
MAI4CAREU - Natural Language Processing - Modelling Polarisation in News Media using NLP
The third and final lecture in the NLP applications series focuses on the phenomenon of polarisation in social and political contexts. It begins by defining polarisation and providing an overview of the theoretical background surrounding the phenomenon. The lecture then explores how Natural Language Processing (NLP) can be utilised to model, analyse and understand polarisation through various techniques. Key NLP tasks such as Named Entity Recognition, Entity Relationship Modeling, and Sentiment Analysis are employed to extract and interpret data from text, identifying polarised topics and the sentiment attitudes of entities towards these topics. This involves analysing the nature of relationships between different entities and the sentiment conveyed in their mentions within news articles or discussions. By constructing models that map out these relationships and sentiments, NLP helps reveal underlying patterns of polarisation.
MAI4CAREU - Natural Language Processing - Understanding Large Language Models (LLMs)
This lecture provides an in-depth exploration of the evolution and functioning of Large Language Models (LLMs) in NLP. It begins by defining what constitutes an LLM, connecting it with the previous lecture on language modelling, and distinguishing between medium-sized models like BERT and RoBERTa and "very" large models such as GPT-3 and GPT-4. It discusses the computational demands and the advanced capabilities of these models, emphasising their application across a broad range of NLP tasks through fine-tuning and in-context learning, including zero-shot and few-shot techniques. The lecture also covers the pre-training and adaptation processes that enable these models to perform diverse tasks. It delves deeper into the transformer architecture that underpins these models, discussing the impact of model scale on performance and the emergent properties of larger models. The lecture provides a comprehensive view of how LLMs have transformed the landscape of NLP, offering unprecedented accuracy and efficiency in language understanding and generation.