Data Advanced Learning Path - Transforming data into knowledge
The first learning path on “Data analytics” introduced the basics of the field. In this second learning path, we will dig much deeper into advanced data analytics. This broad term covers many activities that all revolve around the transformation of data into knowledge: in other words, analyzing raw data beyond common statistical analysis and using it to establish a generic model that lets us make predictions, classify new data, or group similar data together (also known as data clustering). Advanced analytics is used in many fields to optimize activities and processes, provide reliable forecasts or improve our understanding of the surrounding world. The boundary between advanced data analytics and data science is a blurred one; both fields share algorithms and approaches. But data science goes way beyond the methods that will be presented here, including very advanced data processing such as neural networks, deep learning, Natural Language Processing… This second learning path will be much more technical and practical than the first one, and will include detailed descriptions of some mathematical methods as well as tools to implement them efficiently.
An Introduction to Scikit-Learn: Machine Learning in Python
Let us first start by introducing tools that help with data analytics tasks. Scikit-learn is an open-source library of tools developed for the Python programming language. It offers simple and efficient tools for predictive data analysis, like logistic regression, clustering, density estimation… as well as tools for data transformation and visualization. You can complete this reading with this other tutorial. If you don’t feel like installing Python and all the related packages on your own computer, you can simply go to Google Colab, an online platform provided by Google that lets you develop and run your code directly in your browser. An alternative is to install JupyterLab, which runs on your own computer and provides a user-friendly environment to write and organize your code.
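To give you a first feel for scikit-learn, here is a minimal sketch of its typical workflow: load a dataset, split it into training and test sets, fit a model and score it. It assumes scikit-learn is installed (for instance with pip install scikit-learn); the Iris dataset and the scaler-plus-logistic-regression pipeline are illustrative choices, not something prescribed by the tutorials linked above.

```python
# Minimal scikit-learn workflow: load data, split, fit, score.
# Illustrative sketch; the dataset and model are arbitrary choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                 # features and labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)        # hold out a test set

# A pipeline combining a data transformation step and a predictive model
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)                       # learn from the training data
print(model.score(X_test, y_test))                # accuracy on unseen data
```

Almost every estimator in scikit-learn follows this same fit/predict/score pattern, which is what makes the library so easy to explore.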
D3.js tutorial
Another useful tool for data visualization is D3.js, a widely used, free, open-source JavaScript library for visualizing data. It relies on open standards like SVG and Canvas and is mainly meant to be used on the web. It is a low-level toolbox that lets you compose its building blocks into a workflow that is truly useful to you. It supports a wide variety of graph types, data ingestion, user interactions and layouts.
This tutorial takes you through the main components of D3.js and its prominent use cases. It also includes some additional resources. Another introductory resource is offered here.
Linear regression: House Price Prediction in Python
Let’s now dive into a simple yet powerful data analytics method that looks at existing data, derives a model from it and uses that model to make predictions about previously unseen data. This is called a regression algorithm.
For instance, imagine you have collected data about the buying preferences of a supermarket’s customers. For each customer, you know where they live, their age, and maybe a few other details; you also know how much they spend in your shop. By analyzing this kind of data, you can derive a model of a customer, i.e. how a typical customer behaves according to their characteristics, and thus you can predict how much a new customer would spend, given where they live, their age and a few other details. This method is called linear regression, and this video introduces it and presents its implementation using scikit-learn.
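As a rough illustration of the idea, here is a small sketch of linear regression with scikit-learn. The customer features (age, distance to the shop) and the spending figures are invented for the example and do not come from the video.

```python
# Linear regression sketch on made-up customer data.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [age, distance to the shop in km]; target: yearly spend in euros
X = np.array([[25, 1.0], [34, 3.5], [47, 0.5], [52, 8.0], [61, 2.0]])
y = np.array([1200, 950, 2100, 600, 1700])

model = LinearRegression()
model.fit(X, y)                         # derive the model from existing data

new_customer = np.array([[40, 2.5]])    # a previously unseen customer
print(model.predict(new_customer))      # predicted yearly spend
print(model.coef_, model.intercept_)    # the fitted linear coefficients
```

The coefficients tell you how much each characteristic contributes to the prediction, which is one of the reasons linear regression is so easy to interpret.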
k-Nearest Neighbours algorithm
After looking at an algorithm that predicts values for previously unseen data, i.e. a regression algorithm, let us now look at an approach to identify the class of new data, based on existing data. This is called a classification problem: for instance, telling cats and other animals apart, predicting the vote cast by a citizen, or the rating a user would give a song on a streaming platform. One way to achieve this type of task is to look at the ‘neighbors’ of the new data point you want to classify and, based on their class or choice, decide the one for your new data. This may sound complicated, but after reading this article, everything will become clearer. And for a scikit-learn implementation of the algorithm, you can watch this video.
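To make the idea more concrete, here is a minimal sketch of the k-Nearest Neighbours classifier in scikit-learn; the toy points and their two classes are made up purely for illustration.

```python
# k-Nearest Neighbours sketch on invented 2-D points.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.2], [1.5, 0.8], [1.1, 1.0],    # class 0 examples
              [4.0, 4.2], [4.5, 3.8], [3.9, 4.1]])   # class 1 examples
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # look at the 3 closest neighbours
knn.fit(X, y)

print(knn.predict([[1.2, 1.1]]))           # lands among class 0 points
print(knn.predict_proba([[3.0, 3.0]]))     # class shares among the 3 neighbours
```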
k-Means algorithm for data clustering
So far, we have looked at labelled data, i.e. data already associated with a value or a class. In machine learning terms, this is called supervised learning. But now, imagine you have data with no such label, like a raw list of customers. You may still want to check whether some patterns are shared by groups of customers. This task is called ‘clustering’, i.e. identifying groups of data that share similarities, that are ‘close’ to each other. It could be groups of products or groups of users; this would then allow you, for instance, to build a recommender system by proposing to a customer products that are in the same group as those they previously bought. In machine learning terms, this is called unsupervised learning, because we work with unlabelled data.
In this article, you will learn about the principles of the k-Means algorithm; some animations will help you understand it properly. You will then find a description of an implementation of the algorithm using scikit-learn.
You can find additional information in this other article or in this video.
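As a quick illustration, here is a small sketch of k-Means clustering with scikit-learn; the artificial two-dimensional points and the choice of three clusters are assumptions made for the example, not taken from the articles above.

```python
# k-Means sketch on artificial, unlabelled 2-D points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three artificial groups of points scattered around different centres
X = np.vstack([rng.normal(centre, 0.5, size=(30, 2))
               for centre in ([0, 0], [5, 5], [0, 5])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # cluster index assigned to each point

print(kmeans.cluster_centers_)      # the 3 centres found by the algorithm
print(labels[:10])                  # cluster membership of the first points
```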
Python for Data Science and Machine Learning Bootcamp
If you want to dig further into the technical tools presented earlier, deepen your knowledge of the methods we saw and learn about new ones, just follow this online course. It covers some of the elements learned previously, but in more detail, and also addresses other types of methods, like Decision Trees, Principal Component Analysis, Natural Language Processing and others… It is definitely a very nice way to take your skills to the next level.
Mastering data visualization in D3.js
If you are more into the visualization part, this online course is also for you. It is a well-organized, well-structured and extensive presentation of all of D3.js’s main features. They are presented progressively and illustrated by 4 class projects that give you the opportunity to practice your skills.
Logistic Regression in Machine Learning
We can now have a look at another data analysis method, one that predicts the probability that a new item belongs to one class or another. For instance, it could be a spam classifier that looks at some characteristics of the messages you receive and decides whether or not they are spam. It falls into the class of supervised classification algorithms, because it has to be trained on data (i.e. emails) that has already been labelled as spam or non-spam. This article presents the method, giving its mathematical foundation and its scikit-learn implementation. If you are not a math fan, at least read through the first part to get an intuition of how this algorithm works.
As a bonus, have a look at this video for a comparison between linear and logistic regression.
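To give a flavour of how this looks in practice, here is a hedged sketch of a tiny spam filter built with scikit-learn. The messages and labels are invented, and turning the text into word counts with CountVectorizer is just one possible preprocessing choice, not necessarily the one used in the article.

```python
# Logistic regression spam filter sketch on made-up messages (1 = spam).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

messages = ["win a free prize now", "meeting moved to friday",
            "cheap pills, limited offer", "lunch tomorrow?",
            "claim your free reward", "project report attached"]
labels = [1, 0, 1, 0, 1, 0]

# Turn the text into word counts, then fit a logistic regression on top
spam_filter = make_pipeline(CountVectorizer(), LogisticRegression())
spam_filter.fit(messages, labels)

print(spam_filter.predict(["free offer just for you"]))        # likely spam
print(spam_filter.predict_proba(["see you at the meeting"]))   # class probabilities
```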
Which machine learning algorithm should I use?
In this learning path, we have presented various data analysis algorithms, which are also heavily used in machine learning. Of course, many more exist, targeted at different kinds of data and different types of output or analysis.
But how do you choose between those approaches? How do you know which one best fits the task and data you are working on?
Have a look at this article, which reviews the most commonly used approaches, compares them and condenses them into a useful cheat sheet that you can use to select the right algorithm.
How to Compare Machine Learning Models and Algorithms
A final important point we wanted to touch upon in this journey through data analytics and data science is the evaluation of an algorithm’s performance. Are there good or bad algorithms? What criteria can be taken into account to evaluate their quality, or that of their outcome? The time required to analyse the data? The amount of data needed before predictions or classifications can be made? How should misclassifications or regression errors be interpreted and dealt with?
Those questions are addressed in this article, which also includes interactive resources to get a better understanding of this central question in data analysis: ‘how well am I doing?’
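As a modest illustration of how such a comparison can be set up, here is a sketch that scores two candidate models with 5-fold cross-validation in scikit-learn; the dataset and the two models are arbitrary examples, not the ones discussed in the article.

```python
# Comparing two classifiers with 5-fold cross-validation (illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "k-nearest neighbours": make_pipeline(StandardScaler(),
                                          KNeighborsClassifier(n_neighbors=5)),
}

for name, model in candidates.items():
    # Train and score on 5 different splits of the data
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The mean and spread of the fold scores give one simple basis for comparing models; accuracy is only one possible metric among many.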