

How to transform a dataset for a machine learning model

What you will learn

How to fill in missing values in numerical and categorical variables

How to encode the categorical variables

How to transform the numerical variables

How to scale the numerical variables

Principal Component Analysis and how to use it

How to apply oversampling using SMOTE

How to use several useful objects from the scikit-learn library

Description

In this course, we are going to focus on pre-processing techniques for machine learning.

Pre-processing is the set of manipulations that transform a raw dataset into something a machine learning model can actually use. It is necessary to make our data suitable for certain machine learning models, to reduce dimensionality, to better identify the relevant data, and to increase model performance. It’s the most important part of a machine learning pipeline and it can strongly affect the success of a project. In fact, if we don’t feed a machine learning model correctly shaped data, it won’t work at all.

Sometimes, aspiring Data Scientists start studying neural networks and other complex models and forget to learn how to manipulate a dataset so that their algorithms can use it. As a result, they fail to create good models, and only at the end do they realize that good pre-processing would have saved them a lot of time and improved the performance of their algorithms. So, mastering pre-processing techniques is a very important skill. That’s why I have created an entire course that focuses only on data pre-processing.


With this course, you are going to learn:

  1. Data cleaning
  2. Encoding of the categorical variables
  3. Transformation of the numerical features
  4. Scikit-learn Pipeline and ColumnTransformer objects
  5. Scaling of the numerical features
  6. Principal Component Analysis
  7. Filter-based feature selection
  8. Oversampling using SMOTE

All the examples are given using the Python programming language and its powerful scikit-learn library. The environment used is Jupyter, a standard in the data science industry. Every section of this course ends with practical exercises, and all the Jupyter notebooks are downloadable.

Language: English

Content

Introduction

Introduction to the course
Numerical and categorical variables
The dataset
Required Python packages
Jupyter notebooks

Data cleaning

Introduction to data cleaning
Selecting numerical and categorical variables
Cleaning the numerical features
Cleaning the categorical features
KNN blank filling
ColumnTransformer and make_column_selector
Exercises
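
As a preview of the cleaning steps listed in this section, here is a minimal sketch of filling missing values in numerical and categorical columns with scikit-learn's ColumnTransformer and make_column_selector. The toy DataFrame and its column names are hypothetical, not the course dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data with blanks in both a numerical and a categorical column
df = pd.DataFrame({
    "age": [25, None, 40, 33],
    "income": [30000, 45000, None, 52000],
    "city": ["Rome", None, "Milan", "Rome"],
})

cleaner = ColumnTransformer([
    # numerical blanks: filled from the nearest neighbours (KNN blank filling)
    ("num", KNNImputer(n_neighbors=2), make_column_selector(dtype_include="number")),
    # categorical blanks: filled with the most frequent value
    ("cat", SimpleImputer(strategy="most_frequent"), make_column_selector(dtype_exclude="number")),
])

print(cleaner.fit_transform(df))
```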

Encoding of the categorical features

Introduction to the encoding of categorical variables
One-hot encoding
Ordinal encoding
Label encoding of the target variable
Exercise
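
A minimal sketch of the three encoders covered in this section, on hypothetical toy columns; note that sparse_output is the scikit-learn >= 1.2 name of the OneHotEncoder parameter (older versions call it sparse).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

X = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],     # nominal: no natural order
    "size": ["small", "large", "medium", "small"],  # ordinal: natural order
})
y = pd.Series(["yes", "no", "yes", "yes"])          # categorical target

# One-hot encoding for nominal features
onehot = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
print(onehot.fit_transform(X[["color"]]))

# Ordinal encoding with an explicit order of the categories
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(ordinal.fit_transform(X[["size"]]))

# Label encoding, used only for the target variable
print(LabelEncoder().fit_transform(y))
```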

Transformations of the numerical features

Introduction to transformations
Power Transformation
Binning
Binarizing
Applying an arbitrary transformation
Exercise
About power transformations
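
A minimal sketch of the transformers this section covers, applied to a hypothetical right-skewed feature.

```python
import numpy as np
from sklearn.preprocessing import (Binarizer, FunctionTransformer,
                                   KBinsDiscretizer, PowerTransformer)

# Hypothetical right-skewed numerical feature
x = np.array([[1.0], [2.0], [3.0], [10.0], [50.0]])

# Power transformation (Yeo-Johnson by default) to make the distribution more Gaussian
print(PowerTransformer().fit_transform(x))

# Binning into 3 ordinal bins based on quantiles
print(KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile").fit_transform(x))

# Binarizing around a threshold
print(Binarizer(threshold=5.0).fit_transform(x))

# Applying an arbitrary transformation, here log(1 + x)
print(FunctionTransformer(np.log1p).fit_transform(x))
```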

Pipelines

Define a transformation pipeline
Pipelines and ColumnTransformer together
Exercises
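
A minimal sketch of how Pipeline and ColumnTransformer can be combined, which is the pattern this section builds up; the toy DataFrame is hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One pipeline per column type: impute, then scale or encode
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# ColumnTransformer routes each pipeline to the matching columns
preprocessing = ColumnTransformer([
    ("num", numeric_pipe, make_column_selector(dtype_include="number")),
    ("cat", categorical_pipe, make_column_selector(dtype_exclude="number")),
])

df = pd.DataFrame({"age": [25, None, 40], "city": ["Rome", "Milan", None]})
print(preprocessing.fit_transform(df))
```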

Scaling

Introduction to scaling
Normalization, Standardization, Robust scaling
Exercise
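
A minimal sketch of the three scalers compared in this section, applied to a hypothetical feature containing one outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical feature with an outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [500.0]])

print(MinMaxScaler().fit_transform(x))    # normalization into the [0, 1] range
print(StandardScaler().fit_transform(x))  # standardization: zero mean, unit variance
print(RobustScaler().fit_transform(x))    # robust scaling: median and interquartile range
```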

Principal Component Analysis

Introduction to PCA
How to perform PCA
Exercise
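
A minimal sketch of performing PCA after scaling, keeping enough components to explain 95% of the variance; the random data is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second feature is almost a copy of the first
rng = np.random.default_rng(0)
a = rng.normal(size=100)
X = np.column_stack([a, 2 * a + rng.normal(scale=0.1, size=100), rng.normal(size=100)])

# Scale first, then keep enough components to explain 95% of the variance
pca_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
])
X_reduced = pca_pipe.fit_transform(X)

print(X_reduced.shape)                                        # fewer columns than X
print(pca_pipe.named_steps["pca"].explained_variance_ratio_)  # variance per component
```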

Filter-based feature selection

Introduction to feature selection
Numerical features, numerical target
Numerical features, categorical target
Categorical features, numerical target
Categorical features, categorical target
Feature importance according to a model
A comment on mutual information
A comment on feature selection with categorical variables
Exercises
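
A minimal sketch of filter-based selection with SelectKBest for the "numerical features, categorical target" case; f_regression, chi2 and the mutual information scores cover the other combinations. The synthetic dataset is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical dataset: 10 numerical features, only 3 of them informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

# ANOVA F-test between each numerical feature and the categorical target
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # (200, 3)
print(selector.get_support())  # boolean mask of the kept features
```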

A complete pipeline

An example of a complete pipeline
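
A minimal sketch of what a complete pipeline of this kind can look like: cleaning and encoding through a ColumnTransformer, then PCA, then a model. The toy data and the choice of LogisticRegression are hypothetical, not the example used in the course.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    # dense output so that PCA can consume it (sparse_output needs scikit-learn >= 1.2)
    ("encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])

model = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", numeric_pipe, make_column_selector(dtype_include="number")),
        ("cat", categorical_pipe, make_column_selector(dtype_exclude="number")),
    ])),
    ("pca", PCA(n_components=0.95)),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Hypothetical training data
df = pd.DataFrame({
    "age": [25, 32, None, 41, 29, 50],
    "income": [30, 45, 38, None, 52, 61],
    "city": ["Rome", "Milan", "Rome", None, "Turin", "Milan"],
})
y = [0, 1, 0, 1, 1, 0]

model.fit(df, y)
print(model.predict(df))
```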

Oversampling

Introduction to SMOTE
How to perform SMOTE
Exercise
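
A minimal sketch of SMOTE oversampling with the imbalanced-learn (imblearn) package; the synthetic imbalanced dataset is hypothetical.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Hypothetical imbalanced dataset: roughly 90% of one class, 10% of the other
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating between neighbours
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```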

General guidelines

Practical suggestions