
Learn Everything About Dimensionality Reduction

Curse of Dimensionality

The training data fed to a machine learning algorithm for predictive modeling is organized into rows and columns. The columns act as inputs to the model and are also called features or attributes; they define the dimensionality of the dataset, while the rows are data points in that n-dimensional feature space.

As the number of dimensions grows, it becomes harder and harder to visualize and understand the data as a whole, and common data transformation techniques become difficult to apply. A complex dataset can have hundreds, thousands, or even millions of dimensions. The computational cost and time required to train a model increase with the number of dimensions, and high dimensionality also often leads to overfitting, where the model performs well on the training set but not on the test set. This collection of problems is known as the 'Curse of Dimensionality'.


What is Dimensionality Reduction?

Dimensionality reduction refers to techniques that bring data points from a high-dimensional space down to a lower-dimensional one. It is the process of removing unnecessary features and keeping only a few, while trying to preserve most of the relevant information that truly contributes to the model output, so that accuracy remains good.

When there are many correlated features, this technique helps remove redundant data without any major loss of information. It is performed after data processing and data preparation and before ML modeling, and the same transformation must be applied to the test data before making any predictions.
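As a rough sketch of this train/test discipline (assuming scikit-learn and its built-in digits dataset purely for illustration, with an arbitrary choice of 10 components), the reducer is fitted on the training features only and the same fitted transform is then reused on the test features:

```python
# A minimal sketch (illustrative dataset and component count): fit the
# reducer on the training features only, then reuse the fitted transform
# on the test features before prediction.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)                 # 64 original features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pca = PCA(n_components=10)
X_train_reduced = pca.fit_transform(X_train)        # learn the projection on train
X_test_reduced = pca.transform(X_test)              # apply it unchanged to test
print(X_train_reduced.shape, X_test_reduced.shape)
```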


Approaches to Dimensionality Reduction

  1. Projection

In many real-world datasets, only a few columns actually change across different output values while most of the other columns stay nearly constant, so the data points lie very close to each other in a lower-dimensional space.


For example, imagine 3D data points projected perpendicularly onto the 2-dimensional XY plane: the X and Y coordinates of the projected points become the new data points in two dimensions.

So, when we project the points perpendicularly onto a lower-dimensional plane, we can easily bring them from the higher-dimensional space down to the lower-dimensional one. However, this approach does not work for every spread of data; you need to check how the data is distributed before projecting it to reduce the number of dimensions.
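A minimal sketch of the idea, using NumPy with made-up 3D points whose third coordinate barely varies, so a perpendicular projection onto the XY plane loses almost nothing:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 3D data: x and y vary a lot while z is almost constant,
# so the points already lie close to a 2D plane.
x = rng.normal(size=100)
y = rng.normal(size=100)
z = 0.01 * rng.normal(size=100)
points_3d = np.column_stack([x, y, z])

# Perpendicular projection onto the XY plane: simply drop the z coordinate.
points_2d = points_3d[:, :2]
print(points_3d.shape, points_2d.shape)  # (100, 3) (100, 2)
```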

  2. Manifold Learning

Techniques such as Principal Component Analysis and Linear Discriminant Analysis work well for linear projections of the data, but they miss the non-linear structure of the data, which might hold interesting patterns or insights.

Swiss Roll dataset (containing non-linear structure)

Manifold Learning is a way to make algorithms such as Principal Component Analysis sensitive to non-linear structure in data. Manifold learning is typically unsupervised, i.e., it learns the structure from the high-dimensional data itself without any predetermined classification.
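As an illustrative sketch (assuming scikit-learn; Isomap is used here as one example of a manifold learning algorithm, and the parameter values are not tuned), the Swiss Roll can be "unrolled" into two dimensions:

```python
# Illustrative sketch: manifold learning on the Swiss Roll.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, color = make_swiss_roll(n_samples=1500, random_state=0)  # 3D, non-linear structure

# Isomap "unrolls" the manifold into 2D by preserving geodesic
# (along-the-surface) distances instead of straight-line distances.
embedding = Isomap(n_neighbors=10, n_components=2)
X_unrolled = embedding.fit_transform(X)
print(X.shape, X_unrolled.shape)  # (1500, 3) (1500, 2)
```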


Techniques for Dimensionality Reduction

The most common reduction technique is Feature Selection, which uses scoring or statistical methods to select which features to keep and which ones to discard.

1. Feature Selection

In this approach, a small subset of useful features is chosen from all the available features and the remaining redundant columns are removed; whether a feature is kept or not is decided based on its effect on model performance. The main methods for feature selection are listed below, followed by a short sketch of each:

  • Filter Methods - Features are scored with statistical measures (for example, correlation with the target), independently of any model. Multiple subsets of the original columns can be formed, and in the end only the optimal ones are used for further modeling.
  • Wrapper Methods - These work similarly to filter methods, but model performance is used to evaluate each candidate feature subset. They usually give better results than filter methods, but building and evaluating multiple models makes the process more complex and expensive.
  • Embedded Methods - Embedded methods rely on the model training itself: they observe how accuracy and error change over the training iterations, and the feature importances produced after training show more clearly whether a feature contributes or is redundant.
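The following sketch shows one minimal example of each family, assuming scikit-learn and its built-in breast cancer dataset; the choice of estimators and of 10 features is illustrative only:

```python
# Illustrative sketches of the three families of feature selection.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)          # 30 original features

# Filter: score each feature independently (ANOVA F-test) and keep the top 10.
X_filter = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Wrapper: repeatedly fit a model and drop the weakest features (recursive elimination).
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10)
X_wrapper = rfe.fit_transform(X, y)

# Embedded: the trained model itself exposes feature importances as a by-product.
forest = RandomForestClassifier(random_state=0).fit(X, y)
importances = forest.feature_importances_

print(X_filter.shape, X_wrapper.shape, importances.shape)  # (569, 10) (569, 10) (30,)
```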


2. Feature Extraction

Techniques from linear algebra can also be used for dimensionality reduction; another common technique is Principal Component Analysis (PCA).

Principal Component Analysis

PCA is a dimensionality reduction technique often used to reduce the dimensions of large datasets, by transforming a large set of variables into a smaller one. Reducing the number of dimensions comes at the cost of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity.

The basic idea behind PCA is to reduce the number of variables in a dataset while preserving as much information as possible. Below are the steps involved when performing dimensionality reduction using PCA:

1. Standardization – If one variable has a much larger range than another, the variable with the larger range can dominate the analysis. This step standardizes the range of the continuous initial variables so that each of them contributes equally, which prevents biased results.

Mathematically, this is done by subtracting the variable's mean from each data point and dividing by the variable's standard deviation.
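A small NumPy sketch of standardization on a made-up two-feature matrix:

```python
import numpy as np

# Hypothetical raw feature matrix (rows = samples, columns = features).
X = np.array([[170.0, 65.0],
              [160.0, 72.0],
              [180.0, 80.0]])

# Standardize each column: subtract its mean, divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0).round(6))  # ~0 for every column
print(X_std.std(axis=0))            # 1 for every column
```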

2. Covariance matrix – Calculating the covariance matrix helps us to see if there is any relationship between different variables. Sometimes they are highly correlated in such a way that they contain redundant information.

For an n-dimensional dataset, the covariance matrix is an n×n symmetric matrix: the entries above and below the main diagonal mirror each other, and the main diagonal holds the variance of each variable (which is 1 after standardization).

A positive covariance means one variable increases as the other increases, whereas a negative covariance means one variable decreases as the other increases.
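A short NumPy sketch of computing the covariance matrix of standardized data (the random data here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                    # illustrative 3-feature dataset
X_std = (X - X.mean(axis=0)) / X.std(axis=0)     # standardized as in step 1

# np.cov treats rows as variables by default, so pass rowvar=False
# when the features are stored in columns.
cov_matrix = np.cov(X_std, rowvar=False)
print(cov_matrix.shape)                          # (3, 3)
print(np.allclose(cov_matrix, cov_matrix.T))     # True: the matrix is symmetric
```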

3. Calculate the Eigenvectors and Eigenvalues of the covariance matrix to identify the principal components - Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables.

These combinations are done in such a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components.
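A NumPy sketch of this step, continuing from a standardized matrix and its covariance matrix (the data is again illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))                    # illustrative data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov_matrix = np.cov(X_std, rowvar=False)

# eigh is intended for symmetric matrices such as the covariance matrix;
# it returns eigenvalues in ascending order, with eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Reorder from largest to smallest eigenvalue: the first eigenvector is then
# the direction of maximum variance, i.e. the first principal component.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
print(eigenvalues)
```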

4. Feature Vector - In this step, we choose whether to keep all of these principal components or discard those of lesser significance (the ones with low eigenvalues), and form a matrix from the eigenvectors of the components we keep; this matrix is called the feature vector.

The feature vector is simply a matrix whose columns are the eigenvectors of the components we decided to keep. For example, if we keep only k of the original n eigenvectors, the final data will have only k dimensions.

5. Recast the data along the principal component axes – This step uses the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes onto the axes represented by the principal components (hence the name Principal Component Analysis).
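Putting steps 1 to 5 together, here is a NumPy-only sketch with made-up data and an arbitrary choice of k = 2 components:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))                        # 5 original features

# Steps 1-3: standardize, covariance matrix, eigen decomposition.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov_matrix = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 4: keep the k eigenvectors with the largest eigenvalues (the feature vector).
k = 2
feature_vector = eigenvectors[:, :k]                 # shape (5, 2)

# Step 5: recast the data onto the principal component axes.
X_reduced = X_std @ feature_vector                   # shape (200, 2)
print(X_reduced.shape)
```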


Advantages of Dimensionality Reduction

1.  Removing irrelevant columns from the dataset makes data transformation and visualization easier.

2.  Fewer features mean lower computational cost and shorter training time.

Disadvantages of Dimensionality Reduction

1.  Performing such techniques can result in some loss of information.

2.  There is no pre-defined number of principal components to keep; a rule of thumb (for example, retaining enough components to explain around 95% of the variance, as sketched below) is usually applied.
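One common heuristic, sketched below with scikit-learn on its built-in digits dataset, is to keep just enough components to explain roughly 95% of the variance:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)                               # keep every component for now
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components)                                  # components needed for ~95% variance
```

Scikit-learn can also do this selection directly: passing a fraction such as PCA(n_components=0.95) keeps just enough components to explain that share of the variance.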

  • Aniket Jalasakare
  • Sep, 18 2022
