Dimensionality Reduction
Techniques that reduce the number of features in a dataset while preserving the most important patterns, making data easier to visualise, process, and model.
Dimensionality reduction is the process of reducing the number of features (variables, columns) in a dataset while retaining as much useful information as possible. It is essential when dealing with high-dimensional data: datasets with hundreds or thousands of features.
Why reduce dimensions
- Visualisation – humans can see two or three dimensions. Reducing a dataset to two dimensions lets you plot and visually inspect clusters, outliers, and patterns.
- Performance – many algorithms slow down dramatically or break entirely with too many features. Fewer dimensions mean faster training and prediction.
- Noise reduction – some features contain more noise than signal. Removing them improves model performance.
- Curse of dimensionality – as dimensions increase, data becomes increasingly sparse. Models need exponentially more data to perform well in high-dimensional spaces.
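The curse of dimensionality can be seen directly by measuring distances between random points as the dimension grows. The sketch below (illustrative data, NumPy only) shows that the relative spread between the nearest and farthest neighbour shrinks in high dimensions, which is what makes distance-based methods struggle:

```python
import numpy as np

rng = np.random.default_rng(0)

for dims in (2, 10, 100, 1000):
    points = rng.random((500, dims))  # 500 random points in the unit hypercube
    # Distances from the first point to all the others
    dists = np.linalg.norm(points - points[0], axis=1)[1:]
    # Relative spread: how much farther the farthest point is than the nearest
    ratio = (dists.max() - dists.min()) / dists.min()
    print(f"{dims:>4} dims: relative spread of distances = {ratio:.3f}")
```

As the dimension rises, the printed ratio falls toward zero: every point ends up roughly equally far from every other, so "nearest neighbour" carries less and less information.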
Common techniques
- PCA (Principal Component Analysis) – finds the directions of maximum variance in the data and projects it onto those axes. The most widely used linear technique.
- t-SNE – a non-linear technique that excels at visualisation, preserving local structure. Popular for exploring clusters in high-dimensional data.
- UMAP – similar to t-SNE but faster and better at preserving global structure. Increasingly preferred for both visualisation and preprocessing.
- Autoencoders – neural networks that learn a compressed representation. More flexible than linear methods but harder to interpret.
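Of these, PCA is the simplest to try first. A minimal sketch using scikit-learn (assumed installed; the synthetic dataset and the choice of two components are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# 200 samples with 50 features that secretly depend on only 5 latent factors
base = rng.normal(size=(200, 5))
X = base @ rng.normal(size=(5, 50)) + 0.01 * rng.normal(size=(200, 50))

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)            # project onto the two highest-variance axes

print(X_2d.shape)                      # (200, 2)
print(pca.explained_variance_ratio_)   # share of total variance each axis retains
```

The `explained_variance_ratio_` attribute is a useful diagnostic: if the first few components capture most of the variance, the data is effectively low-dimensional and the reduction loses little information.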
Feature selection vs. feature extraction
- Feature selection keeps a subset of original features, discarding the rest. You can still interpret what each remaining feature means.
- Feature extraction (PCA, t-SNE, UMAP) creates new features that combine the originals. The new features are more compact but harder to interpret.
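The difference is easy to see side by side. In this sketch (illustrative data, scikit-learn assumed installed), selection returns indices of original columns you can still name, while extraction returns new columns that are blends of all the originals:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # label driven by the first two features

# Feature selection: keeps 2 of the 20 original columns, which stay interpretable
selector = SelectKBest(f_classif, k=2).fit(X, y)
kept = selector.get_support(indices=True)
print(kept)                                 # indices of the retained original features

# Feature extraction: builds 2 brand-new columns combining all 20 originals
X_new = PCA(n_components=2).fit_transform(X)
print(X_new.shape)                          # (100, 2)
```

With selection you can report "the model uses columns `kept`"; with extraction you can only say "the model uses two linear combinations of everything", which is more compact but harder to explain.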
Practical considerations
Always fit the dimensionality reduction on the training set only, then apply that same fitted transformation to the test set. Fitting on the test set causes data leakage and inflated performance estimates.
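In scikit-learn terms, this means calling `fit_transform` on the training split and only `transform` on the test split. A minimal sketch (illustrative data; split sizes are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 30))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=1)

pca = PCA(n_components=5)
X_train_red = pca.fit_transform(X_train)  # learn the axes from training data only
X_test_red = pca.transform(X_test)        # reuse those axes; never fit on test data

print(X_train_red.shape, X_test_red.shape)  # (240, 5) (60, 5)
```

Wrapping the reducer and the downstream model in a scikit-learn `Pipeline` enforces this automatically, since the pipeline only ever fits on the data passed to its own `fit` call.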
Why This Matters
Dimensionality reduction is a practical tool that makes AI projects feasible when you have wide datasets with many features. It improves model performance, reduces computing costs, and enables visual exploration of complex data. Understanding it helps you recognise when a project is struggling with too many features rather than too little data.