Clustering
An unsupervised machine learning technique that groups similar data points together without predefined labels, revealing natural patterns in data.
Clustering is a type of unsupervised machine learning that automatically groups similar data points together. Unlike classification, where you tell the model what categories exist, clustering discovers the categories on its own by finding natural patterns in the data.
How clustering works
Clustering algorithms measure similarity between data points, typically using distance metrics, and group the most similar points together. The algorithm does not know what the groups represent; it only knows which data points resemble each other.
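As a minimal sketch of what "measuring similarity" can look like, the snippet below uses Euclidean distance, one common distance metric among several. The customer names and feature values are hypothetical and chosen only for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical customers described by (monthly visits, average basket value).
points = {
    "A": (2.0, 15.0),
    "B": (3.0, 17.0),
    "C": (20.0, 90.0),
}

def most_similar(name, points):
    """Return the name of the other point that lies closest to `name`."""
    others = (n for n in points if n != name)
    return min(others, key=lambda n: euclidean(points[name], points[n]))

print(most_similar("A", points))
```

Note that the two features are on different scales here, so the larger one dominates the distance. In practice features are usually rescaled before clustering.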
Common clustering algorithms
- K-means: you specify the number of clusters (K), and the algorithm alternates between assigning each point to the nearest cluster centre and recomputing each centre as the mean of its assigned points. Simple, fast, and widely used.
- Hierarchical clustering: builds a tree of nested clusters, allowing you to choose the level of granularity. Useful when you do not know how many clusters to expect.
- DBSCAN: identifies clusters based on density, handling irregular shapes and automatically detecting outliers. Good for spatial data.
- Gaussian Mixture Models: assume the data comes from a mix of probability distributions, allowing soft cluster assignments where a point can partially belong to multiple clusters.
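The K-means procedure in the list above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a production implementation: it assumes Euclidean distance, uses the first K points as starting centres (a naive choice; real libraries use better initialisation), and the sample data is made up.

```python
import math

def kmeans(points, k, iterations=100):
    """Basic k-means on a list of equal-length numeric tuples."""
    # Assumption: start from the first k points as initial centres.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # Keep the old centre if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Centres stopped moving, so the assignments are stable.
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D data with two visibly separate groups.
data = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
labels, centroids = kmeans(data, k=2)
print(labels)
```

The returned labels are arbitrary integers. Deciding what each group means is still left to whoever reads the result.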
Business applications
- Customer segmentation: grouping customers by purchasing behaviour, engagement patterns, or demographics
- Anomaly detection: identifying data points that do not fit any cluster, flagging potential fraud or errors
- Document organisation: automatically grouping similar documents, emails, or support tickets
- Market research: discovering natural segments in survey data without imposing predefined categories
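To make the anomaly detection use case concrete, here is one simple way it might be approached: flag any point that lies further than a chosen threshold from every cluster centre. The centres, records, and threshold below are hypothetical. In practice the centres would come from a fitted clustering model and the threshold would need tuning for the data at hand.

```python
import math

def flag_anomalies(points, centroids, threshold):
    """Return the points whose nearest cluster centre is further than `threshold`."""
    flagged = []
    for p in points:
        nearest = min(math.dist(p, c) for c in centroids)
        if nearest > threshold:
            flagged.append(p)
    return flagged

# Hypothetical cluster centres, assumed to come from an earlier clustering run.
centres = [(1.0, 1.0), (9.0, 9.0)]
# Hypothetical records to check against those centres.
records = [(1.2, 0.8), (8.5, 9.4), (5.0, 20.0), (0.9, 1.3)]

suspicious = flag_anomalies(records, centres, threshold=3.0)
print(suspicious)
```

Density-based algorithms such as DBSCAN take a different route and mark outliers as part of the clustering itself, so a separate threshold step like this is not always needed.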
Challenges
- Choosing the right number of clusters is often subjective
- Results depend heavily on how similarity is measured and which features are included
- Clusters must be interpreted by humans; the algorithm groups data but does not explain why
- High-dimensional data can make distance metrics unreliable
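The last point can be illustrated with a small experiment. The sketch below draws random points in a unit cube and compares the farthest and nearest distances from a random query point. The function name, sample size, and seed are arbitrary choices for illustration. As the number of dimensions grows, the ratio tends towards 1, meaning "near" and "far" become hard to tell apart.

```python
import math
import random

def distance_contrast(dim, n_points=200, seed=0):
    """Ratio of the farthest to the nearest distance from a random query
    point to random points drawn uniformly from the unit cube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return max(dists) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 2))
```

This is one reason dimensionality reduction or careful feature selection is often applied before clustering wide datasets.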
Why This Matters
Clustering reveals patterns in your data that you might never find manually. Customer segmentation, anomaly detection, and content organisation are high-value use cases accessible to most organisations. Understanding clustering helps you identify opportunities where unsupervised learning can deliver business insights without the cost of labelled training data.
This topic is covered in our lesson: Building Your First AI Workflow