Clustering Explained

Shirsh Verma
Published in AlmaBetter
4 min read · Apr 22, 2021


Clustering comes under the ambit of unsupervised learning, a branch of machine learning mainly used for finding patterns in data where the target variable is not known or is yet to be discovered. This technique is usually applied before building a model for the problem statement. In clustering, the major task is to divide the population or data points into groups such that data points in the same group are more similar to each other than to those in other groups. In simple terminology, the objective is to segregate points with similar characteristics and assign them to clusters. This technique is often used for locating interesting patterns in data, like groups of credit card users and their spending behaviour.


Clustering Variants

There are 2 types:

  • Hard Clustering: In this type of clustering, each data point either belongs to a cluster completely or it does not.
  • Soft Clustering: In this, instead of assigning each data point to exactly one cluster, a probability or likelihood of that data point belonging to each of the clusters is assigned (see the sketch after this list).
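
As a concrete illustration of the two variants, here is a short sketch, assuming scikit-learn is available: KMeans gives a hard label per point, while GaussianMixture returns a per-cluster probability. The synthetic blob data is made up for the example.

```python
# Hard vs. soft assignments on synthetic data (illustrative sketch).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Hard clustering: each point gets exactly one cluster label.
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print(hard_labels[:5])          # e.g. [2 0 0 1 2]

# Soft clustering: each point gets a probability for every cluster.
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
soft_probs = gmm.predict_proba(X)
print(soft_probs[0].round(3))   # e.g. [0.998 0.001 0.001]
```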

Algorithms used for implementing Clustering

  • Linkage: This algorithm is based on the assumption that data points which are nearer in the data space are more similar to each other than data points lying farther away. It follows two approaches. In the first, every data point starts in its own cluster, and clusters are aggregated as the distance between them decreases. In the second, all the data points start in a single cluster, which is then partitioned as the distance increases. We can also change the distance function as per our choice. These algorithms are very easy to interpret but perform poorly on large datasets. Examples are the hierarchical clustering algorithm and its variants.
  • Centroid: It is an iterative clustering algorithm in which the notion of similarity is derived from the closeness of a data point to the centroid of the clusters. K-means is an example. Here, the number of clusters has to be specified before running the algorithm, which makes it necessary to have prior knowledge of the dataset. The algorithm runs iteratively until it reaches a local convergence.
  • Distribution: This algorithm primarily assumes that all data points in a cluster belong to the same distribution (e.g. Normal/Gaussian). Models built on these algorithms are often prone to overfitting. An example is the Expectation-Maximization algorithm, which uses multivariate normal distributions.
  • Density: This algorithm searches the data space for areas with varying density of data points. It segregates the different density regions and assigns the data points within a region to the same cluster. Examples of density-based clustering are DBSCAN and OPTICS. A sketch fitting one representative of each family follows this list.
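
As a rough sketch of the four families, the snippet below fits one representative of each on the same synthetic data, assuming scikit-learn; the parameter values (eps, min_samples, the number of clusters) are illustrative guesses, not recommendations.

```python
# One representative algorithm per clustering family (illustrative sketch).
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# Linkage (hierarchical): merges the closest clusters bottom-up.
linkage_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Centroid: iteratively moves k centroids to the mean of their points.
centroid_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Distribution: models the data as a mixture of Gaussians (fit by EM).
dist_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

# Density: grows clusters from dense regions; sparse points get label -1 (noise).
density_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
```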

K-means clustering algorithm — It is the simplest unsupervised learning algorithm that solves the clustering problem. Here we partition n observations into k clusters, where each observation is allocated to the cluster whose mean, serving as a prototype of the cluster, is nearest. It is used to discover grouping patterns among unlabeled data points as per the business need. Honestly, there are no universal criteria for a clustering; it depends on the user's standards and on what satisfies their need. In simple terms, we might want to find representatives for homogeneous groups (data reduction), find “natural clusters” and describe their unknown properties (“natural” data types), find useful and suitable groupings (“useful” data classes), or find unusual data objects (outlier detection). The algorithm rests on assumptions about what constitutes the similarity of points, and every assumption makes different and equally valid clusters.
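Below is a minimal from-scratch sketch of the K-means loop in NumPy, to make the iterative assignment/update steps and local convergence concrete; the random initialization, tolerance, and iteration cap are illustrative assumptions, not details from the article.

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct points from the data (one
    # simple choice among many possible initialization schemes).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points;
        # keep the old centroid if a cluster happens to be empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop at local convergence: centroids barely move between iterations.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids
```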

Hierarchical clustering algorithm — There are two types of hierarchical clustering methods:

  • Agglomerative Clustering
  • Divisive Clustering

Agglomerative Clustering:

It is a bottom-up approach: initially, each data point is a cluster of its own, and pairs of clusters are merged as we move up the hierarchy.

Divisive Clustering:

In this clustering algorithm we use a top-down approach: initially, all the points in the dataset are assigned to one cluster, and splits are performed repeatedly as we move down the hierarchy.
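
Since the article stops short of code, here is a minimal agglomerative (bottom-up) example using SciPy's hierarchical clustering; the synthetic two-blob data, the 'ward' linkage choice, and the cut at two clusters are all assumptions for illustration.

```python
# Agglomerative clustering with a dendrogram (illustrative sketch).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])

# Bottom-up merging: 'ward' minimizes within-cluster variance at each merge.
Z = linkage(X, method="ward")

# Cut the tree into 2 flat clusters (the stopping point is our choice).
labels = fcluster(Z, t=2, criterion="maxclust")

dendrogram(Z)   # the tree can also be cut by eye from this plot
plt.show()
```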

Difference between K Means and Hierarchical clustering

  • Hierarchical clustering performs badly on big datasets, whereas K Means clustering performs well on both small and large datasets. This is because the time complexity of K Means is linear, O(N), while for hierarchical clustering it is quadratic, O(N²).
  • In K Means clustering, we start from random clusters by taking a seed value; if we do not set the seed, the results differ on each run (see the sketch after this list). In hierarchical clustering, the results are replicable.
  • K Means works well when the shape of the clusters is spherical.
  • In K Means clustering, we should know in advance k, the number of clusters into which we want to divide our data. In hierarchical clustering, by contrast, we can stop at whatever number of clusters gives a proper interpretation of the dendrogram, or as per the business need.
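
To make the seed point from the list above concrete, here is a small sketch, assuming scikit-learn: without a fixed random_state, K-means with a single random initialization can land in different local optima across runs, while fixing the seed makes the results replicable.

```python
# Effect of the seed on K-means results (illustrative sketch).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=4, random_state=1)

for run in range(3):
    km = KMeans(n_clusters=4, n_init=1, init="random")  # no random_state set
    km.fit(X)
    print(f"run {run}: inertia = {km.inertia_:.2f}")    # may differ per run

# Fixing the seed makes the result the same on every run:
km = KMeans(n_clusters=4, n_init=1, init="random", random_state=7).fit(X)
```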

Applications of Clustering for solving a variety of Business Problems

  • Recommender System: Netflix, YouTube
  • Customer Segmentation: Trends in customer behaviour
  • Social network: Facebook
  • Medical
  • Insurance decisioning
  • Anomaly detection
  • Biology: Classification of different species in plants and animals
