K-Means Clustering

Abhilash Singh
6 min read · Apr 5, 2021

Introduction

Clustering is an unsupervised learning technique and one of the most important methods for gaining insight into the underlying nature and structure of data. Clustering is defined as the process of grouping a set of data points/objects into a number of groups, called clusters, such that objects in the same cluster are similar to one another but dissimilar to objects in other clusters. K-means is the most widely used clustering algorithm because its results are easy to interpret and it is straightforward to implement. It is primarily used in exploratory data mining and statistical data analysis, and appears in many fields including machine learning, pattern recognition, and bioinformatics, among others. For example, in the figure below we have a scatter plot of data points that can clearly be grouped into 3 separate clusters, where each circle represents a different cluster.

Partition-based Clustering

These methods are mostly distance-based. An initial partitioning of the data objects into k distinct clusters is made, and an iterative relocation process is then used to improve the partitioning by moving objects from one cluster to another based on their closeness to the other objects in a cluster. Data objects in different clusters are very different and far apart. The K-means clustering algorithm falls under this method of clustering.

K-Means

K-Means clustering is a simple and elegant approach for partitioning data into k distinct, non-overlapping clusters. To perform K-Means clustering we first specify the desired number of clusters k.

Here one question arises: how do we decide k?

We use the elbow method for finding k. We calculate the within-cluster sum of squares (WCSS) for different values of k and plot WCSS against k; the location of the knee in the graph is generally considered an indicator of the appropriate number of clusters.

As we can observe from the above graph, k = 4 is our optimal number of clusters, because if we increase the number of clusters beyond this point we do not get an appreciable reduction in the within-cluster sum of squares.
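The elbow method above can be sketched with scikit-learn (an assumption here; the article's own code comes in a later installment). `inertia_` is scikit-learn's name for the WCSS, and the data is a synthetic four-blob set made with `make_blobs`:

```python
# Elbow-method sketch: compute WCSS (inertia_) for k = 1..9 on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # inertia_ = within-cluster sum of squares

# Plot wcss against k (e.g. with matplotlib) and look for the knee;
# with 4 true centers the sharp drop typically flattens around k = 4.
for k, w in zip(range(1, 10), wcss):
    print(f"k={k}: WCSS={w:.1f}")
```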

Again a question arises: how do we calculate the WCSS (within-cluster sum of squares)?

The within-cluster variation for the kth cluster is the sum of all pairwise squared Euclidean distances between the observations in the kth cluster, divided by the total number of observations in the kth cluster.
Let x1, x2, x3, …, xn be the set of data points and m1, m2, m3, …, mk be the cluster means of clusters c1, c2, c3, …, ck respectively.

The within-cluster sum of squares for the kth cluster is

W(c_k) = (1/N_k) Σ_{i, i′ ∈ c_k} ||x_i − x_i′||²

where N_k is the number of observations in cluster c_k and the sum runs over all pairs of observations in that cluster.
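A useful identity (stated in An Introduction to Statistical Learning, cited below) is that this pairwise form, summed over ordered pairs, equals twice the sum of squared deviations from the cluster mean. A small NumPy check of that identity, on a hypothetical cluster of random points:

```python
# Check: (1/N) * sum over ordered pairs ||x_i - x_i'||^2
#      == 2 * sum_i ||x_i - cluster mean||^2
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(size=(50, 3))  # 50 points in a hypothetical cluster
n = len(cluster)

pairwise = sum(np.sum((a - b) ** 2) for a in cluster for b in cluster) / n
deviation = 2 * np.sum((cluster - cluster.mean(axis=0)) ** 2)

print(pairwise, deviation)  # the two quantities agree
```

This is why implementations compute WCSS via distances to the cluster mean rather than over all pairs.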

K-Means Method

The method starts either with a random partition of the data points into k clusters, or with a randomly chosen set of k seed points that act as the nuclei of the k clusters.
Let x1, x2, x3, …, xn be the set of data points and m1, m2, m3, …, mk be the cluster means of clusters c1, c2, c3, …, ck respectively.

Random Partition

  1. Randomly partition the n data points into k clusters.
  2. Reassign an object if it is closer (in the Euclidean sense) to the cluster
    center of another cluster than to that of its currently assigned cluster.
  3. Recalculate the cluster means of the cluster losing the object and of the
    cluster gaining the object.
  4. Continue steps 2 and 3 until no further reassignment is possible.

Random Seed Points

  1. Generate k random initial seed points.
  2. Walk through the full data set and assign the n objects to the k clusters
    corresponding to those seed points.
  3. Reassign an object if it is closer (in the Euclidean sense) to the cluster
    center of another cluster than to that of its currently assigned cluster.
  4. Recalculate the cluster means of the cluster losing the object and of the
    cluster gaining the object.
  5. Continue steps 3 and 4 until no further reassignment is possible.
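The seed-point variant above can be sketched from scratch in a few lines of NumPy (a minimal sketch for illustration, not a production implementation; it does not handle the rare case of a cluster becoming empty):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random observations as the initial seed points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 2-3: assign each object to its nearest center (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recalculate the cluster means.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the means no longer move (no reassignment).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Note that the result depends on the random seed points, which is exactly the local-optimum issue listed under Cons below.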

Mathematical Explanation of K-Means Algorithm

The total point scatter within the clusters can be written as

W(C) = Σ_{k=1}^{K} N_k Σ_{C(i)=k} ||x_i − x̄_k||²    … (1)

where x̄_k is the mean vector associated with the kth cluster and N_k denotes the number of objects in the kth cluster. We assign the observations in such a way that, within each cluster, the average dissimilarity (distance) of the observations from the cluster mean, as defined by the points in that cluster, is minimized:

C* = argmin_C Σ_{k=1}^{K} N_k Σ_{C(i)=k} ||x_i − x̄_k||²    … (A)

C* can be obtained by noting that, for any set of observations S,

x̄_S = argmin_m Σ_{i ∈ S} ||x_i − m||²    … (2)

so we can obtain C* by solving the enlarged optimization problem

min_{C, {m_k}} Σ_{k=1}^{K} N_k Σ_{C(i)=k} ||x_i − m_k||²    … (3)

Defining the K-Means Algorithm Using the Above Equations

  1. For a given cluster assignment {C1, C2, …, Ck}, equation (3) is minimized
    with respect to {m1, m2, …, mk}, yielding the means of the currently assigned clusters, as in equation (2).
  2. Given the current set of means {m1, m2, …, mk}, equation (3) is
    minimized by assigning each observation to its nearest centroid, i.e.

    C(i) = argmin_{1 ≤ k ≤ K} ||x_i − m_k||²

  3. Repeat steps 1 and 2 until the assignments no longer change.

Pros

  1. Fast and easy to understand.
  2. Relatively efficient.
  3. Gives the best results when the data points in the dataset are distinct (well separated).

Cons

  1. Requires a priori specification of the number of clusters.
  2. It finds only a local optimum.
  3. Euclidean distance measures can weight underlying factors unequally.
  4. Hard assignment of objects to clusters.

Here one important question comes to mind: how do we check the performance of K-Means clustering?

One method is the Calinski-Harabasz index, also known as the Variance Ratio Criterion.

It is defined as the ratio of the between-cluster dispersion to the within-cluster dispersion. The higher the index, the better the performance:

CH = [tr(B_k) / (k − 1)] / [tr(W_k) / (n − k)]

where n denotes the total number of observations, k denotes the total number of clusters, tr(B_k) denotes the trace of the between-group dispersion matrix, and tr(W_k) denotes the trace of the within-cluster dispersion matrix.

B_k and W_k are defined as

W_k = Σ_{q=1}^{k} Σ_{x ∈ C_q} (x − m_q)(x − m_q)^T
B_k = Σ_{q=1}^{k} n_q (m_q − m)(m_q − m)^T

where n_q denotes the number of observations in the qth cluster, m_q denotes the centroid of the qth cluster, and m denotes the centroid of all observations.
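In practice the index can be computed directly with scikit-learn's `calinski_harabasz_score` (an assumption here that scikit-learn is available; the data is again a synthetic four-blob set):

```python
# Compare the Calinski-Harabasz index across candidate values of k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)
    print(f"k={k}: CH index = {scores[k]:.1f}")
```

Since higher is better, the k with the largest index is preferred; for well-separated data with 4 true groups, that tends to be k = 4.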

Conclusion

In this article we learnt what K-Means clustering is, how its algorithm works, its pros and cons, and how to measure the performance of K-Means. In the next article we will learn about applications of K-Means and how to use K-Means in Python and also in R.

References

  1. An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
  2. The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman
  3. GeeksforGeeks
