Clustering Algorithms and Evaluations
KMeans
What is KMeans? (source)
- It assigns data points to clusters such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum.

- K: the algorithm requires us to pre-define K, the number of clusters (i.e., K centers).
- Means: we calculate the centers by taking the mean of each cluster.
Process (visualization):
- Guess some cluster centers
- Repeat until converged:
  - Assign points to the nearest cluster center
  - Set the cluster centers to the mean of their assigned points
Wait... but how does this work? (source)
- The approach k-means follows to solve the problem is called Expectation-Maximization. The E-step assigns the data points to the closest cluster; the M-step computes the centroid of each cluster.
The objective function:

$$J = \sum_{i=1}^{N} \sum_{k=1}^{K} w_{ik} \, \lVert x_i - \mu_k \rVert^2$$
Where $w_{ik}=1$ for data point $x_i$ if it belongs to cluster $k$; otherwise, $w_{ik}=0$. Also, $\mu_k$ is the centroid of $x_i$’s cluster.
We first minimize $J$ w.r.t. $w_{ik}$ while treating $\mu_k$ as fixed; then we minimize $J$ w.r.t. $\mu_k$ while treating $w_{ik}$ as fixed.
- When $\mu_k$ is fixed, $J$ is minimized by assigning each point to its closest center:

$$w_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert x_i - \mu_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}$$
- When $w_{ik}$ is fixed, $J$ is minimized by recomputing each $\mu_k$ as the mean of its cluster:

$$\mu_k = \frac{\sum_{i} w_{ik}\, x_i}{\sum_{i} w_{ik}}$$
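A minimal NumPy sketch of this E/M loop (the random-point initialization and the simple convergence check are simplifying assumptions; scikit-learn's `KMeans` uses the smarter k-means++ initialization):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: alternate the E-step and M-step described above."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial guess
    for _ in range(n_iters):
        # E-step: w_ik -- assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: mu_k -- recompute each center as the mean of its points.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers
```

Calling `kmeans(X, k=3)` returns the final assignments and centers.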
How to choose K?
Elbow Method:
We pick k at the spot where the SSE curve starts to flatten out and forms an elbow.
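A short sketch of the elbow method with scikit-learn (the toy Gaussian data and the range of candidate k are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # toy data (assumption)

# SSE for each candidate k (scikit-learn exposes it as inertia_).
ks = range(1, 10)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Plot k vs. SSE and pick k near the "elbow" where the curve flattens.
plt.plot(list(ks), sse, marker="o")
plt.xlabel("k")
plt.ylabel("SSE (inertia)")
plt.show()
```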

Limitations:
- Requires the number of clusters to be known in advance.
- Is limited to linear cluster boundaries.
- The globally optimal result may not be achieved.
- k-means can be slow for large numbers of samples or features.
DBSCAN (Density-Based)
What is DBSCAN? (source)
- DBSCAN: Density-Based Spatial Clustering of Applications with Noise.
- Idea: clusters are dense regions in the data space, separated by regions of lower object density.
- Method: groups together points that are close to each other based on a distance measurement (usually Euclidean distance) and a minimum number of points. It also marks as outliers the points that lie in low-density regions.
Parameters:
- eps: specifies how close points should be to each other to be considered part of a cluster.
- minPoints: the minimum number of points required to form a dense region.
How does this work? (visualization)
- Visit every point (each point is visited only once) and check whether it satisfies: the number of points within a circle of radius eps is greater than or equal to minPoints.
- If yes: check whether this point is already in some cluster. If it is, this point and all its neighbors are marked as belonging to that cluster; if not, create a new cluster. Then visit all the neighbors and check the condition again.
- If no: mark this point as an outlier.
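A minimal sketch of this procedure using scikit-learn's `DBSCAN` (the toy data and parameter values are assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data (assumption)

# eps and min_samples correspond to eps and minPoints above.
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

labels = db.labels_  # cluster index per point; -1 marks outliers (noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(n_clusters, "clusters,", n_noise, "noise points")
```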
How to choose parameters?
- minPts: as a rule of thumb, a minimum value can be derived from the number of dimensions D in the data set, as minPts ≥ D + 1.
- The value for $\epsilon$ (eps) can then be chosen by using a k-distance graph: plot the distance to the k = minPts − 1 nearest neighbor, ordered from the largest to the smallest value. Good values of $\epsilon$ are where this plot shows an "elbow" (see the sketch below).
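A sketch of the k-distance graph using scikit-learn's `NearestNeighbors` (the toy data and minPts value are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data (assumption)
min_pts = 3  # e.g. D + 1 with D = 2

# n_neighbors = min_pts because the nearest "neighbor" of each point is itself,
# so the last column holds the distance to the (minPts - 1)-th true neighbor.
dists, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
k_dists = np.sort(dists[:, -1])[::-1]  # ordered from largest to smallest

plt.plot(k_dists)  # choose eps near the "elbow" of this curve
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {min_pts - 1}-th nearest neighbor")
plt.show()
```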
Advantages:
- No need to specify the number of clusters in advance.
- Can handle non-regular (arbitrary) cluster shapes.
Limitations:
- Sensitive to its parameters, and choosing them can be hard.
- Cannot handle clusters of varying density.
- Not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster.
- Suffers from the curse of dimensionality.

Hierarchical Clustering
A method of cluster analysis that seeks to build a hierarchy of clusters.

Types:
- Agglomerative: This is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive: This is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

How?
- Initially, every point is a separate cluster.
- Iterate until all points are in one cluster:
  - Identify the two clusters that are closest together.
  - Merge these two most similar clusters.

Distance:
- Usually Euclidean distance
Linkage: how the distance between two clusters is defined when deciding which pair to merge. Common choices (see the sketch below):
- Single linkage: the minimum pairwise distance between the two clusters.
- Complete linkage: the maximum pairwise distance.
- Average linkage: the average pairwise distance.
- Ward: merge the pair that minimizes the increase in total within-cluster variance.
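A minimal sketch of agglomerative clustering with SciPy (the toy data, Ward linkage, and cutting into 3 flat clusters are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(30, 2))  # toy data (assumption)

# Bottom-up: start from singleton clusters and repeatedly merge the two
# closest clusters; Z records the full merge hierarchy.
Z = linkage(X, method="ward", metric="euclidean")

dendrogram(Z)  # visualize the hierarchy
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
```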
Gaussian Mixture Models
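A Gaussian mixture model treats the clusters as components of a mixture of Gaussians fit with EM, so each point gets a soft (probabilistic) assignment instead of k-means' hard assignment. A minimal scikit-learn sketch (the toy data and 3 components are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))  # toy data (assumption)

# Fit a 3-component mixture of Gaussians by EM.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

hard_labels = gmm.predict(X)        # most likely component per point
soft_labels = gmm.predict_proba(X)  # posterior probability of each component
```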
Clustering Evaluation
Why evaluation? To compare different algorithms and to choose the best parameters. (source)
External measures for clustering evaluation
- External: assumes the ground truth is known.
Matching-based measures
- Purity: quantifies the extent to which a cluster $C_i$ contains points from only one ground-truth partition:

$$\text{purity} = \frac{1}{n} \sum_{i=1}^{r} \max_{j} \, |C_i \cap T_j|$$

where $C_1, \dots, C_r$ are the clusters, $T_1, \dots, T_k$ the ground-truth partitions, and $n$ the number of points.
- Drawback of purity: two clusters may be matched to the same partition.
- Maximum Matching: the maximum purity under the one-to-one matching constraint, where a matching $M$ pairs each cluster with at most one partition:

$$\text{match} = \max_{M} \frac{1}{n} \sum_{(C_i, T_j) \in M} |C_i \cap T_j|$$
- F-Measure: let's review precision and recall. For cluster $C_i$, precision is the fraction of its points coming from its best-matching partition, $\text{prec}_i = \frac{1}{|C_i|} \max_j |C_i \cap T_j|$, and recall is the fraction of that partition it recovers, $\text{recall}_i = \frac{|C_i \cap T_{j_i}|}{|T_{j_i}|}$ with $j_i = \arg\max_j |C_i \cap T_j|$. The per-cluster F-measure is their harmonic mean,

$$F_i = \frac{2 \cdot \text{prec}_i \cdot \text{recall}_i}{\text{prec}_i + \text{recall}_i},$$

and the overall F-measure is the mean of the $F_i$.
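A small sketch computing purity from the cluster-vs-partition overlap counts (the toy label vectors are assumptions):

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# Hypothetical ground-truth partitions and predicted clusters.
truth = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2])

# Overlap counts |C_i ∩ T_j|: rows are clusters, columns are partitions.
C = contingency_matrix(truth, pred).T

purity = C.max(axis=1).sum() / C.sum()  # best overlap per cluster, summed, over n
print(purity)  # 8/9 here: cluster 1 absorbs one point from partition 0
```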
Entropy-based measures
- Conditional Entropy: the entropy of the ground-truth partitioning $T$ that remains once the clustering $C$ is known (lower is better):

$$H(T \mid C) = -\sum_{i=1}^{r} \sum_{j=1}^{k} \frac{n_{ij}}{n} \log \frac{n_{ij}}{n_i}$$

where $n_{ij} = |C_i \cap T_j|$ and $n_i = |C_i|$.
- Mutual Information: quantifies how much information the clustering and the ground truth share (higher is better):

$$I(C, T) = \sum_{i=1}^{r} \sum_{j=1}^{k} \frac{n_{ij}}{n} \log \frac{n \cdot n_{ij}}{n_i \, m_j}$$

where $m_j = |T_j|$. It is often normalized (NMI) so that it lies in $[0, 1]$.
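Mutual information and its normalized variant are available directly in scikit-learn (the toy label vectors are assumptions):

```python
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred = [0, 0, 1, 1, 1, 1, 2, 2, 2]

print(mutual_info_score(truth, pred))             # I(C, T), in nats
print(normalized_mutual_info_score(truth, pred))  # scaled to [0, 1]
```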
Pairwise measures
- Pairwise: evaluate agreement over pairs of points, counting pairs that do or do not share a cluster and a ground-truth partition (e.g., the Jaccard coefficient and Rand statistic).
Internal measures for clustering evaluation
- Internal: no ground truth is used. A good clustering places intra-cluster points as close as possible to each other and distinct clusters as far as possible from each other.
Silhouette Coefficient

For a point $x_i$, let $a_i$ be the mean distance to the other points in its own cluster and $b_i$ the mean distance to the points in the nearest other cluster:

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

The overall coefficient is the mean of the $s_i$; values close to 1 indicate compact, well-separated clusters.
Davies-Bouldin Index

With $\sigma_i$ the mean distance of points in cluster $i$ to its centroid $\mu_i$:

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(\mu_i, \mu_j)}$$

Lower values indicate better clusterings.
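Both indices are available in scikit-learn (the toy data and the k-means labels are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

X = np.random.default_rng(0).normal(size=(300, 2))  # toy data (assumption)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))      # higher is better, in [-1, 1]
print(davies_bouldin_score(X, labels))  # lower is better, >= 0
```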
Graph-based measures
- Normalized Cut: a graph-cut criterion that compares the weight of edges crossing between clusters with the weight kept within each cluster.
- Beta-CV Measure: the ratio of the mean intra-cluster distance to the mean inter-cluster distance; smaller values indicate a better clustering.