Clustering Metrics (Clustering)

[Figure: visual representation of the clustering metrics]

Clustering Quality

Clustering Quality is calculated from the Silhouette Score:

  • A higher value is better.

  • The worst value is -1, which indicates that samples have been assigned to the wrong clusters.

  • A value near 0 is still poor: the clusters overlap and the distance between them is not significant.

  • 1 is perfect, clusters are perfectly apart from each other and clearly distinguished.

The Silhouette Score is defined as s = mean((b − a) / max(a, b)), averaged over all samples, where:

  • a – The mean distance between a sample and all other points in the same cluster.

  • b – The mean distance between a sample and all points in the nearest cluster that the sample does not belong to.
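The definition above can be sketched in plain Python. This is a minimal illustration rather than the implementation used to produce the metric: it assumes Euclidean distance, treats a cluster with a single sample as contributing 0, and the function name `silhouette_score` and the toy data are made up for the example.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all samples, using Euclidean distance."""
    # Group the samples by their assigned cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    coefficients = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Assumed convention: a single-sample cluster contributes 0.
            coefficients.append(0.0)
            continue
        # a: mean distance to the other samples in the same cluster.
        # The sample's distance to itself is 0, so only the divisor excludes it.
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the samples of the nearest other cluster.
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        coefficients.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(coefficients) / len(coefficients)


# Hypothetical toy data: two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # sensible assignment
bad = silhouette_score(points, [0, 1, 0, 1])   # mixed-up assignment
```

With this toy data, the sensible assignment scores close to 1 and the mixed-up assignment scores below 0, which matches the interpretation of the bullet points above.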

Variance Ratio Criterion

Variance Ratio Criterion is calculated from the Calinski-Harabasz Index and represents the ratio of between-cluster dispersion to within-cluster dispersion.

  • A higher value is better.

  • The worst is 0.

  • The value depends on the number of samples, so it can differ widely between two datasets and should not be compared across them.

  • There is no upper bound, so there is no perfect value.

The Calinski-Harabasz Index is the ratio of the between-cluster dispersion to the within-cluster dispersion, each summed over all clusters (where dispersion is defined as the sum of squared distances).
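As a rough sketch of this ratio, the snippet below computes the index in plain Python. It follows the standard definition, which additionally divides the between-cluster dispersion by (k − 1) and the within-cluster dispersion by (n − k) for k clusters and n samples; this normalization is why the number of samples influences the value. The function name, the handling of zero within-cluster dispersion, and the toy data are assumptions made for the example.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index using squared Euclidean distances."""
    n = len(points)
    dim = len(points[0])

    # Group the samples by their assigned cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than samples")

    def centroid(members):
        return [sum(p[d] for p in members) / len(members) for d in range(dim)]

    def squared_distance(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    overall = centroid(points)
    between = 0.0  # dispersion of cluster centroids around the overall centroid
    within = 0.0   # dispersion of samples around their own cluster centroid
    for members in clusters.values():
        center = centroid(members)
        between += len(members) * squared_distance(center, overall)
        within += sum(squared_distance(p, center) for p in members)

    if within == 0.0:
        # Assumed convention for this sketch: perfectly compact clusters.
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical toy data: two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
ch_index = calinski_harabasz(points, [0, 0, 1, 1])
```

For this toy data the between-cluster dispersion is 100 and the within-cluster dispersion is 1, so the index is (100 / 1) / (1 / 2) = 200.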

Cluster Separation

Cluster Separation is calculated from the Davies-Bouldin Index as:

Cluster Separation = 1 / (1 + Davies-Bouldin Index)

  • A higher value is better.

  • A perfect value is 1.

A lower Davies-Bouldin Index indicates a model with better separation between the clusters. Its minimum value is 0, which corresponds to a Cluster Separation of 1.

It is computed as the average similarity measure of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. Thus, clusters which are farther apart and less dispersed will result in a better score.
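The computation described above, together with the conversion to Cluster Separation, might look like the following sketch. It assumes Euclidean distance, measures within-cluster distance as the mean distance of a cluster's samples to its centroid, and assumes that no two centroids coincide. The function names and the toy data are illustrative only.

```python
import math


def davies_bouldin(points, labels):
    """Average, over clusters, of the similarity to the most similar cluster."""
    dim = len(points[0])

    # Group the samples by their assigned cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the Davies-Bouldin index needs at least two clusters")

    centroids, spreads = [], []
    for members in clusters.values():
        center = tuple(sum(p[d] for p in members) / len(members) for d in range(dim))
        centroids.append(center)
        # Within-cluster distance: mean distance of samples to their centroid.
        spreads.append(sum(math.dist(p, center) for p in members) / len(members))

    k = len(centroids)
    total = 0.0
    for i in range(k):
        # Similarity = within-cluster distances / between-cluster distance.
        # Assumes distinct centroids, otherwise the division fails.
        total += max(
            (spreads[i] + spreads[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
    return total / k


def cluster_separation(points, labels):
    """Map the Davies-Bouldin index onto (0, 1], where higher is better."""
    return 1.0 / (1.0 + davies_bouldin(points, labels))


# Hypothetical toy data: two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
db_index = davies_bouldin(points, [0, 0, 1, 1])
separation = cluster_separation(points, [0, 0, 1, 1])
```

Because the index is never negative, the transformed value always lies between 0 and 1, which makes it easier to read alongside the other "higher is better" metrics.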

Combined Score

The Combined Score is obtained by combining all the scores, after scaling each score by its maximum over all numbers of clusters:

  • Scaled Clustering Quality = Clustering Quality / Max(Clustering Quality for all numbers of clusters)

  • Scaled Variance Ratio Criterion = Variance Ratio Criterion / Max(Variance Ratio Criterion for all numbers of clusters)

  • Scaled Cluster Separation = Cluster Separation / Max(Cluster Separation for all numbers of clusters)

Combined Score = Scaled Clustering Quality * Scaled Clustering Quality * Scaled Variance Ratio Criterion * Scaled Cluster Separation
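The scaling and the product can be sketched as follows. The snippet follows the formula exactly as written, including the Scaled Clustering Quality factor appearing twice. The function name, the input layout, and the metric values are hypothetical, and the sketch assumes that the maximum of each metric is positive.

```python
def combined_scores(metrics_by_k):
    """Compute the Combined Score for each candidate number of clusters.

    metrics_by_k maps a number of clusters to a tuple of
    (clustering_quality, variance_ratio_criterion, cluster_separation).
    """
    # Maximum of each metric over all candidate numbers of clusters.
    max_quality = max(m[0] for m in metrics_by_k.values())
    max_variance_ratio = max(m[1] for m in metrics_by_k.values())
    max_separation = max(m[2] for m in metrics_by_k.values())

    combined = {}
    for k, (quality, variance_ratio, separation) in metrics_by_k.items():
        scaled_quality = quality / max_quality
        scaled_variance_ratio = variance_ratio / max_variance_ratio
        scaled_separation = separation / max_separation
        # Scaled Clustering Quality is multiplied in twice, as in the formula.
        combined[k] = (
            scaled_quality * scaled_quality * scaled_variance_ratio * scaled_separation
        )
    return combined


# Made-up metric values for two candidate numbers of clusters.
metrics_by_k = {
    2: (0.4, 100.0, 0.4),
    3: (0.8, 200.0, 0.8),
}
scores = combined_scores(metrics_by_k)
best_k = max(scores, key=scores.get)
```

A candidate that achieves the maximum of every metric gets a Combined Score of 1, so the score can be used to rank the candidate numbers of clusters against each other.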