Clustering Metrics (Clustering)
Visual representation of the clustering metrics:
Clustering Quality
Clustering Quality is calculated from the Silhouette Score:
A higher value is better.
The worst value is -1, which means samples have been assigned to the wrong clusters.
A value near 0 is still poor: the clusters overlap and the distance between them is not significant.
The best value is 1, which means the clusters are well separated and clearly distinguished.
The Silhouette Score is the mean, over all samples, of the per-sample value (b − a) / max(a, b): s = mean((b − a) / max(a, b))
where:
a – The mean distance between a sample and all other points in the same cluster.
b – The mean distance between a sample and all other points in the next nearest cluster.
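The definition of a and b above can be checked by hand on a tiny dataset. The following is a minimal sketch in plain Python, not the product's implementation: the function names, the Euclidean distance, and the convention that a single-point cluster scores 0 are all assumptions made for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all samples."""
    # Group sample indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    per_sample = []
    for i, label in enumerate(labels):
        same = [j for j in clusters[label] if j != i]
        if not same:
            # Assumption of this sketch: a single-point cluster scores 0.
            per_sample.append(0.0)
            continue
        # a: mean distance to the other points in the same cluster.
        a = sum(euclidean(points[i], points[j]) for j in same) / len(same)
        # b: mean distance to the points of the next nearest cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        per_sample.append((b - a) / denom if denom > 0 else 0.0)
    return sum(per_sample) / len(per_sample)


# Hypothetical data: two tight groups that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels mix the groups
print(good, bad)
```

With labels that match the two groups the score is close to 1, and with labels that mix the groups it becomes negative, which is the behavior described above.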
Variance Ratio Criterion
Variance Ratio Criterion is calculated from the Calinski-Harabasz Index and represents the ratio of between-cluster dispersion to within-cluster dispersion.
A higher value is better.
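The worst value is 0.
The value depends on the number of samples, so it can differ widely between datasets and should only be compared across clusterings of the same data.
There is no perfect value: the index has no upper bound.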
The Calinski-Harabasz Index is the ratio of the sum of between-cluster dispersion to the sum of within-cluster dispersion over all clusters, where dispersion is defined as the sum of squared distances.
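The ratio described above can be sketched in a few lines of plain Python. This is an illustration rather than the product's code. It assumes the usual form of the index, in which the between-cluster sum is divided by (k − 1) and the within-cluster sum by (n − k) for n samples and k clusters; this normalization is why the number of samples matters. The function names and the choice to return infinity when the within-cluster dispersion is 0 are assumptions of the sketch.

```python
def centroid(rows):
    """Component-wise mean of a list of points."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def calinski_harabasz(points, labels):
    """(between / (k - 1)) / (within / (n - k))."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    n, k = len(points), len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= number of clusters < number of samples")

    overall = centroid(points)
    between = 0.0  # dispersion of cluster centroids around the overall centroid
    within = 0.0   # dispersion of points around their own cluster centroid
    for members in clusters.values():
        center = centroid(members)
        between += len(members) * squared_distance(center, overall)
        within += sum(squared_distance(p, center) for p in members)

    if within == 0:
        # Assumption of this sketch: perfectly compact clusters score infinity.
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical data: two tight groups that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
ch_index = calinski_harabasz(points, [0, 0, 1, 1])
print(ch_index)
```

On this data the between-cluster dispersion is 100 and the within-cluster dispersion is 1, so the index is (100 / 1) / (1 / 2) = 200.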
Cluster Separation
Cluster Separation is calculated from the Davies-Bouldin Index as:
Cluster Separation = 1 / (1 + Davies-Bouldin Index)
A higher value is better.
A perfect value is 1.
A lower Davies-Bouldin Index indicates a model with better separation between the clusters. Its minimum is 0, which corresponds to a Cluster Separation of 1.
It is computed as the average similarity measure of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. Thus, clusters which are farther apart and less dispersed will result in a better score.
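As a rough illustration of the description above, the sketch below computes a Davies-Bouldin Index in plain Python and converts it to Cluster Separation. It is not the product's implementation. It assumes Euclidean distance, takes the within-cluster distance to be the mean distance of a cluster's points to its centroid, and assumes no two centroids coincide; all function names are hypothetical.

```python
import math


def centroid(rows):
    """Component-wise mean of a list of points."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def davies_bouldin(points, labels):
    """Average, over clusters, of the similarity to the most similar cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    centers = {lab: centroid(members) for lab, members in clusters.items()}
    # Within-cluster distance: mean distance of the points to their centroid.
    scatter = {
        lab: sum(euclidean(p, centers[lab]) for p in members) / len(members)
        for lab, members in clusters.items()
    }
    worst = []
    for i in clusters:
        # Similarity = within-cluster distances / between-cluster distance.
        worst.append(max(
            (scatter[i] + scatter[j]) / euclidean(centers[i], centers[j])
            for j in clusters if j != i
        ))
    return sum(worst) / len(worst)


def cluster_separation(db_index):
    """Map the index (0 is best) onto (0, 1], where 1 is best."""
    return 1.0 / (1.0 + db_index)


# Hypothetical data: two tight groups that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
db_index = davies_bouldin(points, [0, 0, 1, 1])
separation = cluster_separation(db_index)
print(db_index, separation)
```

Here each cluster has a within-cluster distance of 0.5 and the centroids are 10 apart, so the index is (0.5 + 0.5) / 10 = 0.1 and the Cluster Separation is 1 / 1.1, or about 0.909.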
Combined Score
The Combined Score is the product of the scores above, after each score is scaled by its maximum over all candidate numbers of clusters:
Scaled Clustering Quality = Clustering Quality / Max(“Clustering Quality” for all numbers of clusters)
Scaled Variance Ratio Criterion = Variance Ratio Criterion / Max(“Variance Ratio Criterion” for all numbers of clusters)
Scaled Cluster Separation = Cluster Separation / Max(“Cluster Separation” for all numbers of clusters)
Combined Score = “Scaled Clustering Quality” * “Scaled Clustering Quality” * “Scaled Variance Ratio Criterion” * “Scaled Cluster Separation”
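The scaling and multiplication above can be sketched as follows. This is a minimal illustration in plain Python that follows the formulas exactly as written, including the Scaled Clustering Quality factor appearing twice in the product. The function name, the input layout, and the metric values are hypothetical, and the sketch assumes the maximum of each metric is positive.

```python
def combined_scores(metrics_by_k):
    """Return {number of clusters: Combined Score}.

    metrics_by_k maps a number of clusters to a tuple
    (clustering_quality, variance_ratio_criterion, cluster_separation).
    """
    # Maximum of each metric over all candidate numbers of clusters.
    max_quality = max(m[0] for m in metrics_by_k.values())
    max_ratio = max(m[1] for m in metrics_by_k.values())
    max_separation = max(m[2] for m in metrics_by_k.values())

    combined = {}
    for k, (quality, ratio, separation) in metrics_by_k.items():
        scaled_quality = quality / max_quality
        scaled_ratio = ratio / max_ratio
        scaled_separation = separation / max_separation
        # Product as written above: Scaled Clustering Quality appears twice.
        combined[k] = (
            scaled_quality * scaled_quality * scaled_ratio * scaled_separation
        )
    return combined


# Hypothetical metric values for 2, 3, and 4 clusters.
metrics_by_k = {
    2: (0.5, 100.0, 0.4),
    3: (0.8, 200.0, 0.5),
    4: (0.4, 150.0, 0.25),
}
scores = combined_scores(metrics_by_k)
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Because every metric is divided by its own maximum, a number of clusters that is best on all three metrics gets a Combined Score of 1, as happens for 3 clusters in this made-up example.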