Usage (Clustering)
Take the following steps to configure a model:
1. Create a Model based on Clustering Model Class
Go to Optimization > Models (MO) and click the Add Model button at the top right.
A pop-up appears where you enter a name for your model and select the Model Class, which is Clustering. Other Model Classes belong to other kinds of optimization models.
You can also duplicate an existing model. In this case, you will keep all the inputs of the previous model and you will have to rerun all the steps to get the outputs. Once you have copied a model, you can change its name by double-clicking the blank side of its name/label.
Remember to do it before running the model. You cannot change a model name once it has been computed.
The same model class, Clustering, can be used by many models. Use informative names for your models, providing information on your dataset, and your calculation case.
2. Set the Scope of Transactions (Definition Step)
In the Definition step, you map the inputs and set the scope of the model. The user inputs are always on the left. Refer also to Data Requirements (Clustering).
Source – Datamart (typically transactions Datamart) used to perform clustering. It must fulfill the requirements listed in Installation (Clustering). Once provided, some fields based on it appear:
Metric – Select the type of metric that will be the basis of comparison of the items to be clustered. The clustering approach will group together items with similar values.
Spend Pattern computes the share of revenue (set in the Revenue field) spent across categories (typically product category). A typical use case is profiling customers based on their consumption, with the aim of grouping together customers purchasing similar types of products.
Average takes the average of the target.
Median takes the median of the target.
Sum takes the sum of the target.
Target – Defines the attribute used by the metric computation (Spend pattern or Average/Median/Sum).
For the Spend Pattern metric, revenue must be defined as the target, so that the shares of revenue are compared to define the clusters.
For Average/Median, you can use, for example, the discount rate or the margin rate. The clustering then groups together items with a similar pattern of discount rate or margin rate.
For Sum, use a metric that can be summed, such as the absolute value of profit. In that case, clustering is based on the pattern of profit generated for each combination of “Group” and “Based On”.
Group – Defines the attribute intended to be grouped together, e.g. customers to be grouped in customer clusters. It is the first of the two dimensions used to aggregate the data.
Based On – Defines the attribute against which the metric will be compared. This is the second of the two dimensions used to aggregate the data.
Attributes for “Group” and “Based On” cannot be part of the same hierarchy, e.g. Product Category and Product Sub-Category. In that case the clustering process will fail as no pattern can be found.
Revenue – Provides the transaction revenue, which will be the basis of some analysis. It may be a net price, gross price, or another, depending on the properties you want to explore.
Additional Features – Optional fields you may want to keep in the data for further filtering during the Model Configuration step or to make the result review easier.
Advanced Filter – Allows you to filter out meaningless data, e.g. by time frame (the last 12 months is a good start). Use these filters wisely: even though the model can execute with as few as 7 “Group” items, the clustering result might not be very relevant. As a rule of thumb, the number of items should exceed 10 times the expected number of clusters.
We recommend that you define at least these filters for data cleanliness and thus avoid errors later on:
Revenue > 0
Quantity > 0
Removing null values for “Group” and “Based On”
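The recommended filters and the rule of thumb on the number of items can be pictured as a simple row predicate. The following is a minimal sketch in plain Python, not the product's implementation: the field names (`customer`, `category`, `revenue`, `quantity`) and both helper functions are hypothetical, and in the application the filters are defined in the Advanced Filter input rather than in code.

```python
# Hypothetical transaction rows; the field names are illustrative assumptions.
transactions = [
    {"customer": "C1", "category": "Tools", "revenue": 120.0, "quantity": 3},
    {"customer": "C2", "category": None, "revenue": 80.0, "quantity": 1},      # null "Based On"
    {"customer": None, "category": "Paint", "revenue": 50.0, "quantity": 2},   # null "Group"
    {"customer": "C3", "category": "Paint", "revenue": -10.0, "quantity": 1},  # negative revenue
    {"customer": "C4", "category": "Tools", "revenue": 40.0, "quantity": 0},   # zero quantity
]


def is_clean(row, group="customer", based_on="category"):
    """Return True when a row passes the recommended cleanliness filters."""
    return (
        row["revenue"] is not None and row["revenue"] > 0
        and row["quantity"] is not None and row["quantity"] > 0
        and row[group] is not None
        and row[based_on] is not None
    )


def enough_groups(rows, expected_clusters, group="customer"):
    """Rule of thumb: distinct group items should exceed 10x the expected clusters."""
    return len({row[group] for row in rows}) > 10 * expected_clusters


clean_rows = [row for row in transactions if is_clean(row)]
```

In this toy data only the first row survives, and `enough_groups(clean_rows, 1)` is false, which signals that the scope is too narrow for a meaningful clustering.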
Once you apply the settings, the right panel provides:
Data selection sample – Sample of the filtered data that will be the scope of the clustering.
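To make the Metric, Target, Group and Based On inputs more concrete, the sketch below aggregates a few transactions into a group x based-on matrix. It is an illustration under assumptions, not the product's code: the `aggregate` function and the sample rows are hypothetical, and the rows are assumed to have already passed the Revenue > 0 filter so that the revenue total of each group is non-zero.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical rows: (group item, based-on item, target value).
rows = [
    ("C1", "Tools", 60.0),
    ("C1", "Paint", 40.0),
    ("C2", "Tools", 10.0),
    ("C2", "Tools", 30.0),
    ("C2", "Garden", 60.0),
]


def aggregate(rows, metric):
    """Build a {group: {based_on: value}} matrix for the chosen metric."""
    cells = defaultdict(lambda: defaultdict(list))
    for group, based_on, target in rows:
        cells[group][based_on].append(target)

    reducers = {"Average": mean, "Median": median, "Sum": sum, "Spend Pattern": sum}
    reducer = reducers[metric]
    matrix = {
        group: {based_on: reducer(values) for based_on, values in by_item.items()}
        for group, by_item in cells.items()
    }

    if metric == "Spend Pattern":
        # Turn summed revenue into each group's share per based-on item.
        for group, by_item in matrix.items():
            total = sum(by_item.values())
            matrix[group] = {b: value / total for b, value in by_item.items()}
    return matrix


spend = aggregate(rows, "Spend Pattern")
```

With Spend Pattern, each row of the matrix sums to 1, so customers are compared on how they split their revenue rather than on how much they spend.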
3. Data Overview and Setting Clustering Parameters (Model Configuration Step)
Opening the Model Configuration step triggers an initial calculation run to prepare the data and provide the user with some insights about items to group and based-on items. Some Pareto distribution diagrams can help the user in setting a threshold to ignore e.g. some very small customers or products that may degrade the clustering process (including less relevant data) and increase computation time.
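The Pareto-based threshold can be sketched as a cumulative revenue cut. This is a minimal illustration with a hypothetical function name and sample data; whether the product keeps or drops the item that crosses the threshold is not stated here, so the behavior below (keep items until the cumulative revenue reaches the requested percentage) is an assumption.

```python
def keep_top_revenue_share(revenue_by_item, percent):
    """Keep the largest items until their cumulative revenue reaches `percent` of the total."""
    total = sum(revenue_by_item.values())
    kept, cumulative = [], 0.0
    ranked = sorted(revenue_by_item.items(), key=lambda kv: kv[1], reverse=True)
    for item, revenue in ranked:
        if cumulative >= total * percent / 100.0:
            break  # the requested share of revenue is already covered
        kept.append(item)
        cumulative += revenue
    return kept


# Hypothetical revenue per customer.
revenue_by_customer = {"C1": 500.0, "C2": 300.0, "C3": 150.0, "C4": 40.0, "C5": 10.0}
kept = keep_top_revenue_share(revenue_by_customer, 95)
```

Here the three largest customers already cover 95% of the revenue, so the two smallest ones are left out of the initial clustering.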
The user inputs on the left allow you to tune both the clustering process and the post-treatment labelling. The clustering process is based on an unsupervised hierarchical clustering algorithm that produces a tree based on the distances between groups. Once this tree is computed, you can select the level of the tree that will be kept as the clustering threshold. A level close to the root of the tree yields a few weakly discriminating clusters, while a level close to the leaves yields many highly discriminating clusters.
Minimum Number of Clusters – Lower bound of the range of cluster counts from which clustering solutions will be suggested.
Maximum Number of Clusters – Upper bound of the range of cluster counts from which clustering solutions will be suggested.
Minimum Revenue in a Clusters – Sets the minimum total revenue a cluster should reach. This would remove clusters representing a low total revenue. Any item initially assigned to such a small cluster will then be reassigned to the closest cluster.
Percent of GroupBys to process – Threshold that keeps in the initial clustering process only the group items that together represent x% of the total revenue. This is a high-pass filter that can, for example, remove one-shot customers from the analysis. Remember to keep enough “GroupBys” so that they exceed 10 times the expected number of clusters.
Percent of BasedOns to process – Threshold that keeps in the initial clustering process only the based-on items that together represent x% of the total revenue. This is a high-pass filter that can, for example, remove long-tail products from the analysis.
Expense percent threshold – Affects the values in the group x based-on matrix in the Spend Pattern analysis: the default value of 1 means that when a group has spent less than 1% of its revenue on a given based-on, this expense is set to zero so that the comparison of groups focuses on the main based-ons they are linked to.
Linkage method – Defines the way the clusters are aggregated together for the final outputs. The available options are: single, complete, average, weighted, centroid, and ward (a good default choice). See Linkage method for further details.
Name Prefix Clusters – Optional string that will be prepended to the automatically generated names of the clusters. Automatic naming is a summary of the 3 main based-ons that are present in a cluster.
Cluster affectation – Is either Basic or Extended:
Basic – Only clusters the group items selected for clustering after reaching the thresholds defined above.
Extended (default) – All group items receive a cluster label; the ones that were not part of the initial clustering process are assigned to existing clusters in the second pass based on their proximity to already clustered group items.
Show detailed heatmap in result – If set to Yes, the result visualizes all or a sample of the groups and how they are put together in clusters.
Suffix for Data Source – String that is added to customize the name of the final exported Data Source.
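Two of the parameters above lend themselves to a short illustration: the Expense percent threshold and the Extended cluster affectation. The sketch below is only one possible reading of those descriptions. Both function names, the sample data, and the use of Euclidean distance to the nearest already clustered group are assumptions, not the product's documented behavior.

```python
import math


def apply_expense_threshold(shares, threshold_percent=1.0):
    """Set to zero every share below `threshold_percent` of the group's revenue."""
    limit = threshold_percent / 100.0
    return {based_on: (s if s >= limit else 0.0) for based_on, s in shares.items()}


def assign_extended(shares, clustered):
    """Give an unclustered group the label of its nearest clustered group."""
    def distance(a, b):
        keys = set(a) | set(b)
        return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

    nearest = min(clustered, key=lambda g: distance(shares, clustered[g][0]))
    return clustered[nearest][1]


# Hypothetical groups already clustered in the first pass: (shares, cluster label).
clustered = {
    "C1": ({"Tools": 0.9, "Paint": 0.1}, "CL_Tools"),
    "C2": ({"Garden": 1.0}, "CL_Garden"),
}

# A small customer left out of the first pass; its 0.5% Garden share is zeroed.
small_customer = apply_expense_threshold({"Tools": 0.795, "Paint": 0.2, "Garden": 0.005})
label = assign_extended(small_customer, clustered)
```

With Basic affectation, the small customer would simply receive no cluster label.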
In order to apply any changed parameter, it is necessary to click the Apply Setting button at the bottom left of the panel.
This button also updates the values in the right panel to provide an estimate of the group items and based-on items that would be filtered out.
When all parameters are correctly set, click the top right Continue button to trigger the clustering process.
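For readers unfamiliar with hierarchical clustering, the idea of building a tree and cutting it at a given number of clusters can be sketched in a few lines. The code below is a naive, pure-Python illustration using Ward's criterion (merge the pair of clusters whose union increases the within-cluster variance the least). The function name and sample data are hypothetical, and the source does not state how the product implements this step, so treat it as a teaching aid only.

```python
def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering with Ward's criterion, stopped at n_clusters."""
    # Each cluster is a pair: (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]

    def merge_cost(a, b):
        (members_a, centroid_a), (members_b, centroid_b) = a, b
        sq_dist = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
        na, nb = len(members_a), len(members_b)
        return na * nb / (na + nb) * sq_dist

    while len(clusters) > n_clusters:
        # Find the pair whose merge increases within-cluster variance the least.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        (ma, ca), (mb, cb) = clusters[i], clusters[j]
        members = ma + mb
        centroid = [(len(ma) * x + len(mb) * y) / len(members) for x, y in zip(ca, cb)]
        clusters[i] = (members, centroid)
        del clusters[j]

    labels = [0] * len(points)
    for label, (members, _) in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels


# Hypothetical spend patterns over three based-on items (each row sums to 1).
points = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
    [0.0, 1.0, 0.0],
]
labels = ward_clusters(points, 3)
```

Stopping the merges earlier (a larger `n_clusters`) corresponds to cutting the tree closer to the leaves; stopping later corresponds to cutting it closer to the root.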
4. Explore Clustering Results (Result Step)
When you arrive at the Result step from the Configuration step, the model runs a calculation that can take several minutes, depending on the size of the data and the number of groups and based-ons.
Once the calculation has run, three tabs appear:
Overview – Shows the best clustering solution that respects the user’s settings.
Details (groupBy) – Lists all the groups and how they have been assigned to a given cluster.
Number of clusters – Shows many clustering alternatives based on a single linkage matrix.
4.1. Overview
This tab has 6 widgets:
Clusters
List of clusters with their name (ClusterName), which describes the main composition of the cluster, the number of groups (Nb_ followed by the name of the group column), the number of transactions (Nb_Transactions), and the total expense in the cluster (TotalExpense). NumberOfSizes, most often equal to 5, shows the number of quintiles that can be computed inside a cluster. If it is less than 5, some groups are very big relative to the others in this cluster. TotalExpenseRatio is the percentage of the expense/revenue contained in a cluster.
Clustering Numbers
Some high-level information about the current clustering solution.
Clustering Metrics
For description of the 3 metrics used for comparing clustering solutions see Clustering Metrics (Clustering).
Relative Expenses per Cluster
Heatmaps that present the average behavior of each cluster according to the based-on dimensions. The data are normalized between 0 and 1.
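As an illustration of what such a heatmap contains, the sketch below averages the spend patterns of the groups in each cluster and rescales the result to the range 0 to 1. The function names and sample data are hypothetical, and applying a single min-max scaling over all cells is an assumption; the source only states that the data are normalized between 0 and 1.

```python
def cluster_profiles(shares_by_group, cluster_of):
    """Average the based-on shares of the groups in each cluster."""
    sums, counts = {}, {}
    for group, shares in shares_by_group.items():
        cluster = cluster_of[group]
        counts[cluster] = counts.get(cluster, 0) + 1
        row = sums.setdefault(cluster, {})
        for based_on, share in shares.items():
            row[based_on] = row.get(based_on, 0.0) + share
    return {c: {b: s / counts[c] for b, s in row.items()} for c, row in sums.items()}


def normalize_0_1(profiles):
    """Min-max scale all heatmap cells to [0, 1] (global scaling is an assumption)."""
    cells = [v for row in profiles.values() for v in row.values()]
    lo, hi = min(cells), max(cells)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all cells are equal
    return {c: {b: (v - lo) / span for b, v in row.items()} for c, row in profiles.items()}


# Hypothetical spend patterns and cluster assignments.
shares_by_group = {
    "C1": {"Tools": 0.8, "Paint": 0.2},
    "C2": {"Tools": 0.6, "Paint": 0.4},
    "C3": {"Tools": 0.0, "Paint": 1.0},
}
cluster_of = {"C1": "A", "C2": "A", "C3": "B"}
heat = normalize_0_1(cluster_profiles(shares_by_group, cluster_of))
```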
Details of Clusters
Information on the clustered groups, such as the cluster the group belongs to, the relative size of this group inside the cluster (A = big, E = small), the rank of the group from a cumulative revenue point of view, the rank of the group from a number of transactions point of view, etc.
Visualization of Clusters on Virtual Axis (Computed by PCA)
This two-dimensional plot shows all the groups and the cluster each of them has been assigned to. The axes are produced by reducing the dimensionality of all the based-ons; the two axes of the plot are the two main axes that best separate the data. The percentage of explained variance, computed on the group x based-on matrix, is displayed for each axis and in the subtitle. Groups are colored according to the cluster label they have received.
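The explained variance shown on each axis can be understood with a small PCA sketch. The code below computes the share of total variance carried by the two main axes of a group x based-on matrix, using power iteration with deflation in plain Python to stay dependency-free. A numerical library would normally be used instead, and the function name and sample matrix are hypothetical; how the product computes its PCA is not described in this document.

```python
import math


def explained_variance_ratios(matrix, n_axes=2, iterations=500):
    """Share of total variance carried by the first `n_axes` principal axes."""
    n, d = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in matrix]
    # Sample covariance matrix of the based-on columns (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    total_variance = sum(cov[j][j] for j in range(d))

    ratios = []
    for _axis in range(n_axes):
        # Non-uniform start vector: rows of shares sum to 1, so the all-ones
        # direction carries no variance and would be a degenerate start.
        vec = [float(j + 1) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _step in range(iterations):
            nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        # The Rayleigh quotient gives the variance along this axis.
        eigenvalue = sum(vec[a] * cov[a][b] * vec[b]
                         for a in range(d) for b in range(d))
        ratios.append(eigenvalue / total_variance)
        # Deflate: remove the found axis before searching for the next one.
        cov = [[cov[a][b] - eigenvalue * vec[a] * vec[b] for b in range(d)]
               for a in range(d)]
    return ratios


# Hypothetical spend patterns over three based-on items (each row sums to 1).
matrix = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
    [0.0, 1.0, 0.0],
]
ratios = explained_variance_ratios(matrix)
```

Because each row of shares sums to 1, three based-ons span only two independent directions, so in this toy case the two axes together explain all of the variance. With many based-ons the two axes usually explain only part of it, which is why the percentages are worth checking before reading too much into the plot.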
4.2. Details (groupBy)
Details of Clusters
Information on the clustered groups, such as the cluster the group belongs to, the relative size of this group inside the cluster (A = big, E = small), the rank of the group from a cumulative revenue point of view, the rank of the group from a number of transactions point of view, etc.
Details of Target Metric
Indexed matrix group x based-on.
Relative Expense Per Group-Bys
Heatmap made from the indexed group x based-on matrices. It presents similar groups and their proximity. Rendering it properly requires a lot of resources, so the groups may be sampled.
If the heatmap happens to have too many cells, only a subset is displayed.
4.3. Number of Clusters
Clustering Absolute Metrics (Higher Is Better)
Raw (unscaled) metric values, which can be useful for comparing the results of different models.
For further details about the scores, see Clustering Metrics (Clustering).
Scaled Scores by Number of Clusters
This is a visualization of the metrics across all the clustering options offered by the hierarchical tree built behind the scenes, before any threshold is applied. It helps you assess whether the min-max range of the number of clusters set in the Configuration step is well defined or should be adjusted. Nevertheless, you should prefer a reasonable number of clusters over good metrics with an impractical number of clusters.
Clustering Alternatives
This table contains high-level data about the different clustering solutions of the same hierarchical tree, including all scores for each number of clusters.
5. Export Clustering to Data Source
Clicking the Continue button from the Result step triggers the export of the model table clusterd_groupbys to a Data Source that will be named:
CLUST_[name of group]_[name of based-on]_[name of target]_[optional suffix]
All names are limited in length to 7 characters. If a Data Source with the same name already exists, the old Data Source will be overwritten.
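The naming pattern can be sketched as follows. This is one reading of the rule, with a hypothetical helper function: it assumes that each component (group, based-on, target, and suffix) is truncated to 7 characters independently and that an empty suffix is simply omitted. The attribute names in the example are illustrative only.

```python
def data_source_name(group, based_on, target, suffix="", max_len=7):
    """Build CLUST_[group]_[based-on]_[target]_[suffix], truncating each part."""
    parts = [group, based_on, target] + ([suffix] if suffix else [])
    return "CLUST_" + "_".join(part[:max_len] for part in parts)


# Hypothetical attribute names.
name = data_source_name("CustomerId", "ProductGroup", "Revenue", "2024")
```

Because of the truncation, two different attribute combinations can end up with the same Data Source name, in which case the later export replaces the earlier one. The optional suffix is a simple way to avoid that.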
Additional information
All computation results are stored as tables of the model. These tables can be accessed through the menu in the top right corner. But usually you do not need to access them this way; all needed information is directly provided in the three result sections of the model.