Data Requirements (Clustering)

Clustering operates on a Transactions Datamart (or possibly a Datasource) that must contain suitable data. At least four columns have to be prepared for their expected roles:

  • Revenue – Must be a numerical type or money type column without missing or negative values; otherwise the values displayed in the results may be incorrect.

  • Target – This is the metric used for the clustering itself. It must be a numerical type or money type column without missing values, otherwise the clustering process cannot function normally. Target and Revenue may point to the same column in the Datamart, for example if Spent Pattern is used as the metric.

  • Group – Defines the dimension that will be clustered, e.g. customers, if you want to sort customers into groups.
    Must be a dimension type column without missing values. If some values are missing, it is recommended to handle them in advance: replace missing data with a string that contains no spaces, e.g. “unknown” or “missing-data”. At least 2 different groups have to be present in the dataset.

  • Based On – Defines the dimension used to compare the groups. It must be a dimension type column without missing values. At least 2 different values of Based On have to be present in the dataset. The dimension for Based On should also differ from the one for Group and cannot be part of the same hierarchy (to be specific, those two dimensions cannot be collinear).
    It is advised to handle missing values in advance: replace missing data with a string that contains no spaces, e.g. “unknown” or “missing-data”.
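
The requirements above can be checked before the data is loaded. The following is a minimal Python sketch, assuming the transactions are available as a list of dictionaries outside the application. The function names, the column names (Customer, ProductGroup, Revenue) and the “unknown” placeholder are hypothetical and not part of the product. The sketch does not try to detect whether Group and Based On belong to the same hierarchy.

```python
import math

MISSING_LABEL = "unknown"  # hypothetical placeholder; any string without spaces works


def is_missing(value):
    """Treat None, blank strings and NaN as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, float):
        return math.isnan(value)
    return False


def is_number(value):
    """True for int/float values, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def fill_missing_dimensions(rows, columns, label=MISSING_LABEL):
    """Return copies of the rows with missing dimension values replaced by label."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            if is_missing(new_row.get(col)):
                new_row[col] = label
        cleaned.append(new_row)
    return cleaned


def check_clustering_rows(rows, revenue, target, group, based_on):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if group == based_on:
        problems.append("Group and Based On must be different columns")
    # Revenue and Target may be the same column, so check each name only once.
    for col in dict.fromkeys((revenue, target)):
        values = [row.get(col) for row in rows]
        if any(is_missing(v) for v in values):
            problems.append(f"{col}: missing values")
        elif not all(is_number(v) for v in values):
            problems.append(f"{col}: non-numeric values")
    # Only Revenue is required to be non-negative.
    if any(is_number(row.get(revenue)) and row.get(revenue) < 0 for row in rows):
        problems.append(f"{revenue}: negative values")
    # Group and Based On need no missing values and at least 2 distinct values.
    for col in dict.fromkeys((group, based_on)):
        values = [row.get(col) for row in rows]
        if any(is_missing(v) for v in values):
            problems.append(f"{col}: missing values")
        if len({v for v in values if not is_missing(v)}) < 2:
            problems.append(f"{col}: fewer than 2 distinct values")
    return problems


# Example with hypothetical column names; Revenue doubles as the Target here.
rows = [
    {"Customer": "C1", "ProductGroup": "A", "Revenue": 100.0},
    {"Customer": "C2", "ProductGroup": None, "Revenue": 250.0},
    {"Customer": "C1", "ProductGroup": "B", "Revenue": 80.0},
]
rows = fill_missing_dimensions(rows, ["Customer", "ProductGroup"])
problems = check_clustering_rows(rows, "Revenue", "Revenue", "Customer", "ProductGroup")
```

In this example the missing ProductGroup value is replaced with the placeholder before the checks run, so the returned list of problems is empty. Returning a list of messages rather than stopping at the first failure makes it possible to fix all data issues in one pass.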