Take the following steps to configure a model:

Introduction
1. Create a Model Based on ProductSimilarity Model Class
2. Set the Scope of Products (Definition Step)
3. Set the Scope of Transactions (Definition Step)
4. Configure the Similarity Model (Definition Step)
5. Adjust Weights (Similarity Weighting Step)
6. Get Overview on Similarities (Product Similarity Step)
7. Explore Similarity per Product (Product Similarity Step)
8. Configure Similarity Grouping (Product Similarity Step)
9. Explore Similarity at Group level (Product Grouping Step)
10. Explore Products in Their Groups (Product Grouping Step)
Best Practices

Introduction

The Product Similarity accelerator provides data on the similarity between products (through a similarity score) and then groups products based on that score. The solution is powered by machine learning models. This documentation guides you through setting up, configuring, and using this optimization model.

Main Features

Product Similarity Step – Uses a combination of product attributes in order to compute the similarity between products. Each type of attribute (textual, categorical, numerical) provides a score that can be weighted in order to give more importance to some attributes.
Product Grouping Step – Builds a connected graph based on computed similarities and detects product groups.

1. Create a Model Based on ProductSimilarity Model Class

Go to Optimization > Models (MO) and click the Add Model button at the top right.

Creation of a model

A pop-up is shown where you provide a name for your model and select the Model Class, which is ProductSimilarity. Here you can also configure which user groups can edit ProductSimilarity models and which user groups can view the details of computed ProductSimilarity models. Other Model Classes belong to other kinds of optimization models.

Select model class

You can also duplicate an existing model. In this case, you will keep all the inputs of the previous model and you will have to rerun all the steps to get the outputs. Once you have copied a model, you can change its name by double-clicking the blank side of its name/label.

The same model class, ProductSimilarity, can be used by many models. Use informative names for your models, providing information on your dataset, and your calculation case.

The model you are configuring aims at identifying products that appear similar in your business and regrouping them under data-driven labels, i.e. similarity groups. Similarity is estimated from product-specific data (descriptions, hierarchical levels, etc.) and, optionally, transactional data. To process information related to similarity, two machine learning models are used: the first one handles textual descriptions, and the second one regroups products into meaningful groups. You will have to set some parameters to make these algorithms work well on your data.

2. Set the Scope of Products (Definition Step)

This step is pivotal as it lays the foundation for your analysis. To ensure the accuracy and relevance of the similarity assessments, you need to define a clear scope for your products based on the following attributes.

Define the range of products you want to analyze:

Go to the Product Data Scope tab under the Definition step.
Select your Product table with the correct name under Data Source Input (numerical, textual, or categorical).
Use filters to narrow down the scope of this analysis. You can exclude products based on any product descriptor or combination of descriptors.
- Product ID – This field is mandatory. Every product should have a unique identifier to differentiate it from others. This identifier can be a Product ID, SKU, or any other unique code.
- Text Attributes – This is any textual information of the product providing context or description. Common attributes include product names, descriptions, or other labels.
  Even if a product name is not mandatory, it is highly recommended for the rest of this analysis.
  These columns will be processed using sentence transformers to compute textual similarity.
  Texts over 255 characters will not be considered.
- Categorical Attributes – Categorical attributes are essential for classifying products into different categories or hierarchies. This data helps narrow down product pairs for similarity computations.
  To set it up, go to the Categorical Attributes section and choose columns that represent categorical data, such as product category, brand, colors, types, etc.
- Numerical Attributes – These attributes represent a variety of metrics, such as sales volume, weight, or dimensions. Depending on their nature, they can be summed up or averaged during the analysis.
  To set it up, go to the Numerical Attributes subsection and for each numerical column, choose whether you want it to be summed up or averaged during analysis.
- Price Delta Threshold – Products with significantly different prices should probably not be considered similar even if their names are similar (for example, a spare part whose name mentions the original product), so you can restrict the comparison to products within a defined price range.
  - In the Price Delta Threshold section, enter a value. Products with price differences exceeding this value will be excluded from similarity computations. Periodically review the set price delta threshold for its relevance.
  - Click Apply Settings.
  - Review and confirm the uploaded data, check the selected columns to ensure they match your intent.
  - Click Continue to proceed to the next step or go to the next tab to go further in the configuration.
  Remember, clarity and correctness of this scope directly influence the subsequent steps and the accuracy of your results. Therefore, ensure you thoroughly understand your data and their attributes.
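The Price Delta Threshold described above can be pictured as a simple pair filter. The sketch below is only an illustration: it assumes the threshold is an absolute price difference (the accelerator may interpret it differently, for example as a relative difference), and the function name and input shape are hypothetical.

```python
from itertools import combinations


def pairs_within_price_delta(prices, max_delta):
    """Return product pairs whose absolute price difference is <= max_delta.

    prices: dict mapping product ID -> price (hypothetical input shape).
    Assumption: the threshold is an absolute difference, not a percentage.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(prices), 2)
        if abs(prices[a] - prices[b]) <= max_delta
    ]


prices = {"P1": 100.0, "P2": 105.0, "P3": 12.0}
candidate_pairs = pairs_within_price_delta(prices, max_delta=10.0)
print(candidate_pairs)  # only P1/P2 are close enough in price
```

With these sample prices, the low-priced P3 (for instance a spare part) is never compared with P1 or P2, whatever its name says.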

3. Set the Scope of Transactions (Definition Step)

Incorporating transactional data can add depth to the analysis, allowing the system to gauge product similarity on transactional attributes as well as on inherent product attributes. Transactional data is also the likely source of an average selling price, which you can use to define an acceptable price range. Here is a detailed guide on how to integrate and structure this data:

Go to the Transactions Scope tab under the Definition step.
Select the transactional Data Source [DS] with relevant data (e.g., quantity, revenue, margin).
Using filters to narrow down the range of transactions you want to consider for your analysis can improve the quality of results and speed up the processing. Applying filters (for example, the last two years) can help in focusing on recent trends and making the analysis more relevant to current market conditions.
Select the mandatory Product ID (unique identifier) among the columns from your transactional data that correlates with the product's unique identifier. It is crucial for consistency that this matches the unique identifier chosen in Step 2.
Select Textual, Categorical and Numerical attributes in the same way as described in Step 2.
Click Apply Settings.
Review and confirm the transactional data.

Recommendations

Data Consistency – Ensure that the unique product identifiers in the transactional data match exactly with those in the product data from Step 2. Any discrepancy can lead to missing or inaccurate insights.
Regularly Update Transactional Data – The more recent your data, the more relevant and actionable the insights. Schedule periodic updates to keep the system's analysis current.
Consider Seasonal Variations – If you are dealing with products that have seasonal variations in sales (e.g., winter clothes or summer accessories), consider this when choosing your date range (e.g. using last 12 months) and interpreting results.
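This step mentions deriving an average selling price from transactions and filtering them by date. A minimal sketch of that idea follows, assuming each transaction carries a product ID, a date, a quantity and a revenue; the field names and the function are hypothetical and do not reflect the accelerator's internal logic.

```python
from collections import defaultdict
from datetime import date


def average_selling_price(transactions, since):
    """Return revenue / quantity per product for transactions on or after `since`."""
    revenue = defaultdict(float)
    quantity = defaultdict(float)
    for t in transactions:
        if t["date"] >= since:  # date filter, e.g. the last two years
            revenue[t["product_id"]] += t["revenue"]
            quantity[t["product_id"]] += t["quantity"]
    # Skip products with no sold quantity to avoid dividing by zero.
    return {pid: revenue[pid] / quantity[pid] for pid in revenue if quantity[pid] > 0}


transactions = [
    {"product_id": "P1", "date": date(2024, 3, 1), "quantity": 2, "revenue": 20.0},
    {"product_id": "P1", "date": date(2024, 6, 1), "quantity": 3, "revenue": 36.0},
    {"product_id": "P2", "date": date(2021, 1, 1), "quantity": 1, "revenue": 99.0},
]
asp = average_selling_price(transactions, since=date(2023, 1, 1))
print(asp)
```

In this example P2 drops out because its only transaction is older than the chosen start date, which shows how a date filter changes which products get a price at all.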

4. Configure the Similarity Model (Definition Step)

Fine-tune the similarity model for best results:

Select the Model Configuration tab.
Select the right Text Transformer – English or Multilingual – for the content of the text columns selected in the Product Data Scope and Transaction Data Scope tabs. The text transformer is a crucial component when dealing with textual attributes. It is responsible for converting raw text into numerical vectors that can be used for similarity computations. Depending on the language and scope of your product descriptions, you have two choices.
Multilingual supports: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.
The Refresh Text Transformer check box controls whether the text encoding is recomputed. When it is cleared, the encoding, which can take a long time, is skipped if it was already done earlier with the same scope. This is useful when you re-execute the model after modifying only other parameters (such as categorical or numerical attributes). Make sure the products are the same; if in doubt, select this check box to re-execute the text transformers.
Set Maximum Number of Similar Products that are kept in next steps of the analysis. This defines the maximum number of similar products that will be retained for each product in the dataset. For instance, if you enter '10', for every product, the system will keep data on its top 10 most similar counterparts based on the computed scores.
After the transformation, the system computes similarity scores for most of the possible product pairs. For efficiency and clarity, however, you may want to limit the number of similar products kept for further analysis: a large number (more than 20) makes the computation longer and complicates the subsequent grouping analysis, while a number that is too small (below 5) risks producing less meaningful results.
Click Continue.
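To make the effect of Maximum Number of Similar Products concrete, the sketch below keeps only the highest-scoring counterparts of each product. It is an illustration under assumptions: the input is a hypothetical dictionary of pair scores, and the accelerator's actual retention logic is not documented here.

```python
import heapq
from collections import defaultdict


def keep_top_n(pair_scores, n):
    """Keep, for each product, its n highest-scoring counterparts.

    pair_scores: dict mapping (product_a, product_b) -> similarity score.
    Returns a dict mapping product ID -> list of (counterpart, score),
    ordered from most to least similar.
    """
    by_product = defaultdict(list)
    for (a, b), score in pair_scores.items():
        # A pair is a candidate for both of its products.
        by_product[a].append((score, b))
        by_product[b].append((score, a))
    return {
        pid: [(other, score) for score, other in heapq.nlargest(n, candidates)]
        for pid, candidates in by_product.items()
    }


pair_scores = {("P1", "P2"): 0.9, ("P1", "P3"): 0.7, ("P2", "P3"): 0.4}
top = keep_top_n(pair_scores, n=1)
print(top)
```

With n=1, the weak P2/P3 relationship disappears entirely, which is why a value that is too small can hide relevant similarities.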

Recommendations

Know Your Text Data – Before selecting a text transformer, ensure you are familiar with the languages present in your product descriptions. While the multilingual transformer is powerful, using it unnecessarily can be computationally intensive.
Start with a Moderate Number – For Maximum Number of Similar Products, initially opt for a moderate number; 10 is a good start. Too few can miss out on relevant similarities, while too many can overwhelm the analysis. Once you are familiar with the system's outputs, you can adjust this number for subsequent runs.

5. Adjust Weights (Similarity Weighting Step)

In this step, the user has the opportunity to adjust the influence of each attribute type on the overall similarity score. The previous computation provides a similarity score up to 1 (perfectly matching) for each type of attribute:

Text Attributes – Text Similarity based on cosine distance on the transformer encoded vector.
Categorical Attributes – Categorical Similarity based on Hamming distance of the categorical values.
Numerical Attributes – Numerical Similarity based on Mahalanobis distance of the numerical values.
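For readers unfamiliar with these measures, here is a minimal sketch of two of them. It assumes that text similarity is the cosine of the angle between two encoded vectors and that categorical similarity is the share of matching attribute values; how the accelerator actually converts distances into scores is not stated in this guide. Mahalanobis distance is left out because it requires the covariance matrix of the numerical attributes.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def hamming_similarity(a, b):
    """Share of positions where two equal-length categorical tuples agree."""
    if len(a) != len(b):
        raise ValueError("attribute tuples must have the same length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Two products sharing category and type but not color agree on 2 of 3 values.
print(hamming_similarity(("Cables", "red", "USB-C"), ("Cables", "blue", "USB-C")))
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical directions
```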

A weighted approach allows for flexibility in how product similarities are determined based on the importance and relevance of each attribute type. Not every product attribute holds equal importance. Depending on the business context, some attributes may play a more significant role in determining product similarity than others.

Setting Attribute Weights

Default weights are displayed in the left menu for each type of attribute previously selected. You can adjust these weights; when you click Apply Settings, the dashboard on the right reflects the new values.

For instance, in most cases, text attributes provide more information and differentiate the products, so a higher weight is probably a good option.

For selected pairs of products, the system calculates a Weighted Average Similarity (WAS) from the three similarity scores and their weights, where:

wt is the weight for textual attributes.
St is the similarity score for textual attributes.
wc is the weight for categorical attributes.
Sc is the similarity score for categorical attributes.
wn is the weight for numerical attributes.
Sn is the similarity score for numerical attributes.
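Using the symbols defined above, a weighted average would normally take the form WAS = (wt × St + wc × Sc + wn × Sn) / (wt + wc + wn). The sketch below implements that form. Normalizing by the sum of the weights is an assumption based on the name "weighted average"; the accelerator's exact formula may differ, and the function and key names are hypothetical.

```python
def weighted_average_similarity(scores, weights):
    """Weighted average of per-attribute-type similarity scores.

    scores:  dict such as {"text": St, "categorical": Sc, "numerical": Sn}
    weights: dict with the same keys holding wt, wc, wn
    Assumption: the result is normalized by the sum of the weights.
    """
    total_weight = sum(weights[k] for k in scores)
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(weights[k] * scores[k] for k in scores) / total_weight


scores = {"text": 0.9, "categorical": 0.5, "numerical": 0.7}
weights = {"text": 2.0, "categorical": 1.0, "numerical": 1.0}
was = weighted_average_similarity(scores, weights)
print(round(was, 2))  # (1.8 + 0.5 + 0.7) / 4
```

Doubling the text weight pulls the result toward the text score, which is the behavior described for cases where text attributes differentiate products best.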

Setting the Similarity Threshold

To further refine the results, use the Minimum Similarity Threshold input field to set a value between 0 and 1.
Any product pairs with a weighted average similarity below this threshold will be ignored in subsequent analyses. This ensures you are focusing only on the most relevant and significant similarities.

Setting the threshold too low can let through products that are scored as similar but are not similar enough from a business point of view. The default value is 0.6, and we do not advise setting the threshold below 0.4.

Scope

Gives you insights on:

Total number of products
Total number of relationships
Number of relationships above Minimum Similarity Threshold
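The three Scope figures can be reproduced from a list of scored pairs, as in the sketch below. It assumes a pair is kept when its score is greater than or equal to the threshold (the text only says that pairs below it are ignored); the function name and input shape are hypothetical.

```python
def scope_summary(pair_scores, threshold):
    """Count products, relationships, and relationships at or above the threshold.

    pair_scores: dict mapping (product_a, product_b) -> weighted average similarity.
    """
    products = {pid for pair in pair_scores for pid in pair}
    kept = sum(1 for score in pair_scores.values() if score >= threshold)
    return {
        "products": len(products),
        "relationships": len(pair_scores),
        "relationships_above_threshold": kept,
    }


pair_scores = {("P1", "P2"): 0.9, ("P1", "P3"): 0.6, ("P2", "P3"): 0.3}
summary = scope_summary(pair_scores, threshold=0.6)
print(summary)
```

Comparing the last two counts while you move the threshold gives a quick sense of how many relationships each setting discards.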

Visualize Similarity Distributions

Overall Similarity Summary Statistics can provide insights into how the three kinds of computed similarities are distributed across the dataset. These box plot charts show the distribution of values for textual, categorical, and numerical similarities. This visualization helps you understand the spread and central tendency of similarity scores, which can impact weight adjustments.

Explore Similar Products

Get a hands-on feel of the similarity results:

Navigate to the Explore Similarities section.
Select a product from the dropdown list. The system will display similar products based on the computed weighted average similarities.
For a deeper dive, explore the Target Product Meta Data and Similar Co-Products Meta Data tables below, which show the properties of the selected product and of its similar counterparts. This allows for side-by-side comparison and helps you validate the quality of the similarity computations.

Recommendations

Iterative Approach – Start with equal weights and adjust based on the visual feedback from the box plots and product explorations. It is often helpful to iterate on weightings and thresholds multiple times to achieve optimal results.
Domain Knowledge – If available, involve a domain expert. They can provide insights into which attributes are more critical in determining product similarity in your specific business context.
Limit Extremes – While it is possible, avoid setting any attribute weight to an absolute zero unless you are certain it holds no relevance. Even minor influences can sometimes provide valuable nuances in similarity computations.

6. Get Overview on Similarities (Product Similarity Step)

This step presents the results of the similarity computations in a comprehensive dashboard designed to provide users with a holistic view of the results, visualization aids, and relevant metrics that help in assessing the efficacy of the similarity computations.

Product Similarity Results

This section presents high-level metrics which give a snapshot of the similarity results.

Product Similarity Results:

Number of Products – Total count of unique products that were part of the similarity computation.
Average Number of Similar Products – Across all products, this metric presents the mean count of products that were deemed similar based on the user-defined threshold.
Average Similarity Score – Represents the average weighted similarity score across all product pairs that surpassed the threshold.
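As an illustration of how these three metrics relate to the retained relationships, the sketch below derives them from a per-product list of similar products. The input shape is hypothetical, and averaging over every (product, co-product) row is an assumption about how the dashboard counts pairs.

```python
from statistics import mean


def overview_metrics(similar):
    """Summarize retained similarities.

    similar: dict mapping product ID -> list of (co-product ID, score)
             that passed the threshold (hypothetical input shape).
    """
    counts = [len(rows) for rows in similar.values()]
    scores = [score for rows in similar.values() for _, score in rows]
    return {
        "number_of_products": len(similar),
        "average_number_of_similar_products": mean(counts) if counts else 0.0,
        "average_similarity_score": mean(scores) if scores else 0.0,
    }


similar = {
    "P1": [("P2", 0.9), ("P3", 0.7)],
    "P2": [("P1", 0.9)],
    "P3": [("P1", 0.7)],
}
metrics = overview_metrics(similar)
print(metrics)
```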

Parameters is a refresher section that reminds users about the parameters they set, ensuring transparency and easy revisits for iterations.

Maximum Number of Similar Products – This denotes the upper limit on the count of similar products retained for each product post-similarity computation.
Minimum Similarity Threshold – Indicates the cut-off similarity score below which product pairs were excluded from the results.

Similarity Scores Histogram

This bar chart provides insights into how the weighted average similarities are spread across product pairs.

Bar chart axes:

Horizontal Axis – Shows ranges of similarity scores.
Vertical Axis – Shows number of product pairs that fall within each similarity score range.

By looking at this chart, users can understand the concentration and distribution of similarity scores, which can be pivotal when deciding to revisit weights or thresholds.

Number of Similar Products

This histogram shows how the number of similar products is distributed across products.

Histogram axes:

Horizontal Axis – Shows bins representing count of similar products.
Vertical Axis – Shows number of products that have a certain count of similar products.

It helps users visualize if most products have a lot of similar counterparts or if only a few products dominate the similarity landscape.

Similarity Output and Metadata

A detailed data table showcasing the pairs of products (a line is a relationship) that meet the similarity criteria.

Columns in the table:

ProductID – Identifier for the first product in the pair.
CoProductID – Identifier for the second product in the pair.
Similarity – Computed weighted average similarity score (WAS) for this product pair.
Rank – For a given ProductID, this is the rank of this relationship.
Others – Available attributes for Product and CoProduct.

By exploring this table, users can examine individual product similarities in detail, which helps with validation and further exploration.

7. Explore Similarity per Product (Product Similarity Step)

It is time to narrow down from the broad perspective of all products to the intricate details of a single chosen product. This dashboard is centered around understanding the similarity landscape for one product and observing its relationships in the dataset.

First, select a product by its Product ID from the dropdown menu on the left.

Available Charts and Summaries

Product Overview

Here you will get a detailed snapshot of the selected product's similarity status within the dataset.

Product ID – Unique identifier for the chosen product.
Product Name – The name or descriptor of the product.
Number of Similar Products – The count of products that were deemed similar based on the user-defined threshold.
Average Similarity Score – The mean similarity score of the chosen product against its similar products.
Max Similarity Score – The highest similarity score achieved by the product against any of its counterparts.
Min Similarity Score – The lowest similarity score of the chosen product against its similar products.

Parameters

This section is a quick reminder of the user-defined parameters that influenced the similarity results.

Maximum Number of Similar Products – The upper limit set on the count of similar products retained for each product.
Minimum Similarity Threshold – The user-defined cut-off similarity score.

Similarity Histogram

This visual representation provides a clear view of how the similarity scores between the selected product and its counterparts are distributed.

Horizontal Axis – Shows ranges of similarity scores.
Vertical Axis – Shows number of products (similar to the chosen product) that fall within each similarity score range.

Similarity Graph

This graph displays the network of relationships around the selected product.

Nodes:
- Dark Green Node – Represents the chosen product.
- Light Green Nodes – Denote the products that are similar to the chosen product.
Edges (or connections):
- Connect the chosen product to each similar product.
- Existing edges between light green nodes represent connections between similar products, indicating they are also considered similar to each other.

This graph helps in understanding the local density of relationships, which can be helpful for certain analyses, such as network analysis and product grouping. The more densely interconnected this graph is compared with the graphs of other products, the more meaningful the overall similarity it shows.

Product Data

This is a detailed table showcasing the various attributes and properties of the selected product. Depending on the dataset, this can include columns like:

Product ID
Product Name
Category
Price
Description
And more...

Similar Products Data

This is a tabulated view detailing the attributes and properties of all products that are similar to the chosen product. This table provides side-by-side comparison capabilities, making it easier to validate and analyze the similarity results.

Columns in the table (example):

Product ID
Product Name
Category
Price
Similarity Score (against the chosen product)
And more...

8. Configure Similarity Grouping (Product Similarity Step)

The last tab of the Product Similarity step is about transitioning from identifying similar products to creating tangible groups of these products. The configuration in this step ensures the resulting groups are meaningful, manageable, and aligned with business needs.

Maximum Number of Products within a Group

This parameter controls the size of each similarity group. Depending on the business use case, users might prefer smaller, tightly-knit groups or larger, broader clusters.

Default Value is 150 products. This offers a balance between granularity and comprehensibility. The choice can be influenced by the business context or the total number of products in the dataset.

Additional Fields for Naming Groups

Group names are automatically generated based on the frequency of words in the text attribute defined under Product Name in the Definition step. You can enrich that text with additional fields to make these names more meaningful, typically turning them into product categories. The dropdown menu contains the text fields that were selected in the Definition step. You can select one or multiple fields from the dropdown menu. The values from these fields will be used to derive a name or label for each similarity group. For instance, if Category and Brand fields are chosen, a group might be named "Electronics - Samsung".

Please be careful to select fields that are unique per product, otherwise the naming will fail.
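The idea of naming a group from word frequencies can be sketched as follows. This is a simplified illustration, not the accelerator's naming algorithm: the tokenization, the number of words kept, and the way additional field values are prepended are all assumptions, and the function name is hypothetical.

```python
import re
from collections import Counter


def name_group(product_names, extra_values=(), n_words=2):
    """Build a group label from the most frequent words in product names.

    extra_values: values of additional fields (e.g. category, brand) that are
                  placed in front of the frequent words.
    """
    words = Counter()
    for name in product_names:
        words.update(re.findall(r"[a-z0-9]+", name.lower()))
    # Most frequent first; ties are broken alphabetically for stable output.
    ranked = sorted(words.items(), key=lambda item: (-item[1], item[0]))
    top = [word for word, _ in ranked[:n_words]]
    return " - ".join(list(extra_values) + [" ".join(top)])


names = ["Galaxy Phone 128GB", "Galaxy Phone 256GB", "Galaxy Tablet"]
label = name_group(names, extra_values=("Electronics", "Samsung"))
print(label)  # Electronics - Samsung - galaxy phone
```

Two groups whose products share the same dominant words would receive the same label under such a scheme, which is one reason to enrich names with additional fields.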

Generate Graphs for Visualization

Graphical representations can help in visualizing the formed groups and understanding the relationships within and between them. However, with extensive data, graph generation might be resource-intensive and time consuming.

Toggle Option (Yes/No) – Users decide whether they want the system to generate visual graphs for the similarity groups.
- Yes – The system will create and display graphical representations of groups.
- No – The system will skip this visualization and computation time for this step will be lower, especially for large datasets.

Show All Groups in Graphs at Initialization

Once graphs are generated, users can decide whether they want to view all groups right at the start or explore them one by one. Showing all groups simultaneously can be overwhelming with big data, but it provides a comprehensive overview.

Toggle Option (Yes/No):
- Yes – All similarity groups will be displayed on the graph as soon as it is generated.
- No – The graph will be initialized in a condensed view, and users can choose to expand and explore specific groups as needed.

Once everything is set up, click Continue to trigger the grouping process.

9. Explore Similarity at Group level (Product Grouping Step)

This last step “Product Grouping” provides outputs of the previous steps and a dashboard to review product similarity groups.

The first dashboard takes you deeper into the heart of the grouped products, illustrating the macro landscape of how products have been bucketed and how these buckets relate to one another.

Product Similarity Results

This portlet provides a high-level summary of the results post-grouping.

Number of Groups – Total count of similarity groups created.
Product Count – The total number of products that have been grouped.
Average Number of Product per Group – Provides the average size of each group. It is calculated by dividing the total Product Count by Number of Groups.
Grouping Quality – This value theoretically ranges from 0 to 1 and indicates the efficiency or accuracy of the grouping process (aka modularity). It is based on intra-group similarity (higher is better) and helps you in adjusting the settings. This can also be compared between models.
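Since Grouping Quality is described as modularity, the sketch below computes the standard (Newman) modularity of a grouping on a weighted graph. Whether the accelerator uses exactly this definition, or rescales the value, is an assumption; the function name and input shapes are hypothetical.

```python
from collections import defaultdict


def modularity(edges, group_of):
    """Standard modularity of a partition of an undirected weighted graph.

    edges:    dict mapping (product_a, product_b) -> similarity weight.
    group_of: dict mapping product ID -> group ID.
    """
    total = sum(edges.values())
    if total == 0:
        return 0.0
    intra = defaultdict(float)   # weight of edges inside each group
    degree = defaultdict(float)  # summed weighted degree of each group
    for (a, b), weight in edges.items():
        degree[group_of[a]] += weight
        degree[group_of[b]] += weight
        if group_of[a] == group_of[b]:
            intra[group_of[a]] += weight
    # Share of intra-group weight minus the share expected at random.
    return sum(intra[g] / total - (degree[g] / (2 * total)) ** 2 for g in degree)


# Two triangles of products joined by a single cross-group relationship.
edges = {
    ("P1", "P2"): 1.0, ("P2", "P3"): 1.0, ("P1", "P3"): 1.0,
    ("P4", "P5"): 1.0, ("P5", "P6"): 1.0, ("P4", "P6"): 1.0,
    ("P3", "P4"): 1.0,
}
group_of = {"P1": "A", "P2": "A", "P3": "A", "P4": "B", "P5": "B", "P6": "B"}
quality = modularity(edges, group_of)
print(round(quality, 3))
```

Putting all six products into a single group would give a value of 0, so a clearly positive value indicates that the grouping captures more intra-group similarity than a random assignment would.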

List of Similarity Groups

This is a tabular representation, scrollable if there are many groups.

GroupID – Unique identifier assigned to each group. This ensures that even if groups have similar names, they can be differentiated.
GroupName – The name or label of the group. This is formed based on the parameters set in Step 8, potentially utilizing additional fields for a descriptive name.

Overview of Similarity Groups

This is a graph representation that shows the interconnected landscape of product groups.

Nodes – Each node represents a similarity group.
- Size – Proportional to the number of products in the group.
- Label – The name of the group printed on top of or below the node.
- Color Coding – It is random.
Edges – Connections between nodes (groups).
- This shows that there are products in the two connected groups that are considered similar.
- The thickness of the edge represents the number of such connections. Thicker lines mean more products between the two groups are similar.
Tooltip – Hovering over a node provides additional details.
- Cardinality – Indicates the number of products in the group.
- GroupID – Unique identifier of the group.
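The node sizes and edge thicknesses of this overview can be derived from product-level relationships, as sketched below. The sketch assumes that thickness is a plain count of similar product pairs spanning two groups; the function name and input shapes are hypothetical.

```python
from collections import Counter


def group_graph(pairs, group_of):
    """Aggregate product-level relationships to the group level.

    pairs:    iterable of (product_a, product_b) similar pairs.
    group_of: dict mapping product ID -> group ID.
    Returns (sizes, thickness): products per group, and the number of
    cross-group pairs per pair of groups.
    """
    sizes = Counter(group_of.values())
    thickness = Counter()
    for a, b in pairs:
        ga, gb = group_of[a], group_of[b]
        if ga != gb:  # only relationships between different groups form edges
            thickness[tuple(sorted((ga, gb)))] += 1
    return dict(sizes), dict(thickness)


pairs = [("P1", "P2"), ("P2", "P3"), ("P1", "P4"), ("P3", "P4")]
group_of = {"P1": "G1", "P2": "G1", "P3": "G2", "P4": "G2"}
sizes, thickness = group_graph(pairs, group_of)
print(sizes, thickness)
```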

10. Explore Products in Their Groups (Product Grouping Step)

This “similarity journey” finishes with a colorful firework style graph. It allows you to deep dive into the intricate web of product relationships and how these relationships translate into groups.

Group Selection

On the left side in Display Products you will find:

Similarity Groups – Contains a list of all group names. Users can multi-select the groups they are interested in.
Apply Settings – After making selection, users need to click this button. The dashboard will then refresh, displaying only products from the chosen groups.

Available Widgets

Product Similarity Results

This widget provides an overview and sets the stage for the exploration that will follow.

Displayed Products – The total number of products that are currently displayed on the graph, based on the user's group selection.
Displayed Groups – The number of unique groups that the displayed products belong to.
Unique Groups Names – The number of unique names of the groups currently displayed. This may differ from Displayed Groups if at least two groups received the same name, which can happen when the textual product descriptors used have a limited vocabulary.
Total Number of Groups – This remains a constant, representing the total groups formed, helping to contextualize the current view in relation to the entire dataset.
Total Number of Products – Another constant, representing the total products in the dataset. This gives the user a sense of scale and perspective.
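The difference between Displayed Groups and Unique Groups Names is easiest to see on a small example. The sketch below counts both for a selection of groups; the tuple layout and function name are hypothetical.

```python
def display_counts(products, selected_groups):
    """Count displayed products, groups, and distinct group names.

    products: list of (product_id, group_id, group_name) tuples.
    selected_groups: set of group IDs chosen in Similarity Groups.
    """
    shown = [p for p in products if p[1] in selected_groups]
    return {
        "displayed_products": len(shown),
        "displayed_groups": len({group_id for _, group_id, _ in shown}),
        "unique_group_names": len({name for _, _, name in shown}),
    }


products = [
    ("P1", 1, "cable usb"),
    ("P2", 1, "cable usb"),
    ("P3", 2, "cable usb"),   # a different group that received the same name
    ("P4", 3, "laptop bag"),
]
counts = display_counts(products, selected_groups={1, 2})
print(counts)
```

Here two groups are displayed but only one distinct name appears, which is the situation described for Unique Groups Names.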

Product Details

An essential table that provides the user with a detailed breakdown of individual products and their affiliations.

ProductID – Unique identifier for each product.
ProductName – If set, this provides the name or label of the product, helping with immediate recognition.
Coordinates in the Graph – X and Y coordinates indicating the product's location in the visual graph. This is particularly useful if users wish to correlate table entries with their visual representations. Coordinates are precomputed, so more data points can be displayed.
GroupID – The unique identifier of the group the product has been assigned to.
GroupName – The name or label of the associated group.

For users' convenience, features like searching, sorting, and filtering can be integrated within this table.

Products and Groups Graph

The main representation of products and their connections.

Nodes – Each node represents an individual product.
- Color Coding – The color of a node indicates the group it belongs to. This helps in visually clustering similar products.
Edges – Thin lines connecting products, representing the similarity between two products.
Tooltip – Hovering over a product node gives a concise overview.
- ProductName – The name of the product.
- ProductID – Its unique identifier.
- GroupName – The name of the group it belongs to.
- GroupID – Unique identifier of the group.

Zooming, panning, and selecting multiple nodes provide deeper information about the interconnections between products and groups.

This chart requires a lot of resources to be displayed and may fail when too many products are displayed, especially if there are more than 20 000 products to display.

After completing step 10, you will have a comprehensive view of the product landscape, understanding both group-level and individual product-level relationships. This holistic insight can drive better decision-making, foster innovation, and unearth hidden patterns.

Best Practices

Regularly update data – For best results, ensure your product and transaction tables are updated frequently.
Monitor weights – Over time, the importance of certain similarity metrics might change. Periodically review and adjust these.
Feedback loop – After grouping, review product placements in groups. If mismatches are found, adjust model settings and re-run.
Regular data audits – Given the complexity and the multifaceted nature of the data being analyzed, it is recommended to periodically audit your data for consistency and correctness. For instance:
- Ensure the unique identifiers remain unique and consistent across updates.
- Validate that text attributes still make sense and remain relevant.
- For categorical attributes, ensure that the categories are still valid and that no new ones need to be added.
- Re-assess the validity and relevance of numerical attributes.
- Periodically review the set price delta threshold for its relevance.
Data Quality and Cleansing – Before starting with the product similarity analysis, always ensure that your data is clean. This includes:
- Removing duplicates.
- Handling missing values.
- Correcting any inconsistencies or inaccuracies in the data.
  High-quality data will lead to more accurate and reliable similarity results.

Troubleshooting

Missing data – Ensure all uploaded tables have complete data. Missing values can impact the quality of results.
Performance issues – For large datasets, the analysis might take longer. Consider filtering the scope or optimizing your data for faster results.
Incorrect grouping – If groups do not appear accurate, consider adjusting the similarity model settings or the grouping threshold.

Conclusion

Product Similarity is a comprehensive tool designed to streamline the process of identifying and grouping similar products in your business. By following this guide and understanding each step, you can maximize the benefits of this platform for your business needs.