Product Similarity Metrics (Optimization - Product Similarity)

The Product Similarity Accelerator uses several metrics to identify similarities between products. Understanding these metrics clarifies how product similarities are computed.

1. Sentence Transformers & Cosine Similarity (Textual Attributes)
2. Hamming Similarity (Categorical Attributes)
3. Mahalanobis Similarity (Numerical Attributes)
5. Weighted Average Similarity
6. Graph Construction & Community Detection

1. Sentence Transformers & Cosine Similarity (Textual Attributes)

Sentence Transformers

Concept: Sentence transformers map textual attributes (such as product names or descriptions) to dense vector representations. Because the transformer has been pre-trained to encode meaning, these vectors capture the semantics of the text rather than its exact wording. Synonyms are therefore represented by similar vectors, so two similar products, such as Battery and Accumulator, receive a high similarity score.

Use: Translates text into vectors so that comparing texts becomes a mathematical operation. If several fields are selected, they are concatenated.

Cosine Similarity

Concept: It measures the cosine of the angle between two non-zero vectors.

Cosine Similarity(A, B) = (A⋅B) / (∥A∥ ∥B∥)
where A and B are vectors, A⋅B is their dot product, and ∥A∥ and ∥B∥ are their Euclidean norms.

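Use: After the textual descriptors are transformed into vectors, the cosine similarity between the vectors of two product descriptions gives their similarity for textual attributes. Values close to 1 mean the two vectors point in nearly the same direction, that is, the descriptions have similar meaning.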
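As an illustration, the formula above can be sketched in a few lines of Python. The three short vectors are hypothetical stand-ins for sentence embeddings (real embeddings have hundreds of dimensions), and the function name is an assumption made for this example, not part of the accelerator.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only.
battery = [0.9, 0.1, 0.3]
accumulator = [0.8, 0.2, 0.35]
garden_hose = [0.1, 0.9, 0.0]

print(cosine_similarity(battery, accumulator))  # close to 1: similar direction
print(cosine_similarity(battery, garden_hose))  # much lower: different direction
```

Because the score depends only on the angle between the vectors, two descriptions of different lengths can still score highly as long as their embeddings point the same way.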

2. Hamming Similarity (Categorical Attributes)

Hamming Similarity

Concept: This metric compares two products described by a set of categorical fields. It counts the fields on which the two products have the same value and divides by the total number of fields. If there are 4 categorical fields and 3 have the same value for both products, the similarity is 3/4 = 0.75.

Use: For categorical attributes, the Hamming similarity offers a way to determine the similarity of two product categories or hierarchies.
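The 3/4 example above can be reproduced with a minimal sketch. The field names and product values below are hypothetical, chosen only to mirror the example.

```python
def hamming_similarity(product_a, product_b, fields):
    """Fraction of categorical fields on which two products have the same value."""
    if not fields:
        raise ValueError("at least one categorical field is required")
    matches = sum(1 for field in fields if product_a[field] == product_b[field])
    return matches / len(fields)

# Hypothetical categorical fields and values.
fields = ["brand", "category", "color", "material"]
product_a = {"brand": "Acme", "category": "Power", "color": "black", "material": "plastic"}
product_b = {"brand": "Acme", "category": "Power", "color": "black", "material": "metal"}

print(hamming_similarity(product_a, product_b, fields))  # 0.75 (3 of 4 fields match)
```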

3. Mahalanobis Similarity (Numerical Attributes)

Mahalanobis Similarity

Concept: This metric measures the distance between numerical values after standardizing each attribute by its mean and standard deviation, so that numerical values are compared on the same scale. Through the covariance matrix, it also accounts for correlations between attributes.

D_M(x) = √((x − μ)ᵀ S⁻¹ (x − μ))
where x is a vector, μ is the mean of the distribution, and S⁻¹ is the inverse of the covariance matrix S.

Use: It is applied to numerical attributes and puts every attribute on the same scale before the similarity for numerical attributes is computed.
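The distance formula can be sketched without external libraries as follows. This is a simplified illustration, not the accelerator's implementation: the helper names, the sample covariance estimate (dividing by n − 1), and the price and weight values are assumptions. The text does not state how the distance is converted into a similarity score, so the sketch stops at the distance.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance_matrix(rows):
    """Sample covariance matrix S (assumption: divides by n - 1)."""
    n = len(rows)
    mu = mean_vector(rows)
    dim = len(mu)
    return [
        [sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
         for j in range(dim)]
        for i in range(dim)
    ]

def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z

def mahalanobis_distance(x, mu, cov):
    """sqrt((x - mu)^T S^-1 (x - mu)), computed by solving S z = (x - mu)."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    z = solve(cov, diff)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))

# Hypothetical numerical attributes per product: [price, weight].
products = [[10.0, 1.0], [12.0, 0.8], [11.0, 1.5], [30.0, 1.1]]
mu = mean_vector(products)
cov = covariance_matrix(products)
print(mahalanobis_distance(products[3], mu, cov))
```

With an identity covariance matrix the result reduces to the ordinary Euclidean distance, which is a convenient sanity check.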

5. Weighted Average Similarity

Concept: To combine the textual, categorical, and numerical similarity scores, a weighted average score is computed for each pair of products using the following formula:

Weighted Average Similarity = (wt·St + wc·Sc + wn·Sn) / (wt + wc + wn)

Where:

wt is the weight for textual attributes.
St is the similarity score for textual attributes.
wc is the weight for categorical attributes.
Sc is the similarity score for categorical attributes.
wn is the weight for numerical attributes.
Sn is the similarity score for numerical attributes.

Use: This method offers users the flexibility to prioritize certain descriptor types, ensuring the composite similarity aligns with specific contexts or preferences.
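A weighted average of the three scores can be sketched as below. The dictionary keys, the example weights and scores, and the normalization by the sum of the weights are assumptions made for illustration.

```python
def weighted_similarity(scores, weights):
    """Weighted average of per-type similarity scores.

    scores and weights are dicts keyed by attribute type; dividing by the
    sum of the weights keeps the result on the same scale as the scores.
    """
    total_weight = sum(weights[key] for key in scores)
    if total_weight <= 0:
        raise ValueError("the weights must sum to a positive value")
    return sum(weights[key] * scores[key] for key in scores) / total_weight

# Hypothetical scores for one pair of products, with textual attributes
# weighted twice as heavily as the other two types.
scores = {"textual": 0.9, "categorical": 0.75, "numerical": 0.6}
weights = {"textual": 2.0, "categorical": 1.0, "numerical": 1.0}

print(weighted_similarity(scores, weights))  # approximately 0.7875
```

Raising one weight pulls the composite score toward that attribute type's score, which is how a user would prioritize, for example, textual over numerical attributes.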

6. Graph Construction & Community Detection

Concept: The final composite similarities form the basis of a graph in which products are nodes and similarities define the edges between them. The Leiden algorithm, which originates from graph theory and network science, is then used to detect communities within this graph. It optimizes modularity, a measure of grouping quality, and guarantees well-connected communities, an improvement over the earlier Louvain method.

Use: Within the Product Similarity Accelerator, once the product graph is built, the Leiden algorithm identifies groups of similar products. These clusters contain products that are more similar to each other than to products outside their group, thereby defining 'similarity groups' of products.
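To make the graph and modularity ideas concrete, the sketch below builds a weighted graph from pairwise similarities and scores a candidate grouping by its modularity. It does not run the Leiden algorithm itself; it only computes the quantity that Leiden optimizes. The similarity threshold used to decide which edges to keep, the function names, and the product pairs are all assumptions for illustration.

```python
from collections import defaultdict

def build_graph(pair_similarities, threshold=0.5):
    """Keep pairs whose similarity reaches the (assumed) threshold as weighted edges."""
    return {pair: s for pair, s in pair_similarities.items() if s >= threshold}

def modularity(edges, community_of):
    """Weighted modularity Q of a partition.

    Q = sum over communities of (internal weight / m) - (community degree / 2m)^2,
    where m is the total edge weight of the graph.
    """
    m = sum(edges.values())
    if m == 0:
        raise ValueError("the graph has no edges")
    internal = defaultdict(float)
    degree = defaultdict(float)
    for (u, v), w in edges.items():
        degree[community_of[u]] += w
        degree[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Hypothetical composite similarities between four products.
pair_similarities = {
    ("A", "B"): 0.9, ("C", "D"): 0.8,
    ("A", "C"): 0.2, ("B", "D"): 0.1,
}
edges = build_graph(pair_similarities)

two_groups = {"A": 0, "B": 0, "C": 1, "D": 1}
one_group = {"A": 0, "B": 0, "C": 0, "D": 0}
print(modularity(edges, two_groups))  # higher: the grouping follows the strong edges
print(modularity(edges, one_group))   # lower: a single group is not rewarded
```

A community detection algorithm searches over many such partitions and keeps the one with the best score, rather than evaluating hand-written candidates as done here.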

Thus, users can extract meaningful clusters of similar products, offering deeper insights and aiding in data-driven decision-making processes.