Data Flow (Customer Insights)
Overview
In the Customer Insights Accelerator, a Data Load aggregates Datamart data by customer, product, and pricing month and stores the result in the AggregatedData Data Source. This improves dashboard performance because:
The system queries data in a smaller data set.
Some data, such as trend values and product and customer classifications, must be pre-processed before it can be shown on dashboards.
Note: Some charts and portlets query the Datamart directly rather than the AggregatedData Data Source, for example:
Customer Global View Dashboard: Customer Summary portlet
Customer Detail View: Customer Summary portlet, Waterfall chart
To improve performance, the processed aggregated data is limited to the last 12 months of transaction data.
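The aggregation described above can be pictured with a minimal sketch. It is not the accelerator's implementation: the row fields (customer_id, product_id, pricing_date, revenue), the helper names, and the interpretation of "last 12 months" as a window starting on the first day of the month 12 months before the current month are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def months_back(anchor: date, n: int) -> date:
    """Return the first day of the month that lies n months before anchor's month."""
    index = anchor.year * 12 + (anchor.month - 1) - n
    return date(index // 12, index % 12 + 1, 1)


def aggregate(rows, today, window_months=12):
    """Sum revenue per (customer, product, pricing month) inside the time window.

    Hypothetical row layout: dicts with customer_id, product_id,
    pricing_date (datetime.date) and revenue (float).
    """
    cutoff = months_back(today, window_months)  # assumed window definition
    totals = defaultdict(float)
    for row in rows:
        if row["pricing_date"] < cutoff:
            continue  # older than the window, so not aggregated
        key = (
            row["customer_id"],
            row["product_id"],
            row["pricing_date"].strftime("%Y-%m"),  # pricing month
        )
        totals[key] += row["revenue"]
    return dict(totals)


sample_rows = [
    {"customer_id": "C1", "product_id": "P1", "pricing_date": date(2024, 5, 3), "revenue": 100.0},
    {"customer_id": "C1", "product_id": "P1", "pricing_date": date(2024, 5, 20), "revenue": 50.0},
    {"customer_id": "C1", "product_id": "P2", "pricing_date": date(2024, 4, 1), "revenue": 10.0},
    {"customer_id": "C2", "product_id": "P1", "pricing_date": date(2022, 1, 1), "revenue": 999.0},
]
aggregated = aggregate(sample_rows, today=date(2024, 6, 15))
```

In this sketch the four sample transactions collapse into two aggregated rows, and the 2022 transaction is dropped because it falls outside the window. This mirrors why dashboards that read the aggregated data work on a smaller data set.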
Configurable Batching for Aggregation Data Load
Configurable batching for aggregation data loads allows you to adjust batch sizes according to your specific dataset and hardware. By configuring the batch size, you can optimize performance, avoid data load failures, and ensure accurate aggregation results.
How to Configure Batching for Aggregation Data Load
Navigate to Analytics > Data Manager > Data Loads > Customer Insights Aggregation and open the Calculation tab.
In the Batching Dimension field, select the parameter that will be used to split the data into batches.
Note: Use a customer attribute (customer type, customer group, …) to ensure all data for a particular customer is processed within a single batch.
Save the Data Load.
When you define a batching dimension, the aggregation data load creates one batch per value of the selected dimension. For each batch, only the data matching that batch is loaded, which reduces memory usage and improves performance.
Example of Batching
Batching Dimension = customer group
Batch01: customerGroup=A → batchFilter: customerGroup = A
Batch02: customerGroup=B → batchFilter: customerGroup = B
Batch03: customerGroup=C → batchFilter: customerGroup = C
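The example above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Data Load's actual code: the function name build_batches, the row layout, and the batch naming are assumptions, and the filter is rendered as plain text in the same form as the example.

```python
from collections import defaultdict


def build_batches(rows, batching_dimension):
    """Group rows by the value of the batching dimension, one batch per value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[batching_dimension]].append(row)

    batches = []
    for number, value in enumerate(sorted(groups), start=1):
        batches.append(
            {
                "name": f"Batch{number:02d}",
                # Text form of the filter that selects this batch's data
                "batch_filter": f"{batching_dimension} = {value}",
                "rows": groups[value],
            }
        )
    return batches


sample_rows = [
    {"customerId": "C1", "customerGroup": "A"},
    {"customerId": "C2", "customerGroup": "B"},
    {"customerId": "C1", "customerGroup": "A"},
    {"customerId": "C3", "customerGroup": "C"},
]
batches = build_batches(sample_rows, "customerGroup")
```

Because customerGroup is a customer attribute, both rows of customer C1 land in the same batch, which is the property the note on choosing a customer attribute is meant to guarantee.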
Best Practices
Try to choose a batching dimension that will produce relatively even-sized data slices over the last 12 months.
Aim for batches of approximately 100,000 transaction rows each to balance processing speed and system stability. The right size depends on your cluster setup, so if other batch sizes already work well for you, keep using them.
If you encounter data load timeouts, consider selecting a more granular batching dimension to create smaller, more manageable batches.
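One way to compare candidate batching dimensions before configuring the Data Load is to count how many rows each dimension value would produce. The sketch below assumes the last 12 months of transactions are available as a list of dicts; the function name, the row layout, and the idea of running such a check outside the platform are assumptions. The 100,000-row target comes from the guidance above, and the 1,000,000-row threshold is the batch size at which timeouts were frequently observed on the test dataset described under Limitations.

```python
from collections import Counter

TARGET_ROWS = 100_000          # approximate batch size suggested above
TIMEOUT_RISK_ROWS = 1_000_000  # size at which timeouts were frequently observed


def summarize_dimension(rows, dimension, timeout_risk_rows=TIMEOUT_RISK_ROWS):
    """Report how evenly a candidate batching dimension would split the rows."""
    sizes = Counter(row[dimension] for row in rows)
    if not sizes:
        raise ValueError("no rows to summarize")
    counts = list(sizes.values())
    return {
        "batches": len(counts),
        "smallest": min(counts),
        "largest": max(counts),
        "average": sum(counts) / len(counts),
        # Dimension values whose batch exceeds the timeout-risk threshold
        "oversized": sorted(v for v, n in sizes.items() if n > timeout_risk_rows),
    }


sample_rows = [{"customerGroup": group} for group in "AAAB"]
summary = summarize_dimension(sample_rows, "customerGroup", timeout_risk_rows=2)
```

A large gap between the smallest and largest batch, or any entry in the oversized list, suggests trying a more granular dimension. An average far below TARGET_ROWS suggests the opposite, since it implies a very large number of small batches.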
Limitations
On our test dataset of 41 million transaction rows (21 million of them from the last 12 months), we observed the following limitations:
A single calculation with more than 50,000 batches ran for approximately 7 hours and did not complete successfully.
Batches covering more than 1 million transaction rows frequently resulted in timeouts.