Data Mart Publishing in Rampur

The article discusses the changes in the way data marts (DMs) are managed in Rampur 13.0 and highlights these key points:

  • DMs now store all fields in their own tables, and the normalization setting is dropped.

  • The Refresh DL picks up new and updated rows from the constituent data sources, but this 'refresh data' is not visible to queries until the new DM Publish DL is run.

  • The published data table is designed to be read-only, taking advantage of column-store performance benefits, while the source data continues to be stored in regular row-oriented tables.

  • The impact on the Groovy and REST/JSON APIs is minimal, with a new method added to the DatamartContext interface to allow accessing the refresh data for enrichment jobs.

  • After upgrading to 13.0, the Publish DL is not automatically run or scheduled, and queries will continue to see the unpublished data until the Publish DL is first run.

Data Mart Issue Example

The typical schema for a Data Mart (DM) follows a snowflake design. In the example provided, the DM sources fields from the Invoice, Product, Product Hierarchy, and Customer Data Sources (DSs). The Product Hierarchy levels are indirectly linked to a transaction via the Product Group field.

When data is loaded into the Data Mart (DM) through the DM Refresh Data Load (DL) and optional PA Calculation DLs, not all fields are physically stored in the DM table in the PA DB. The number of fields sourced at query time from the constituent DSs depends on the normalization setting of the DM.

There are various drawbacks with this approach:

  • Data Mart (DM) data (more precisely, the result of a query on a DM) can change as new or updated rows are loaded into a Data Source (DS), if the modified data is in fields not persisted in (copied over to) the DM table.

NOTE: Intuitively, it might be expected that DM data is only modified when the Refresh DL is run, but as seen here, that is not necessarily the case.

  • When the Data Mart (DM) contains fields that are calculated by a PA Calculation Data Load (DL), there exists a time window in which newly loaded rows have not yet been enriched by the calculation(s). During this window, the DM can be considered to be in an inconsistent state.

Data Mart Issue Resolution

To address these issues, Rampur 13.0 takes a slightly different approach:

  • A Data Mart (DM) always stores all fields in its own table. The normalization setting is dropped.

  • The Refresh DL picks up all the new and updated rows from the constituent DSs, but when the DM is queried, this new data ('refresh data') is not yet visible. Only when a new type of DL is run - the DM Publish DL - is this data exposed to queries from Dashboards, Pricing Logics, etc.

With the new approach in Rampur 13.0, a specific cutover point can be chosen to make the new data available to DM clients. Additionally, PA calculations can enrich the refreshed data without impacting the results of DM queries from these clients.

Data Mart Performance

With Citus, the self-contained published data table is created as a columnar table, with compression, providing performance benefits for analytical workloads. The source data continues to be stored in regular row-oriented tables, which are far more suited for (regular) updates.

Transition from Pre-Rampur 13.0 Approach

When upgrading to version 13.0, the DM Publish Data Load (DL) is not automatically run or scheduled. There is no migration approach that can predict the desired behavior for each individual customer.

Prior to the completion of any DM Publish job, a query on the DM will return the exact same result as before the upgrade. Once the Publish DL has been executed for the first time, the query will utilize the published data exclusively.

From that point forward, a DM query may return a result that deviates from the loaded data, if the data has not yet been published.

Data Mart Enrichment

Enriching a Data Mart (DM) means populating placeholder fields, i.e. persisted fields defined in the DM but not sourced from any DS, by means of a PA Calculation DL. These fields are used when their values cannot be (easily) calculated using a forward expression (e.g. NetProfitMargin = 1 - Cost/InvoicePrice).
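For example, with hypothetical values (shown here as Groovy arithmetic for illustration; an actual forward expression is defined on the DM field itself, not in a logic):

def invoicePrice = 120.0   // hypothetical transaction values
def cost = 90.0
def netProfitMargin = 1 - cost / invoicePrice   // = 1 - 0.75 = 0.25, i.e. a 25% margin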

In the Clover Club 12.x release and before, it is possible to change fields that are sourced from a Data Source (DS) in a PA Calculation. The backdoor approach is to use the DatamartRowSet API:

api.getDatamartRowSet("target").addRow(<field values as map>)

This way, any field that is stored in the Data Mart (DM) table can be modified, but it's clear that this is an ill-advised design: since data in a DM is initially loaded from Data Sources, modifying this data means that queries can produce significantly different results at different times.

Additionally, this approach would make it unclear to end users what a chart or data table in a dashboard represents. Furthermore, the DM fields that can be populated in this manner depend on the normalization of the DM, which adds to the confusion, as the determination of that set of fields is a technical implementation detail that is not guaranteed to be consistent across different releases.

NOTE: For these reasons, the Rampur 13.0 release will not support this scenario.

Data Mart Loading

Another unfortunate design could be that the rows in a Data Mart (DM) are generated by a PA Calculation job. Again, this would be ill-advised since a Data Mart (DM) is meant to be populated by its DM Refresh DL.

But if the refresh is never run, and the DM has been configured with ‘None’ normalization, then this approach could yield the desired result.

This is not an approach we want to support in Rampur 13.0 and beyond. The correct approach will be to populate the DM’s main DS instead.

Data Mart Publishing and Groovy API

The impact on the Pricefx Groovy API is minimal since it is assumed that all clients of the Data Mart (DM) intend to use the published data only. However, there is one exception regarding PA Calculation/Enrichment jobs, since these need to have access to the newly loaded ('refresh') data.

For this reason, one method (getDatamart) is added to the DatamartContext interface:

/**
 * Gets a table object representing a Datamart with the given name, using either its refresh or published data,
 * depending on the value of the useRefreshData argument.
 * This method is intended to be used in jobs that enrich/transform DM data, requiring access to data that
 * has been loaded (through the DM's Refresh DL), but has not yet been published (by the DM's Publish DL).
 * A reference to this table can be used when building a {@link Query} on that Datamart.
 * @param name sourceName, uniqueName or label of the DM.
 * @param useRefreshData whether to use the refresh data (true) or the published data (false).
 * @return Table representing the DM in DataContext.
 */
Table getDatamart(final String name, Boolean useRefreshData);
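To illustrate, the sketch below shows how an enrichment logic might combine this method with the DatamartRowSet API shown earlier. It is a minimal sketch only: the DM name (TransactionsDM), its field names, and the use of InvoiceId as the key field are hypothetical, and null/zero guards are omitted for brevity.

def ctx = api.getDatamartContext()

// Query the refresh (not-yet-published) data of the hypothetical TransactionsDM.
def dm = ctx.getDatamart("TransactionsDM", true)   // true = use the refresh data
def query = ctx.newQuery(dm, false)                // row-level query, no rollup
query.select("InvoiceId")                          // hypothetical key field
query.select("InvoicePrice")
query.select("Cost")
def result = ctx.executeQuery(query)

// Write the enriched placeholder field back through the DL's target rowset.
def target = api.getDatamartRowSet("target")
result?.getData()?.getEntries()?.each { row ->
    target.addRow([
        InvoiceId      : row.InvoiceId,
        NetProfitMargin: 1 - row.Cost / row.InvoicePrice
    ])
}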

Data Mart Publishing and REST/JSON API

While we do not expect external clients to want to access not-yet-published Data Mart (DM) data, there is of course one exception: a data manager will want to see the refresh data after it is loaded and before it is (optionally) fully enriched and published.

Therefore, we allow this with a URL param in the datamart.fetchdata endpoint:

pricefx/mb1/datamart.fetch/107.DM?refreshData=true

Data Mart Publishing and Rampur Related Upgrades

It is not mandatory to run or schedule the Publish DL after upgrading to Rampur 13.0. However, as mentioned previously, as long as the Publish DL is not executed, the system-generated Publish DL will remain in the DRAFT state, and Data Mart (DM) queries will continue to read the unpublished data (now also referred to as the refresh or staging data), as they did before.

When a Data Mart (DM) is refreshed, any new and modified data will immediately be reflected in query results. The same is true when the DM is truncated.

After upgrading to Rampur 13.0, there is an incentive to publish DM data, as the use of Citus is expected to significantly improve query performance. The main reason for not running or scheduling the Publish DL immediately when upgrading is that this additional step needs to be properly integrated into the customer's existing data load workflow.
