How to Configure a Distributed Data Source or Datamart

Version 13.0 introduces distributed Data Sources and Datamarts. This article describes the configuration settings required to switch to the distributed database.

Note: This feature is intended for very large Data Sources and Datamarts and is enabled on request via a support ticket.

To configure a Datamart or Data Source as distributed, follow these steps:

  1. In the Unity user interface, navigate to Analytics > Data Manager > Data Sources or Datamarts.

  2. Open a Data Source or Datamart and click the Import & Export button.

  3. In the JSON definition, locate the desired distribution field and (as shown in the example after these steps):

    1. Make sure it is defined as a key – "key": true.

    2. Set the "distributionKey" property to true.

  4. Click Apply.
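
For illustration, the relevant field entry in the JSON definition might then look like the following fragment. The field name productId and the label are placeholders and will differ in your definition; only the "key" and "distributionKey" properties are the settings described in step 3:

    {
      "name": "productId",
      "label": "Product ID",
      "key": true,
      "distributionKey": true
    }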

When a Datamart source, i.e., a Data Source, is distributed, the Datamart is automatically distributed on the same key – on deployment, or on rebuild if it was previously deployed.

New Data Sources / Datamarts

When a Data Source or Datamart has not previously been deployed, i.e., there is no corresponding table in the database, then a distributed table is created for it on the first deploy. In new implementations, it is recommended to choose the distribution key and configure it in the necessary Data Sources and Datamarts from the beginning.

Distribution Key Requirements

  • Key Field Designation – The distribution key should be a key field of the DS/DM. Typically, either the product ID or customer ID field is chosen as the distribution key.

  • Consistency Across DS and DM – To maximize performance, the field used as the distribution key in the DM must also be defined as the distribution key in the DS from which it is sourced (see the sketch after this list).

  • Recommendation for Product DS – Unless the Product DS is very small, it is recommended to define the productId key field as a distribution key. This ensures optimal performance during data loading and querying.
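
To illustrate the consistency requirement, the following hedged sketch shows the same hypothetical customerId field carrying both flags in the Data Source definition:

    { "name": "customerId", "key": true, "distributionKey": true }

and in the Datamart definition, for the same sourced field:

    { "name": "customerId", "key": true, "distributionKey": true }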

Existing Data Sources / Datamarts

When a table already exists in the database, it is not automatically converted to a distributed table on deployment after the Data Source / Datamart is reconfigured with a distribution key. Distributed tables normally hold large amounts of data, and altering such a table on the fly in a UI transaction would very quickly time out.

Instead, the table is rebuilt as a new, distributed table when the IndexMaintenance job for the Data Source / Datamart is run in non-incremental mode.

Upgrade to 13.0

The Publishing Data Load is not automatically created when upgrading to 13.0. Instead, it is created when the reconfigured Datamart is redeployed.

Before the Publishing Data Load is run, a query on the DM returns exactly the same result as before the upgrade. Once the Publishing Data Load has run for the first time, queries use the published data only. From this moment on, a Datamart query can return a result that deviates from the loaded data if that data is not yet published.

It is not required to run or schedule the Publishing Data Load after upgrading to 13.0. As previously mentioned, as long as it is not run, i.e., the system-generated Publishing Data Load remains in the DRAFT state, Datamart queries find the unpublished data (now also called refreshed or staging data), as before. When a Datamart is refreshed, any new and modified data is immediately reflected in query results. The same applies when the Datamart is truncated.

After the Publishing Data Load has been run, this behaviour changes. From this moment on, the only way to expose changes to the data is to re-publish the data, including when truncating the Datamart.

After upgrading, there is an incentive to publish the Datamart data, as with Citus the query performance should be much improved. The main reason for not running or scheduling the Publishing Data Load when upgrading is that this additional step needs to be appropriately fitted into your data load flow.

Datamart Enrichment

Enriching a Datamart means populating placeholder fields, i.e., fields not sourced from any Data Source, using a Calculation Data Load; this is necessary for fields that cannot easily be calculated with forward expressions. In versions 12.x and earlier, fields sourced from a Data Source could also be changed using the DatamartRowSet API, but this led to inconsistent query results and user confusion. Therefore, version 13.0 does not support modifying Datamart fields this way.

Datamart Loading

Generating Datamart rows using a Calculation job is not a recommended configuration, as Datamarts should be populated by their Refresh Data Load. Though this might work with 'None' normalization as long as the refresh is never run, this approach is not supported in version 13.0 and later. The correct method is to populate the DM's main DS.

Groovy API

The impact on the Groovy API is minimal, as it is assumed that all DM clients intend to use only the published data, with one exception: the Calculation/Enrichment job, which requires access to the newly loaded ('refresh') data. Therefore, a single method is added to the DatamartContext interface to accommodate this need:

getDatamart(String name, Boolean useRefreshData)
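
For illustration, a minimal sketch of how an enrichment logic might use this method. The Datamart name "TransactionsDM" and the selected field are assumptions; the surrounding query pattern is the usual DatamartContext usage:

    def ctx = api.getDatamartContext()
    // Pass true to query the newly loaded ('refresh') data instead of the published data
    def dm = ctx.getDatamart("TransactionsDM", true)
    def q = ctx.newQuery(dm)
    q.select("invoicePrice")   // hypothetical field name
    def result = ctx.executeQuery(q)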

REST/JSON API

While external clients are not expected to require access to unpublished Datamart data, there is one exception: a data manager user may want to see the refresh data after it is loaded and before it is (optionally) fully enriched and published. This is allowed with a URL parameter in the datamart.fetch endpoint, for example:

pricefx/customer/datamart.fetch/107.DM?refreshData=true
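
As a hedged sketch, such a request could be issued with curl; the host, credentials, and the customer partition name are placeholders, and the actual authentication method depends on your environment:

    curl -u jsmith:secret "https://node.example.com/pricefx/customer/datamart.fetch/107.DM?refreshData=true"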

Normalization

Normalization is discontinued in version 13.0 and the setting is no longer available on the Datamart page. Whether you migrate to Citus DB or not, the Datamart table structure in the database is the same and does not use normalization any more.
