Distributed Data Marts and Data Sources in Rampur

In the Rampur release (v13), within application clusters integrating Citus into the Postgres PA DB, users will have the capability to establish distributed tables for storing Data Source (DS) and Data Mart (DM) data. In Citus, a distributed table necessitates the assignment of a single column as the designated distribution key.

When selecting the distribution column, it is important to consider certain limitations and desired characteristics to ensure optimal performance and efficiency.

  • values in this column cannot be NULL

  • a PK on a distributed table requires the distribution column to be part of the unique constraint definition

  • should have high cardinality and an even distribution of values

  • a column often used in joins and filter clauses benefits query performance

For these reasons, it was opted to require that the distribution key be a key field of the DS/DM. Even when the DS or DM does not strictly require the field to be defined as key, specifying it as such should always be possible.

NOTE: It is expected that in all but a few exceptions, either the product ID or customer ID field will be picked as the distribution key.

LEARN MORE: For more information on Citus, click here.

Using Distributed Datamart

To achieve optimal performance from a distributed Data Mart (DM), it is essential to ensure that the field serving as the distribution key is consistently defined as the distribution key in the corresponding Data Source (DS) from which the data is derived.

For example, when a Transaction DM defines productId as the distribution key, and the DM sources this field from the Transaction DS, then that DS needs to also define productId as the distribution key. If not, a validation error will be returned when deploying or querying the DM.

In addition, though optional, unless the Product DS is trivially small, it is recommended to define the productId key field there also as a distribution key. This schema ensures the greatest benefit when loading and querying the DM.

How configure a DS or DM to be distributed?

In the user interface (UI), perform the following:

  1. Proceed to the Data Source (DS) or Data Mart (DM) section within Analytics → Data Manager.

  2. Access the JSON definition by selecting Import & Export.

  3. Locate the intended distribution field and verify that it is designated as a key.

  4. Set the "distributedKey" property to true.

  5. Finally, click on Apply and save the changes made to the DS/DM.

image-20240408-115427.png

Image Summary: A basic data management capability focusing on importing and exporting data definitions in JSON format. Some key components of this screen:

  • The JSON code inside the pop-up window defines various parameters for a data source named "Products".

  • Users can copy the JSON template, modify it as needed, and then apply it to configure their data source.

  • Provides options to add or modify fields, as well as import and export these configurations.

When a Data Mart source (i.e., a Data Source) is distributed, the Data Mart will automatically be distributed on the same key during deployment, and if it was previously deployed it will need to be rebuilt.

New Data Source or Data Mart

When a DS/DM has not previously been deployed, i.e. no table for it exists in the PA DB, then a distributed table is created for it on the first deploy. In new implementations, it is recommended to choose the distribution key and configure it in a necessary DSs and DMs from the start

Existing Data Source or Data Mart

When a table is already present in the PA DB, it does not undergo automatic conversion to a distributed table upon deployment, even if the Data Source (DS) or Data Mart (DM) is reconfigured to have a distributed key.

Instead, the table is rebuilt as a new and distributed table, when the IndexMaintenance job for the DS/DM is run in non-incremental mode. If the distribution key configuration was modified, but this job has not yet been run, the server log will reflect this with clues such as

  • "NEED TO REBUILD TABLE <table name> for the change in storage type/partitioning key/distribution key to take effect !!!"

  • "Distribution key changed from 'not distributed' to 'distributed by ProductID:attribute16'".

Found an issue in documentation? Write to us.