
The primary factor to consider when migrating Pricefx's PA solution from the legacy Greenplum deployment to the Citus-based architecture is the customer's existing database infrastructure. Where the PA data volumes have been assessed as exceeding the capabilities of a standard Postgres deployment, the Greenplum database was used to distribute the workload across multiple hosts and across the CPU cores within each host.

This distributed database architecture can also be achieved with Citus's distributed table functionality. However, it is crucial to recognize that while Citus is supported within the NextGen environment, Greenplum is not. This distinction is a key driver for migrating from the legacy Greenplum setup to the more robust and versatile Citus-based architecture available in the NextGen platform.

By transitioning to the Citus solution in NextGen, customers can benefit from a more streamlined and reliable database infrastructure capable of meeting the complex and dynamic requirements inherent to Pricefx's PA ecosystem.

Rampur Upgrade Flowchart

The following illustration depicts the flowchart for the upgrade to Rampur version 13 and the subsequent steps, based on a variety of conditions.

Citus_Greenplum_upgrade_13_flow_chart.png

Rampur Upgrade Flowchart Steps

Here is a detailed breakdown of the flowchart:

  1. Start

  2. Upgrade to 13

  3. Using Greenplum?

    • Yes: Move to the next decision point.

    • No: End of the process.

  4. Migrate to NextGen?

    • Yes: Create Citus DB Cluster, then Identify large DS/DMs, followed by Configure Distribution Keys, and finally Done.

    • No: Move to the next decision point.

  5. Use DM Publishing?

    • Yes: Schedule DM Publishing DL(s), then Done.

    • No: Rebuild distributed DS/DMs, then Done.

Rampur Upgrade Process Insights

  • Upgrade Path: The process starts with an upgrade to version 13.

  • Greenplum Usage: The first decision point checks whether Greenplum is being used. If not, the process ends.

  • NextGen Migration: If Greenplum is used, the next decision point is whether to migrate to NextGen. If migrating, then a Citus DB Cluster is created, large DS/DMs are identified, and distribution keys are configured.

  • DM Publishing: If not migrating to NextGen, the next decision checks for DM Publishing usage.

    • If using DM Publishing, the DM Publishing DL(s) are scheduled.

    • If not using DM Publishing, distributed DS/DMs are rebuilt.

This flowchart provides a clear and structured approach to handle database upgrades and migrations based on specific conditions and requirements.

Additional PA Considerations for Rampur

Migrate to NextGen

The rationale for migration is that the Citus database solution employed in the NextGen environment is considered a more robust and capable option compared to the legacy Greenplum deployment. Greenplum, while functional, represents a more complex database system that requires extensive configuration and tuning efforts to ensure optimal performance across the wide-ranging and often dynamic requirements of Pricefx customers.

These customer-specific demands can encompass varied PA data schemas, significant data volumes, diverse reporting and dashboard queries, as well as the intricate pricing logic governing quotes, agreements, and batch processing workflows. Migrating to the NextGen platform with its Citus-based architecture provides a more streamlined and reliable database solution capable of meeting these complex operational needs.

Create Citus DB Cluster

The initial Citus cluster configuration for the migration involves a single Coordinator node paired with two Worker nodes. In this setup, all of the existing PA data is first migrated to the Coordinator node, which must be provisioned with sufficient computing resources to accommodate this data payload.

The Worker nodes, in contrast, will only house the data for the distributed tables. When a new distributed table is created, the data is removed from the Coordinator and instead populated across the Worker nodes through the execution of the IndexMaintenance job. This architectural design ensures an optimal distribution of the PA data workload across the Citus cluster.

LEARN MORE: To learn more about this process, click here.
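
For illustration, registering the two Worker nodes with the Coordinator in Citus looks roughly like the statements below. In practice the cluster is provisioned for the partition by Pricefx operations; the host names and port used here are hypothetical examples.

  -- Run on the Coordinator; host names and port are hypothetical.
  SELECT citus_add_node('citus-worker-1', 5432);
  SELECT citus_add_node('citus-worker-2', 5432);

  -- Verify that both Workers are active and visible to the Coordinator.
  SELECT * FROM citus_get_active_worker_nodes();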

Identify Large Data Sources (DS) and Data Marts (DM)

Unlike the legacy Greenplum deployment, the Citus-based migration does not automatically distribute the data across all Data Sources (DSs) and Data Marts (DMs). Instead, a more selective approach is adopted, as most tables, given their total row counts, do not stand to benefit from distributed data storage.

In fact, indiscriminate distribution can potentially have a detrimental impact on performance. Therefore, a careful evaluation is required to identify the specific DSs and DMs that would realize tangible performance gains from a distributed data architecture.

Selecting the specific Data Mart (DM) tables that would benefit from distributed data storage can be somewhat subjective, as the decision depends on factors such as the total row count of each table.

For example, the Company X production environment has been utilizing Greenplum due to performance limitations encountered with a 50 million row DM on a standard Postgres deployment. This was attributed to the relatively high number of Data Sources (DSs) and data fields that comprise the DM schema.

As a general guideline, tables exceeding 30 million rows could be considered candidates for distribution, though tables with over 100 million rows would more typically warrant this architectural approach. However, the specific thresholds should be determined based on an evaluation of the unique data volumes and schema complexity within each customer's PA environment.
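
As a rough, hedged way to shortlist candidates, the approximate row counts of the underlying PA tables can be read from the Postgres planner statistics, for example as below. The schema name and the threshold are illustrative assumptions only; the actual DS/DM table names depend on the partition's PA configuration.

  -- Approximate row counts from planner statistics (run ANALYZE first for fresher numbers).
  -- The schema name 'pa' and the 30 million row threshold are illustrative assumptions.
  SELECT c.relname           AS table_name,
         c.reltuples::bigint AS approx_rows
  FROM   pg_class c
  JOIN   pg_namespace n ON n.oid = c.relnamespace
  WHERE  n.nspname = 'pa'
    AND  c.relkind = 'r'
    AND  c.reltuples > 30000000
  ORDER  BY c.reltuples DESC;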

When choosing which tables to distribute, we also need to consider the dependency between a DM’s distribution key and that of its constituent DS(s). See the next step.

Configure Distribution Keys

This concept can be best illustrated through an example. When starting with a large Transactions Data Mart (DM), the obvious choice for the distribution key would typically be the sku or productId field. This is because the out-of-the-box functionality in Pricefx's PA solution is often oriented around product-centric data and workflows.

Additionally, it is quite common for the Product Master data to be significantly larger in scale compared to the Customer Master, for instance containing 100,000s of products versus 10,000s of customers. There also tend to be more product-related attributes sourced from the Product Data Source (DS) or other configured product-focused DSs, further supporting the utilization of a product-based distribution key.

NOTE: This is not a universal rule, as the optimal distribution key can vary based on the specific nuances of each customer's data landscape.

Distribution Key Examples

For example, in the case of the Company X deployment, the reverse scenario was true, with the customer data (reflected in the customerId field) comprising the more suitable distribution key for the Transactions DM.

Conversely, for the Company Y Transactions DM, the productId was determined to be the more appropriate distribution key, given the larger product data set (~1.8 million products versus ~640,000 customers) and the more voluminous secondary product-keyed DSs feeding into the DM.

These examples illustrate the importance of carefully evaluating the unique data characteristics and relationships within each customer's PA environment to identify the most suitable distribution key for the DMs.
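
One simple, hedged check when weighing candidate keys is to compare their distinct-value counts in the transactions data, since a low-cardinality key would concentrate the rows on only a few shards. The table and column names below are hypothetical.

  -- Compare the cardinality of candidate distribution keys.
  -- 'dm_transactions', 'productid' and 'customerid' are hypothetical names.
  SELECT count(DISTINCT productid)  AS distinct_products,
         count(DISTINCT customerid) AS distinct_customers
  FROM   dm_transactions;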

Distribution Key Configuration Summary

Once the optimal distribution key has been identified, it is crucial that this configuration is applied consistently across both the Data Mart (DM) and its corresponding primary Data Source (DS).

Deploying a DM without a defined distribution key, or implementing a distribution key that differs from the one configured for its constituent Data Sources, will result in a validation error. Similarly, modifying the distribution key of a DS will automatically render the DM invalid, causing any queries or jobs executed against the DM to fail unless the DM configuration is realigned to match the DS changes.

It is also important to note that the distribution key must be an actual key field within the schema of both the DS and DM in order to ensure referential integrity and proper data distribution across the Citus cluster.

Careful coordination of the distribution key settings across the DMs and their primary DSs is essential to maintain the integrity and operational reliability of the Pricefx PA solution.
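
In Pricefx the distribution key is set in the DM and DS configuration and applied by the IndexMaintenance DL, but the end result on the Citus side is conceptually equivalent to distributing both tables on the same column, as in this hypothetical sketch:

  -- Conceptual equivalent of what the IndexMaintenance DL applies; names are hypothetical.
  -- Distributing the primary DS and the DM on the same column keeps related rows
  -- co-located on the same Worker, so joins and aggregations stay local.
  SELECT create_distributed_table('ds_transactions', 'productid');
  SELECT create_distributed_table('dm_transactions', 'productid');

  -- Inspect the resulting distribution metadata.
  SELECT table_name, distribution_column FROM citus_tables;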

Rebuild distributed DS/DMs

It is important to note that when an existing Data Source (DS) or Data Mart (DM) is configured to be distributed, simply deploying the new configuration does not automatically convert the underlying database table structure. Instead, an additional step is required to physically rebuild the table to align with the distributed architecture.

For this purpose, the corresponding IndexMaintenance Data Load (DL) must be executed in a non-incremental mode. This non-incremental mode ensures that the table is completely rebuilt from the ground up, even in cases where the distribution key configuration has not been modified.

It is worth noting that this IndexMaintenance DL may be renamed in a future product release to better reflect its intended purpose of converting standard tables to a distributed format, as opposed to its current naming which could be interpreted as solely focused on index maintenance.

This rebuild process is a crucial step in the migration workflow, as it transforms the existing DS or DM tables to leverage the distributed storage and processing capabilities provided by the Citus database architecture.

LEARN MORE: To learn more about this process, see /wiki/spaces/EN/pages/5045256196.

When a table is rebuilt, a new table is created and populated from the original; the original is then dropped and the new table renamed. In this sequence, the data in the original table stored on the Coordinator is deleted only after it has been distributed over the shards of the new, distributed table.

This same method can be used when removing or changing the distribution key of a DS/DM.
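
A simplified sketch of this rebuild sequence, with hypothetical table and key names, is shown below; the non-incremental IndexMaintenance DL performs the equivalent steps internally.

  -- Simplified sketch of the rebuild sequence; the IndexMaintenance DL handles this internally.
  -- 1. Create a new table with the same columns and distribute it.
  CREATE TABLE dm_transactions_new (LIKE dm_transactions);
  SELECT create_distributed_table('dm_transactions_new', 'productid');

  -- 2. Copy the data; the rows are routed from the Coordinator to the Worker shards.
  INSERT INTO dm_transactions_new SELECT * FROM dm_transactions;

  -- 3. Drop the original, Coordinator-stored table and rename the new one.
  DROP TABLE dm_transactions;
  ALTER TABLE dm_transactions_new RENAME TO dm_transactions;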

Use DM Publishing

Why use this? There is a functional case and a performance-based one. For a detailed explanation of what DM Publishing entails, see the link below.

LEARN MORE: To learn more about DM publishing functional and performance cases, click here.

When leveraging the Citus database solution, significant performance benefits can be realized through the utilization of the Publish DM database table. This is attributed to the column-oriented structure of the Publish DM table, coupled with the data compression capabilities inherent to its design.

NOTE: The column-oriented storage approach, in contrast to traditional row-oriented layouts, enables more efficient data retrieval and processing, particularly for analytical workloads common in Pricefx's PA ecosystem. Furthermore, the compressed data format reduces the overall storage footprint and memory requirements, further contributing to enhanced system performance.

By taking advantage of these Citus-specific features within the Publish DM table, customers can expect to achieve notable performance improvements when querying and interacting with the published PA data.

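As a minimal, hypothetical illustration of the storage format involved, a Citus columnar table is declared with the columnar access method; its data is stored column by column and compressed. The table and column names are hypothetical.

  -- Minimal illustration of Citus columnar storage; table and column names are hypothetical.
  CREATE TABLE dm_transactions_published (
      productid    text,
      customerid   text,
      invoiceprice numeric
  ) USING columnar;

  -- Columnar tables compress well; check the on-disk footprint after loading data.
  SELECT pg_size_pretty(pg_total_relation_size('dm_transactions_published'));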

Scheduling DM Publishing DL(s)

Once a DM’s Publishing DL has run for the first time, any client or logic querying the DM data will see only this published data. New or modified data loaded into its DSs will not show until after the next Publishing DL run. This new DL therefore needs to be scheduled appropriately into the overall PA data load sequence.

Note also that the window between the DM Refresh and DM Publishing is where enrichment and transformation of the DM data is done, by means of (Distributed) Calculation DL(s).
