Data Loads
A Data Load represents a task/process for data manipulation between Analytics objects, such as uploading data from Data Feed to Data Source, deleting rows from Datamart or calculation of new values of the fields.
Most Data Loads are created automatically (when you deploy a Data Source or Datamart) but you can also create them manually (e.g., a calculation Data Load to manipulate data).
Data Loads provide the following actions:
Type | Description | Available for |
---|---|---|
Purge Cache | Clears the cache. The cache applies to queries on Data Sources and Datamarts, from the UI and from logics. There is no expiration, but it is limited in size (default is 1000 entries / partition). It is a LRU (Last Recently Used) cache. It gets invalidated after a data change (load, flush, refresh). Cache purging is a recovery mechanism. |
|
Truncate | Deletes (all/filtered) rows in the target. Note: When a Data Source is deployed, the Truncate Data Load of the linked Data Feed is updated with a filter to include only rows previously successfully flushed to Data Source and it is scheduled to run once a week. This applies only if there is no other filter or schedule already defined. Incremental mode is no longer available for Truncate jobs. For older jobs (created before upgrade to Collins 5.0 release) where this option was enabled, it will stay enabled. If you disable the Incremental option the check-box will become non-editable and you will not be able to enable the option again. For Data Loads saved with the Incremental option off, the check-box is completely hidden. |
|
Upload | Loads an XLS/XLSX sheet or a CSV/TXT file. Supports scientific notations of numbers: 1e02, 1E02, 1e+02, 1E+02, 6e-04, 6E-04. |
|
Mass edit | Allows you to apply mass changes to editable attributes. See Common Table Actions for details. |
|
Flush | Copies data from the Data Feed into the Data Source. It can also convert values from string to proper data types set in the Data Source. It can copy everything or just new data (i.e., incremental Data Load). |
|
Refresh | Copies data from Data Sources (configured in the Datamart fields) into the target Datamart. It can copy everything or just new/modified data (i.e., incremental Data Load). If a source filter is set on the Source tab, only the filtered rows are loaded to the Datamart. Rows that do not meet the filter’s condition and are present in the Datamart are automatically deleted from the Datamart when the Refresh job is run. If you want to run a non-incremental refresh but avoid the costly merging of almost the same data, you can truncate the DM first – set the advanced configuration option 'truncateDatamartOnNonIncRefresh' to true. Note: Since Godfather 8.1, rows updated during Refresh behave differently: their calculated fields are cleared to NULL instead of being persisted. For details see the release notes. |
|
Calculation | Applies a logic (defined in Configuration) to create new rows, or change/update values in existing rows in the target Data Source or Datamart. The calculation can take data from anywhere, e.g., Master Data tables. Example usage:
|
|
Calendar | Generates rows of the built-in Data Source "cal" and you get a Gregorian calendar with US locale. (If you need any other business calendar, just upload the data into the "cal" Data Source from a file or via integration and do not use this Data Load). | |
Customer | Special out-of-the-box Data Load which copies data from the Master Data table "Customer" into the Data Source "Customer". | |
Product | Special out-of-the-box Data Load which copies data from the Master Data table "Product" into the Data Source "Product". | |
Simulation | Applies a logic to the data as defined by the simulation for which the Data Load was created. |
|
Internal Copy | Copies data from a source into the Data Source table. The source here can be:
The easiest way to create this type of Data Load is to create a new Data Source from Template and deploy it; this automatically creates the Data Load and pre-fills the columns. The incremental mode in Internal Copy tasks is not exactly the same as in the Refresh or Calculation type. Here, incremental means the Data Source will not be truncated before the copy, i.e., it will keep old records instead of being a true copy. |
|
Index Maintenance | This task can be run to repair indexes associated with the target Data Source or Datamart, typically after backend DB migration. The task should be run only in these special circumstances, not on a regular or scheduled basis. We also strongly recommend consulting Pricefx support before you run this task. |
|
Distributed Calculation | Allows you to split a large set of data into batches and process them independently. See Distributed Calculation in Analytics for details. |
|
Publishing | Publishes data after Refresh which makes new data accessible by queries. It is system generated and cannot be created manually. |
|
On the Data Loads page, Data Loads are by default grouped by Data Load type on the first level and target type on the second level. The Target column indicates for which object the Data Load is used.
In this list, the Delete button only shows if all selected Data Loads can be deleted, i.e., they are either invalid (e.g., their target object is deleted), or they are not the default Data Load created by the system. (These system generated Data Loads cannot be deleted by a user, only if they are invalid.)
Click on a Data Load's label to display the job details. At the top there are buttons to run the Data Load manually or cancel the load. (Data Loads have a default timeout of 48 hours to accommodate even large and complex jobs. If needed, the jobs can be cancelled here.)
Depending on the Data Load type, you will find here some of these sections:
Options – Allows you to make Data Load settings:
Target Datamart – Specifies the target of the data coming from the Data Load operation.
For the Flush operation, the target is one of the Data Sources.
For Refresh, the target is one of the Datamarts.
For Calculation, the target is either Data Source or Datamart which you would like to update or enrich with columns.
+ Add Filter – You can filter data on the Target tab using the Advanced Filter.
Data Source – Specifies the source of data.
For the Flush operation, it will be a Data Feed.
For Refresh it is not specified because there can be more sources for one target Datamart. Those source Data Sources are specified in the columns definition of the Datamart.
+ Add Filter – You can filter data on the Source tab using the Advanced Filter.
Allow batching – Large amounts of data (more than 5 million rows) can be processed in batches. By default, batch processing is enabled for Flush operations and disabled for Calculation. You can override this setting here. (The batch size is an instance parameter and the default value is 2 million rows. There must not be any dependencies between rows belonging to different batches.)
Since version 13.3 - Rampur, when the Allow Batching option is set on a Calculation Data Load (DL.allowBatching
) that is configured with feeder logic, thePACalculationTask
will accumulate the target rows produced by the logic across feeder items until a batch is filled. Only then are the rows uploaded to the target. Previously, the rows were uploaded after the evaluation of each feeder item.Validation Logic – You can select a validation logic that will validate the target data after refresh. The following rules apply:
The target Datamart name is available in the validation logic through the "dmName" binding variable:
def dmName = api.getBinding("dmName")
Use the name to query for the Datamart's data and apply custom validation rules.
Raise a warning with a custom validation message when a validation rule is not satisfied:
api.addWarning("Missing value in field1")
When api.addWarning() is invoked, the data validation is considered as failed and the Data Load's status is set to Error. Note that the validation logic result does not affect the Data Load process itself as it is run after the Data Load is completed.
Validation messages passed from validation logic are present among the Data Load's calculation messages and can be viewed in the UI.
Overview – Summarizes the basic information on the Data Load.
Schedule – Here you can also schedule the Data Load manually (described below). The Job/Task Tracking section at the bottom shows the status for each task of the Data Load.
Target – Displays the complete target data set. The available options are the same as for Data Sources. View preferences are not available in this table. Instead, quick and advanced filter and sorting are saved automatically when you save the Data Load.
Source – Displays the complete source data set. The available options are the same as for Data Sources. View preferences are not available in this table. Instead, quick and advanced filter and sorting are saved automatically when you save the Data Load.
Calculation – Specifies the logic to be executed. Such logic is set up in Administration > Logics > Analytics. It can manipulate the data coming from the Source in many ways, e.g., filter the incoming rows from Source, create new lines for Target, modify/enrich/transform the data being copied from the Source into the Target. Click Default Formula to open the currently active Analytics default logic in an editor.
If you leave the Target Date field empty, the calculation will use "today" as the target date.Compare, Staging, Published – In Publishing type, these tabs display unpublished, refreshed data (Staging), published data and their comparison.
Schedule a Data Load
To schedule a Data Load:
On the Schedule tab, click the Add Task button.
Enter a Start Date.
Optionally, select a Time Zone. This ensures that the job will always run at the selected local time even after switching to/from the daylight saving time. Check the Next Run column to verify that your setting is correct.
The reason for making the setting here is that the Default Timezone option in General Settings sets only the offset from UTC. This offset and the UTC remain the same when your time zone switches to/from the DST. It means that if you set the start time, for example, to 10 AM CET in the winter (UTC+1) then after switching to the DST (UTC+2), your job will run at 9 AM CEST because your offset from the UTC remains +1.Enter a Period (the load will run every X minute, hour or day).
Enter an Interval representing the number of repetitions in the selected period (e.g., if you set Period = day and Interval = 1 and Start Date = 27/11/2015 10:00, the Data Load will run every day at 10:00 AM, starting on 27/11/2015). You can also enter "0" for jobs that you want to run just once.
If the Incremental option is checked, only new/changed data is loaded.Load Date indicates the last time this task ran successfully. Only rows loaded/updated since this time are considered in the Data Load. The Load Date can be edited to force in data older than the last successful run date.
Enter the name of the task.
For Calculation Data Load type, you can also enable the option With target snapshot, meaning that the target rowset will be pre-populated with the target rows in the scope of the Data Load. Otherwise this rowset is empty. For details see the note below.
Rowset holds the rows to be loaded in the target Field Collection. Initially this rowset is empty.
Target and Source are automatically generated. You need to specify these only if you set up the Data Load manually.
Source Snapshot Option
Available for Calculation Data Loads. This option is enabled by default, which means that the source DatamartRowSet is populated with all the source data.
If disabled, the Data Load job does not populate the source DatamartRowSet. This is useful when the calculation logic generates or queries the data itself, rather than iterating over the (pre-populated) source row set.
Target Snapshot Option
The target rowset represents the updates you want to apply to the actual target data once the Data Load execution is ended. So an empty target rowset means no change will be applied.
If the With target snapshot option is:
Enabled – The Data Load starts by copying the whole target into the rowset (in the scope defined by the
DL.filter
).Disabled – The target rowset is empty; no change is applied.
Note: To delete data, you add a row with the correct key values, and set its isDeleted
field to true
. See also a Knowledge Base article on this topic.
Example of use: The option With target snapshot can be used in the following advanced use case. When a Calculation job is started, there are two DatamartRowSets available in the formula context: source and target (api.getDatamartRowSet("source")...). The target rowset is initially empty, unless the 'With target snapshot' was checked. The typical use case would be a Flush, where the data to be loaded in the Data Source depends on which data is already there, with Groovy code using the DatamartRowSet API to find and inspect existing rows etc.
Found an issue in documentation? Write to us.
Pricefx version 13.1