pfx-gp:loaddata

Loads data into Greenplum. The data must already be present as CSV files (optionally gzipped) on the Greenplum server; the component does not transfer the data files to the server.

In the background, the component performs three steps:

  1. Generate a control file for the data load.

  2. Upload the control file to the Greenplum server.

  3. Execute the gpload binary on the Greenplum server. 

 

To learn more about gpload and its control file format, see the Greenplum documentation.
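For orientation, a generated control file follows the gpload YAML schema. The sketch below shows what such a file might look like for the options described in this document; all values (hosts, columns, table name) are illustrative, not output captured from the component:

```yaml
VERSION: 1.0.0.1
DATABASE: test
USER: gpadmin
HOST: mdw.example.com
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         PORT_RANGE: [8000, 9000]   # portFrom / portTo
         FILE:
           - /home/gpadmin/data/*
    - COLUMNS:                       # inputColumns
           - transaction_id: varchar(255)
           - net_amount: numeric
    - FORMAT: csv                    # format
    - DELIMITER: ','                 # delimiter
    - QUOTE: '"'                     # quote
    - ERROR_LIMIT: 10                # errorLimit
   OUTPUT:
    - TABLE: transactions            # table
    - MODE: merge                    # mode
    - MATCH_COLUMNS:                 # matchColumns
           - transaction_id
    - UPDATE_COLUMNS:                # updateColumns
           - net_amount
```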

Component Input

Input

Location

Type

Description


filePath

Exchange body

String

(Mandatory) Specifies the location of a file, named pipe, or directory location on the local file system that contains data to be loaded. You can declare more than one file as long as the data is of the same format in all files specified.

If the files are compressed using gzip or bzip2 (have a .gz or .bz2 file extension), the files will be uncompressed automatically (provided that gunzip or bunzip2 is in your path).

When specifying which source files to load, you can use the wildcard character (*) or other C-style pattern matching to denote multiple files. The files specified are assumed to be relative to the current directory from which gpload is executed (or you can declare an absolute path).

inputColumns

Header PfxGp.InputColumns

List<Map<String, String>>

Specifies the schema of the source data file(s) in the format of <field_name>: <data_type>. 

If inputColumns is not specified, the schema of the target TABLE is implied, meaning that the source data must have the same column order, number of columns, and data format as the target table.

The default source-to-target mapping is based on a match of column names as defined in this section and the column names in the target TABLE. This default mapping can be overridden using the columnMapping section.

columnMapping

Header PfxGp.Mapping

List<Map<String, String>>

If a mapping is specified, it overrides the default source-to-target column mapping. The default source-to-target mapping is based on a match of column names as defined in the source COLUMNS section and the column names of the target TABLE.

A mapping is specified as either: <target_column_name>: <source_column_name> or <target_column_name>: '<expression>'
where expression is any expression that you would specify in the SELECT list of a query, such as a constant value, a column reference, an operator invocation, a function call, and so on.
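For illustration, such a mapping could be supplied as the PfxGp.Mapping header in a Camel route. The column names and expression here are hypothetical:

```xml
<!-- Hypothetical mapping: fill target 'transaction_id' from source column 'id',
     and compute target 'net_amount' from an expression -->
<setHeader headerName="PfxGp.Mapping">
    <groovy>[ ['transaction_id': 'id'], ['net_amount': 'amount * 100'] ]</groovy>
</setHeader>
```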

Properties

Option Category

Option

Type

Default Value

Description


Greenplum control file options

version

String

1.0.0.1

The version of the gpload control file schema.

database

String

 

Specifies which database in Greenplum to connect to. (Mandatory)

masterHost

String

 

Specifies the Greenplum master host name. (Mandatory)

masterPort

Integer

 

Specifies the Greenplum master port. (Mandatory)

dbUser

String

 

Specifies which database role to use to connect. (Mandatory)

dbPassword

String

 

Specifies the password of the database user used to connect. (Mandatory)

portFrom

Integer

8000

Together with portTo defines the port range that the gpfdist file distribution program should use.

portTo

Integer

9000

Together with portFrom defines the port range that the gpfdist file distribution program should use.

format

Enum

CSV

Specifies the format of the source data file(s) – either plain text (TEXT) or comma-separated values (CSV) format.

Values: CSV, TEXT

delimiter

String

,

Specifies a single ASCII character that separates columns within each row (line) of data. The default is a tab character in the TEXT mode, a comma in the CSV mode. You can also specify a non-printable ASCII character or a non-printable Unicode character, for example: "\x1B" or "\u001B". The escape string syntax, E'<character-code>', is also supported for non-printable characters. The ASCII or Unicode character must be enclosed in single quotes. For example: E'\x1B' or E'\u001B'.

escape

String

\

Specifies a single character that is used for C escape sequences (such as \n,\t,\100, and so on) and for escaping data characters that might otherwise be taken as row or column delimiters. Make sure to choose an escape character that is not used anywhere in your actual column data. 

It is also possible to disable escaping in text-formatted files by specifying the value 'OFF' as the escape value. This is very useful for data such as text-formatted web log data that has many embedded backslashes that are not intended to be escaped.

quote

String

"

Required when format is CSV. Specifies the quotation character for the CSV mode.

header

Boolean

false

Specifies that the first line in the data file(s) is a header row (contains the names of the columns) and should not be included as data to be loaded. If using multiple data source files, all files must have a header row. The default is to assume that the input files do not have a header row.

forceNotNull

String array

 

Names of columns that cannot be null. In the CSV mode, it processes each specified column as though it were quoted and hence not a NULL value. For the default null string in the CSV mode (nothing between two delimiters), this causes missing values to be evaluated as zero-length strings.

errorLimit

Integer

0

Enables the single row error isolation mode for this load operation. When enabled, input rows that have format errors will be discarded provided that the error limit count is not reached on any Greenplum segment instance during input processing. If the error limit is not reached, all good rows will be loaded and any error rows will either be discarded or logged to an error table. The default is to abort the load operation on the first error encountered.

Note that single row error isolation only applies to data rows with format errors; for example, extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences. Constraint errors, such as primary key violations, will still cause the load operation to abort if encountered.

logErrors

Boolean

false

If true, gpload creates an internal error table in which rows with formatting errors are logged when running in the single row error isolation mode. You can then examine this error table using the GPDB built-in function gp_read_error_log() to see the error rows that were not loaded (if any). This can be enabled only if errorLimit is greater than zero.

table

String

 

The name of the target table to load into. (Mandatory)

mode

Enum

MERGE

There are three available load modes:

  • INSERT – Loads data into the target table using the following method: INSERT INTO target_table SELECT * FROM input_data;

  • UPDATE – Updates updateColumns of the target table where the rows have the matchColumns attribute values equal to those of the input data, and the optional updateCondition is true.

  • MERGE – Inserts new rows and updates the updateColumns of the existing rows where the matchColumns attribute values are equal to those of the input data, and the optional updateCondition is true. New rows are identified when the matchColumns value in the source data does not have a corresponding value in the existing data of the target table. In those cases, the entire row from the source file is inserted, not only the MATCH and UPDATE columns. If there are multiple new rows with the same matchColumns value, only one new row for that value will be inserted. Use updateCondition to filter out the rows to discard.
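Conceptually, the UPDATE mode can be pictured as the following SQL, where gpload joins the staged input data against the target table. The table and column names are illustrative, and this is not the literal statement gpload emits:

```sql
-- matchColumns: transaction_id; updateColumns: net_amount
UPDATE transactions t
SET    net_amount = s.net_amount
FROM   staging_data s                          -- staged input created by gpload
WHERE  t.transaction_id = s.transaction_id     -- join on matchColumns
  AND  t.net_amount <> s.net_amount;           -- optional updateCondition
```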

matchColumns

String array

 

Required if mode is UPDATE or MERGE. Specifies the column(s) to use as the join condition for the update. The attribute value in the specified target column(s) must be equal to that of the corresponding source data column(s) in order for the row to be updated in the target table.

updateColumns

String array

 

Required if mode is UPDATE or MERGE. Specifies the column(s) to update for the rows that meet the matchColumns criteria and the optional updateCondition.

updateCondition

String

 

Specifies a Boolean condition (similar to what you would declare in a WHERE clause) that must be met in order for a row in the target table to be updated (or inserted in the case of a MERGE).

truncate

Boolean

false

If set to true, gpload will remove all rows in the target table prior to loading it.

reuseTables

Boolean

false

If set to true, gpload will not drop the external table objects and staging table objects it creates. These objects will be reused for future load operations that use the same load specifications. This improves the performance of trickle loads (ongoing small loads to the same target table).

SSH server options

sshHost

String

 

Specifies a hostname of the Greenplum server that will be used for connection via SSH. (Mandatory)

sshPort

Integer

22

Specifies a port on the Greenplum server that will be used for connection via SSH.

sshUsername

String

 

Specifies a username that will be used for connection via SSH. (Mandatory)

sshPassword

String

 

Specifies a password that will be used for connection via SSH. (Mandatory)

sshDirectory

String

/

Specifies the directory on the Greenplum server to which the gpload control file is uploaded. Using the user's home directory (or one of its subdirectories) is recommended.

Limitation: The directory must exist – SCP is not able to create directories.

Loaddata Response

pfx-gp:loaddata returns a response that describes the result of the data load with the following attributes:

Response Attributes

Description


startTime

Start time of data load.

schema

Schema of the destination table.

table

Destination table name.

command

Raw gpload command string.

formattingErrors

Number of data formatting errors.

rowsUpdated

Number of updated rows.

rowsInserted

Number of inserted rows.

badRows

Number of bad rows.

runningTime

Running time of data load.

errors

List of error messages.

warnings

List of warning messages.

status

Status of data load. Values: SUCCESS, SUCCESS WITH WARNINGS, FAILED

unmatchedMessages

List of lines of the data load process output that could not be parsed into the attributes above.

 

Examples

This example demonstrates usage of the pfx-gp component. The route loads data from CSV files located on the Greenplum server in /home/gpadmin/data/ into the 'transactions' table in the 'test' database.

Sample Data

The content of the data file to be loaded, located on the Greenplum server at /home/gpadmin/data/data1.csv:

abc1,50.27
abc2,124.56
abc3,87.11
abc4,42.64

 

Greenplum Table 

The 'transactions' table in Greenplum will be used as the target of the data load.

create table transactions(
    transaction_id varchar primary key,
    net_amount numeric
);

 

Data Loading Route

The route will load data from all files located in /home/gpadmin/data/* into the Greenplum database table.

<routes xmlns="http://camel.apache.org/schema/spring">
    <route id="gp-component-test">
        <!-- Trigger processing -->
        <from uri="timer:in?repeatCount=1"/>

        <!-- Specifies the schema of the source data file(s) -->
        <setHeader headerName="PfxGp.InputColumns">
            <groovy>[ ['transaction_id':'varchar(255)'], ['net_amount': 'numeric'] ]</groovy>
        </setHeader>

        <!-- Set path mask for csv files that will be loaded into Greenplum -->
        <setBody><simple>/home/gpadmin/data/*</simple></setBody>

        <!-- Execute loaddata -->
        <to uri="pfx-gp:loaddata?database=test&amp;masterHost=localhost&amp;masterPort=5432&amp;dbUser=gpadmin&amp;dbPassword=password&amp;table=transactions&amp;updateColumns=net_amount&amp;matchColumns=transaction_id&amp;errorLimit=10&amp;logErrors=true&amp;sshHost=127.0.0.1&amp;sshPort=2022&amp;sshUsername=gpadmin&amp;sshPassword=gpadmin&amp;sshDirectory=/home/gpadmin"/>

        <!-- Optional - log gpload output -->
        <log message="${body}"/>
    </route>
</routes>

IntegrationManager version 5.8.0