pfx-gp:loaddata

Loads data into Greenplum. The data must already be present as CSV files (optionally gzipped) on the Greenplum server; the component does not transfer the data files to the server.

In the background, the component performs three steps:

  1. Generate a control file for the data load.

  2. Upload the control file to the Greenplum server.

  3. Execute the gpload binary on the Greenplum server. 

 

To learn more about gpload and its control file format, see the Greenplum documentation.
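For orientation, a generated control file follows the gpload YAML schema. The sketch below shows what such a file might look like for the options described in this document; all values (hosts, columns, table name) are illustrative, not output captured from the component:

```yaml
VERSION: 1.0.0.1
DATABASE: test
USER: gpadmin
HOST: mdw.example.com
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         PORT_RANGE: [8000, 9000]   # portFrom / portTo
         FILE:
           - /home/gpadmin/data/*
    - COLUMNS:                       # inputColumns
           - transaction_id: varchar(255)
           - net_amount: numeric
    - FORMAT: csv                    # format
    - DELIMITER: ','                 # delimiter
    - QUOTE: '"'                     # quote
    - ERROR_LIMIT: 10                # errorLimit
   OUTPUT:
    - TABLE: transactions            # table
    - MODE: merge                    # mode
    - MATCH_COLUMNS:                 # matchColumns
           - transaction_id
    - UPDATE_COLUMNS:                # updateColumns
           - net_amount
```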

Component Input

Input

Location

Type

Description


filePath

Exchange body

String

(Mandatory) Specifies the location of a file, named pipe, or directory location on the local file system that contains data to be loaded. You can declare more than one file as long as the data is of the same format in all files specified.

If the files are compressed using gzip or bzip2 (have a .gz or .bz2 file extension), the files will be uncompressed automatically (provided that gunzip or bunzip2 is in your path).

When specifying which source files to load, you can use the wildcard character (*) or other C-style pattern matching to denote multiple files. The files specified are assumed to be relative to the current directory from which gpload is executed (or you can declare an absolute path).

inputColumns

Header PfxGp.InputColumns

List<Map<String, String>>

Specifies the schema of the source data file(s) in the format of <field_name>: <data_type>. 

If inputColumns is not specified, the schema of the target TABLE is implied, meaning that the source data must have the same column order, number of columns, and data format as the target table.

The default source-to-target mapping is based on a match of column names as defined in this section and the column names in the target TABLE. This default mapping can be overridden using the columnMapping section.

columnMapping

Header PfxGp.Mapping

List<Map<String, String>>

If a mapping is specified, it overrides the default source-to-target column mapping. The default source-to-target mapping is based on a match of column names as defined in the source COLUMNS section and the column names of the target TABLE.

A mapping is specified as either: <target_column_name>: <source_column_name> or <target_column_name>: '<expression>'
where expression is any expression that you would specify in the SELECT list of a query, such as a constant value, a column reference, an operator invocation, a function call, and so on.
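For illustration, such a mapping could be supplied as the PfxGp.Mapping header in a Camel route. The column names and expression here are hypothetical:

```xml
<!-- Hypothetical mapping: fill target 'transaction_id' from source column 'id',
     and compute target 'net_amount' from an expression -->
<setHeader headerName="PfxGp.Mapping">
    <groovy>[ ['transaction_id': 'id'], ['net_amount': 'amount * 100'] ]</groovy>
</setHeader>
```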

Properties

Option Category

Option

Type

Default Value

Description


Greenplum control file options

version

String

1.0.0.1

The version of the gpload control file schema.

database

String

 

Specifies which database in Greenplum to connect to. (Mandatory)

masterHost

String

 

Specifies the Greenplum master host name. (Mandatory)

masterPort

Integer

 

Specifies the Greenplum master port. (Mandatory)

dbUser

String

 

Specifies which database role to use to connect. (Mandatory)

dbPassword

String

 

Specifies the password of the database user used to connect. (Mandatory)

portFrom

Integer

8000

Together with portTo defines the port range that the gpfdist file distribution program should use.

portTo

Integer

9000

Together with portFrom defines the port range that the gpfdist file distribution program should use.

format

Enum

CSV

Specifies the format of the source data file(s) – either plain text (TEXT) or comma-separated values (CSV) format.

Values: CSV, TEXT

delimiter

String

,

Specifies a single ASCII character that separates columns within each row (line) of data. The default is a tab character in the TEXT mode, a comma in the CSV mode. You can also specify a non-printable ASCII character or a non-printable Unicode character, for example: "\x1B" or "\u001B". The escape string syntax, E'<character-code>', is also supported for non-printable characters. The ASCII or Unicode character must be enclosed in single quotes. For example: E'\x1B' or E'\u001B'.

escape

String

\

Specifies a single character that is used for C escape sequences (such as \n,\t,\100, and so on) and for escaping data characters that might otherwise be taken as row or column delimiters. Make sure to choose an escape character that is not used anywhere in your actual column data. 

It is also possible to disable escaping in text-formatted files by specifying the value 'OFF' as the escape value. This is very useful for data such as text-formatted web log data that has many embedded backslashes that are not intended to be escaped.

quote

String

"

Required when format is CSV. Specifies the quotation character for the CSV mode.

header

Boolean

false

Specifies that the first line in the data file(s) is a header row (contains the names of the columns) and should not be included as data to be loaded. If using multiple data source files, all files must have a header row. The default is to assume that the input files do not have a header row.

forceNotNull

String array

 

Names of columns that cannot be null. In the CSV mode, it processes each specified column as though it were quoted and hence not a NULL value. For the default null string in the CSV mode (nothing between two delimiters), this causes missing values to be evaluated as zero-length strings.

errorLimit

Integer

0

Enables the single row error isolation mode for this load operation. When enabled, input rows that have format errors will be discarded provided that the error limit count is not reached on any Greenplum segment instance during input processing. If the error limit is not reached, all good rows will be loaded and any error rows will either be discarded or logged to an error table. The default is to abort the load operation on the first error encountered.

Note that single row error isolation only applies to data rows with format errors; for example, extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences. Constraint errors, such as primary key violations, will still cause the load operation to abort if encountered.

logErrors

Boolean

false

If true, gpload creates an internal error table in which rows with formatting errors are logged when running in the single row error isolation mode. You can then examine this error table using the GPDB built-in function gp_read_error_log() to see the error rows that were not loaded (if any). This can be enabled only if errorLimit is greater than zero.

table

String

 

The name of the target table to load into. (Mandatory)

mode

Enum

MERGE

There are three available load modes:

  • INSERT – Loads data into the target table using the following method: INSERT INTO target_table SELECT * FROM input_data;

  • UPDATE – Updates updateColumns of the target table where the rows have the matchColumns attribute values equal to those of the input data, and the optional updateCondition is true.

  • MERGE – Inserts new rows and updates the updateColumns of the existing rows where the matchColumns attribute values are equal to those of the input data, and the optional updateCondition is true. New rows are identified when the matchColumns value in the source data does not have a corresponding value in the existing data of the target table. In those cases, the entire row from the source file is inserted, not only the MATCH and UPDATE columns. If there are multiple new rows with the same matchColumns value, only one new row for that value will be inserted. Use updateCondition to filter out the rows to discard.
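Conceptually, the UPDATE mode can be pictured as the following SQL, where gpload joins the staged input data against the target table. The table and column names are illustrative, and this is not the literal statement gpload emits:

```sql
-- matchColumns: transaction_id; updateColumns: net_amount
UPDATE transactions t
SET    net_amount = s.net_amount
FROM   staging_data s                          -- staged input created by gpload
WHERE  t.transaction_id = s.transaction_id     -- join on matchColumns
  AND  t.net_amount <> s.net_amount;           -- optional updateCondition
```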

matchColumns

String array

 

Required if mode is UPDATE or MERGE. Specifies the column(s) to use as the join condition for the update. The attribute value in the specified target column(s) must be equal to that of the corresponding source data column(s) in order for the row to be updated in the target table.

updateColumns

String array

 

Required if mode is UPDATE or MERGE. Specifies the column(s) to update for the rows that meet the matchColumns criteria and the optional updateCondition.

updateCondition

String

 

Specifies a Boolean condition (similar to what you would declare in a WHERE clause) that must be met in order for a row in the target table to be updated (or inserted in the case of a MERGE).

truncate

Boolean

false

If set to true, gpload will remove all rows in the target table prior to loading it.

reuseTables

Boolean

false

If set to true, gpload will not drop the external table objects and staging table objects it creates. These objects will be reused for future load operations that use the same load specifications. This improves the performance of trickle loads (ongoing small loads to the same target table).

SSH server options

sshHost

String

 

Specifies a hostname of the Greenplum server that will be used for connection via SSH. (Mandatory)

sshPort

Integer

22

Specifies a port on the Greenplum server that will be used for connection via SSH.

sshUsername

String

 

Specifies a username that will be used for connection via SSH. (Mandatory)

sshPassword

String

 

Specifies a password that will be used for connection via SSH. (Mandatory)

sshDirectory

String

/

Specifies the directory on the Greenplum server to which the gpload control file is uploaded. Using the user's home directory (or one of its subdirectories) is recommended.

Limitation: The directory must exist – SCP is not able to create directories.

Loaddata Response

pfx-gp:loaddata returns a response that describes the result of the data load with the following attributes:

Response Attributes

Description


startTime

Start time of data load.

schema

Schema of the destination table.

table

Destination table name.

command

Raw gpload command string.

formattingErrors

Number of data formatting errors.

rowsUpdated

Number of updated rows.

rowsInserted

Number of inserted rows.

badRows

Number of bad rows.

runningTime

Running time of data load.

errors

List of error messages.

warnings

List of warning messages.

status

Status of data load. Values: SUCCESS, SUCCESS WITH WARNINGS, FAILED

unmatchedMessages

List of lines of the data load process output that could not be parsed into the attributes above.

 

Examples

This example demonstrates usage of the pfx-gp component. The route loads data from CSV files located on the Greenplum server in /home/gpadmin/data/ into the 'transactions' table in the 'test' database.

Sample Data

The content of the data file to be loaded, located on the Greenplum server at /home/gpadmin/data/data1.csv:

abc1,50.27
abc2,124.56
abc3,87.11
abc4,42.64

 

Greenplum Table 

The 'transactions' table in Greenplum will be used as the target of the data load.

create table transactions(
    transaction_id varchar primary key,
    net_amount numeric
);

 

Data Loading Route

The route will load data from all files located in /home/gpadmin/data/* into the Greenplum database table.

<routes xmlns="http://camel.apache.org/schema/spring">
    <route id="gp-component-test">
        <!-- Trigger processing -->
        <from uri="timer:in?repeatCount=1"/>

        <!-- Specifies the schema of the source data file(s) -->
        <setHeader headerName="PfxGp.InputColumns">
            <groovy>[ ['transaction_id':'varchar(255)'], ['net_amount': 'numeric'] ]</groovy>
        </setHeader>

        <!-- Set path mask for csv files that will be loaded into Greenplum -->
        <setBody><simple>/home/gpadmin/data/*</simple></setBody>

        <!-- Execute loaddata -->
        <to uri="pfx-gp:loaddata?database=test&amp;masterHost=localhost&amp;masterPort=5432&amp;dbUser=gpadmin&amp;dbPassword=password&amp;table=transactions&amp;updateColumns=net_amount&amp;matchColumns=transaction_id&amp;errorLimit=10&amp;logErrors=true&amp;sshHost=127.0.0.1&amp;sshPort=2022&amp;sshUsername=gpadmin&amp;sshPassword=gpadmin&amp;sshDirectory=/home/gpadmin"/>

        <!-- Optional - log gpload output -->
        <log message="${body}"/>
    </route>
</routes>

IntegrationManager version 5.8.0