pfx-gp:loaddata
Loads data into Greenplum. The data must be present as CSV files (possibly gzipped) on the Greenplum server. The component does not solve the transport of the data files to the server.
In the background, the method does three steps:
Generate a control file for the data load.
Upload the control file to the Greenplum server.
Execute the
gpload
binary on the Greenplum server.
To learn more about gpload and control file, see the Greenplum documentation.
Component Input
Input | Location | Type | Description |
---|---|---|---|
| Exchange body | String | (Mandatory) Specifies the location of a file, named pipe, or directory location on the local file system that contains data to be loaded. You can declare more than one file as long as the data is of the same format in all files specified. If the files are compressed using gzip or bzip2 (have a .gz or .bz2 file extension), the files will be uncompressed automatically (provided that gunzip or bunzip2 is in your path). When specifying which source files to load, you can use the wildcard character (*) or other C-style pattern matching to denote multiple files. The files specified are assumed to be relative to the current directory from which gpload is executed (or you can declare an absolute path). |
| Header PfxGp.InputColumns | List<Map<String, String>> | Specifies the schema of the source data file(s) in the format of If the input The default source-to-target mapping is based on a match of column names as defined in this section and the column names in the target TABLE. This default mapping can be overridden using the |
| Header | List<Map<String, String>> | If a mapping is specified, it overrides the default source-to-target column mapping. The default source-to-target mapping is based on a match of column names as defined in the source COLUMNS section and the column names of the target TABLE. A mapping is specified as either: |
Properties
Option Category | Option | Type | Default Value | Description |
---|---|---|---|---|
Greenplum control file options |
| String | 1.0.0.1 | The version of the gpload control file schema. |
| String |
| Specifies which database in Greenplum to connect to. (Mandatory) | |
| String |
| Specifies the Greenplum master host name. (Mandatory) | |
| Integer |
| Specifies the Greenplum master port. (Mandatory) | |
| String |
| Specifies which database role to use to connect. (Mandatory) | |
| String |
| Specifies password of a user to use to connect. (Mandatory) | |
| Integer | 8000 | Together with | |
| Integer | 9000 | Together with | |
| Enum | CSV | Specifies the format of the source data file(s) – either plain text (TEXT) or comma-separated values (CSV) format. Values: CSV, TEXT | |
| String | , | Specifies a single ASCII character that separates columns within each row (line) of data. The default is a tab character in the TEXT mode, a comma in the CSV mode. You can also specify a non-printable ASCII character or a non-printable Unicode character, for example: "\x1B" or "\u001B". The escape string syntax, E'<character-code>', is also supported for non-printable characters. The ASCII or Unicode character must be enclosed in single quotes. For example: E'\x1B' or E'\u001B'. | |
| String | \ | Specifies a single character that is used for C escape sequences (such as \n,\t,\100, and so on) and for escaping data characters that might otherwise be taken as row or column delimiters. Make sure to choose an escape character that is not used anywhere in your actual column data. It is also possible to disable escaping in text-formatted files by specifying the value 'OFF' as the escape value. This is very useful for data such as text-formatted web log data that has many embedded backslashes that are not intended to be escaped. | |
| String | " | Required when | |
| Boolean | false | Specifies that the first line in the data file(s) is a header row (contains the names of the columns) and should not be included as data to be loaded. If using multiple data source files, all files must have a header row. The default is to assume that the input files do not have a header row. | |
| String array |
| Names of columns that cannot be null. In the CSV mode, it processes each specified column as though it were quoted and hence not a NULL value. For the default null string in the CSV mode (nothing between two delimiters), this causes missing values to be evaluated as zero-length strings. | |
| Integer | 0 | Enables the single row error isolation mode for this load operation. When enabled, input rows that have format errors will be discarded provided that the error limit count is not reached on any Greenplum segment instance during input processing. If the error limit is not reached, all good rows will be loaded and any error rows will either be discarded or logged to an error table. The default is to abort the load operation on the first error encountered. Note that single row error isolation only applies to data rows with format errors; for example, extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences. Constraint errors, such as primary key violations, will still cause the load operation to abort if encountered. | |
| Boolean | false | If true, gpload creates an internal error table where rows with formatting errors are logged when running in the single row error isolation mode. You can then examine this error table by using GPDB built-in function | |
| String |
| The name of the target table to load into. (Mandatory) | |
| Enum | MERGE | There are three available load modes:
| |
| String array |
| Required if | |
| String array |
| Required if | |
| String |
| Specifies a Boolean condition (similar to what you would declare in a WHERE clause) that must be met in order for a row in the target table to be updated (or inserted in the case of a MERGE). | |
| Boolean | false | If set to true, gpload will remove all rows in the target table prior to loading it. | |
| Boolean | false | If set to true, gpload will not drop the external table objects and staging table objects it creates. These objects will be reused for future load operations that use the same load specifications. This improves the performance of trickle loads (ongoing small loads to the same target table). | |
SSH server options |
| String |
| Specifies a hostname of the Greenplum server that will be used for connection via SSH. (Mandatory) |
| Integer | 22 | Specifies a port on the Greenplum server that will be used for connection via SSH. | |
| String |
| Specifies a username that will be used for connection via SSH. (Mandatory) | |
| String |
| Specifies a password that will be used for connection via SSH. (Mandatory) | |
| String | / | Specifies a starting directory. The folder will be used for upload of the gpload control file. Usage of the user's home directory (or its subdirectory) is recommended. Limitation: The directory must exist – SCP is not able to create directories. |
Loaddata Response
pfx-gp:loaddata returns a response that describes the result of data load with the following attributes:
Response Attributes | Description |
---|---|
| Start time of data load. |
| Schema of the destination table. |
| Destination table name. |
| gpdist raw command string. |
| Number of data formatting errors. |
| Number of updated rows. |
| Number of inserted rows. |
| Number of bad rows. |
| Running time of data load. |
| List of error messages. |
| List of warning messages. |
| Status of data load. Values: SUCCESS, SUCCESS WITH WARNINGS, FAILED |
| List of unmatched lines of data load process output. |
Examples
This example demonstrates usage of the pfx-gp component. The route loads data from CSV files (on the Greenplum server) in /home/gpadmin/data/ into the 'transactions' table in the 'test' database.
Sample Data
The content of the file with the data to be loaded located on the Greenplum server as /home/gpadmin/data/data1.csv
abc1,50.27
abc2,124.56
abc3,87.11
abc4,42.64
Greenplum Table
There is the 'transactions' table in Greenplum that will be used for the data load.
create table transactions(
transaction_id varchar primary key,
net_amount numeric
);
Data Loading Route
The route will load data from all files located in /home/gpadmin/data/* into the Greenplum database table.
<routes xmlns="http://camel.apache.org/schema/spring">
<route id="gp-component-test">
<!-- Trigger processing -->
<from uri="timer:in?repeatCount=1"/>
<!-- Specifies the schema of the source data file(s) -->
<setHeader headerName="PfxGp.InputColumns">
<groovy>[ ['transaction_id':'varchar(255)'], ['net_amount': 'numeric'] ]</groovy>
</setHeader>
<!-- Set path mask for csv files that will be loaded into Greenplum -->
<setBody><simple>/home/gpadmin/data/*</simple></setBody>
<!-- Execute loaddata -->
<to uri="pfx-gp:loaddata?database=test&masterHost=localhost&masterPort=5432&dbUser=gpadmin&dbPassword=password&table=Transactions&updateColumns=net_amount&matchColumns=transaction_id&errorLimit=10&logErrors=true&sshHost=127.0.0.1&sshPort=2022&sshUsername=gpadmin&sshPassword=gpadmin&sshDirectory=/home/gpadmin"/>
<!-- Optional - log gpload output -->
<log message="${body}"/>
</route>
</routes>
IntegrationManager version 5.8.0