SQL External Table Functions

SQL External Table Functions are built-in table functions used to read data from a source external to the database. The table functions are used in SQL statements where external data is used in-place of a regular SQL table or subquery. The location of the data files can be on the local file system, or remote, such as Amazon S3.

The result set of a table function is a temporary table that exists during SQL statement execution. This table can be queried further, joined with other tables, or used in any other way that a regular table can be used.

External Data Source

The following file formats are supported:

  • CSV

  • Parquet

If the source data is not formatted correctly, the data may not be imported as expected, or may fail completely.

Local File Access Configuration

An external data directory must be configured before accessing local files. Once configured, only files located within this directory may be read using the READ_CSV and READ_PARQUET table functions. The external data directory can be configured using the external-data-dir database option. The path must be an absolute path to a directory on the local file system. An external data directory can be configured for each TE.

Remote File Access Configuration

Remote files can be read directly via HTTPS or Amazon S3. Accessing files via HTTPS does not require any configuration. Accessing files located on Amazon S3 requires configuration of credentials.

The following S3 credentials are required:

  • S3_ACCESS_KEY_ID

  • S3_REGION

  • S3_SECRET_ACCESS_KEY

Use SET to configure the Access Key ID, Region, and Secret Access Key. For more information, see SET.

Amazon S3 credentials can be configured for each connection. The pseudo table SYSTEM.CONNECTIONPROPERTIES can be used to check if credentials are configured.

READ_CSV

READ_CSV returns the data in the specified CSV file. It takes only one parameter, the relative or absolute path to the CSV file. The path can be local or remote. The path to the file can be relative to external-data-dir, or absolute. If path is absolute, it must be located within the external-data-dir.

The CSV source file must be valid UTF-8.

The data in the CSV file must be of type BOOLEAN, BIGINT, DOUBLE, TIME, DATE, TIMESTAMP, or VARCHAR. Any other data type will be interpreted as VARCHAR.

Syntax

READ_CSV('path/to/the/file.csv')
READ_CSV('s3://path/to/the/file.csv')
READ_CSV('https://path/to/the/file.csv')

Example

SELECT
    ID, NAME, AGE, COUNTRY
FROM
    READ_CSV('people.csv')
WHERE
    ID < 6;
 ID   NAME   AGE   COUNTRY
 --- ------- ---- ---------

  1  Alice    25  USA
  2  Bob      30  Canada
  3  Charlie  35  UK
  4  Diana    40  Germany
  5  Evan     28  France

READ_PARQUET

READ_PARQUET returns the data in the specified parquet file. It takes only one parameter, the relative or absolute path to the parquet file. The path can be local or remote. The path to the file can be relative to external-data-dir, or absolute. If path is absolute, it must be located within the external-data-dir.

Syntax

READ_PARQUET('path/to/the/parquet/file.parquet')
READ_PARQUET('s3://path/to/the/file.parquet')
READ_PARQUET('https://path/to/the/file.parquet')

Example

SELECT
    PRICE, PARKING, FURNISHINGSTATUS
FROM
    READ_PARQUET('docs/test/houseprices.parquet')
WHERE
    AREA > 12000
ORDER BY PRICE;
  PRICE   PARKING  FURNISHINGSTATUS
 -------- -------- -----------------

  3500000    0      unfurnished
  4900000    2      furnished
  5943000    2      semi-furnished
  6790000    2      furnished
  6930000    1      furnished
  9800000    2      furnished
 10150000    0      unfurnished