Search code examples
t-sqlparquetazure-synapseazure-data-lake-gen2

create external table with datetime/date-column from parquet with Synapse


I try to create an external table from an external file stored in my Azure datalake.
But I struggle with datetime and date columns in parquet-files!

Here is the code with which I try to create the external-Table
PS: I also tried using date and datetime but got errors.. though varchar should yield at least something, right?

create external table [parquet_userdata]
(
    registration_time [varchar] (100) null,
    id [varchar] (10) null,
    first_name [varchar] (100) null,
    last_name [varchar] (100) null,
    email [varchar] (100) null,
    gender [varchar] (100) null,
    ip_address [varchar] (100) null,
    cc [varchar] (100) null,
    country [varchar] (100) null,
    birthday [varchar] (100) null,
    salary [float] null,
    title [varchar] (100) null,
    comments [varchar] (2000) null
)
with(
    -- specify the file in container
    location= 'userdata.parquet',
    data_source = parquet_location,
    file_format = parquetformat
)

This is what I get:
enter image description here

Here is the data that I used (I took userdata1.parquet):
https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet

In case of doubt about the usability of the data, here we can quickly validate the content of the data and what it should look like:
https://www.parquet-viewer.com/#parquet-online

enter image description here


Solution

  • Here, the external table your creating using parquet files is sensitive to column names. You need to give exact names.

    Below is the result got for this command.

    create  external  table [parquet_userdata_c]
    (
        c1 [varchar] (1000) null,
        c2 [varchar] (10) null,
        c3 [varchar] (100) null,
        c4 [varchar] (100) null,
        c5 [varchar] (100) null,
        c6 [varchar] (100) null,
        c7 [varchar] (100) null,
        c8 [varchar] (100) null,
        c9 [varchar] (100) null,
        c10 [varchar] (1000) null,
        c11 [float] null,
        c12 [varchar] (100) null,
        c13 [varchar] (2000) null
    )
    with(
    location='/data/Databricks/userdata1.parquet',
    data_source = parquet_location,
    file_format = user_file_format
    );
    

    Output:

    enter image description here

    So, after adding exact names.

    create  external  table [userdata]
    (
        registration_dttm DATETIME2(7) null,
        id [varchar] (10) null,
        first_name [varchar] (100) null,
        last_name [varchar] (100) null,
        email [varchar] (100) null,
        gender [varchar] (100) null,
        ip_address [varchar] (100) null,
        cc [varchar] (100) null,
        country [varchar] (100) null,
        birthdate [varchar] (1000) null,
        salary [float] null,
        title [varchar] (100) null,
        comments [varchar] (2000) null
    )
    with(
        location='/data/Databricks/userdata1.parquet',
        data_source = parquet_location,
        file_format = user_file_format
        );
    

    Output:

    enter image description here

    If you don't now the column name, you check with the OPENROWSET as below.

    SELECT
        TOP  5 *
    FROM
    OPENROWSET(
        BULK  'https://jgsblob.blob.core.windows.net/data/Databricks/userdata1.parquet',
        FORMAT = 'PARQUET'
    ) AS [result]
    

    enter image description here