I am trying to create an external table from a Parquet file stored in my Azure Data Lake.
But I am struggling with the datetime and date columns in Parquet files!
Here is the code with which I try to create the external table.
PS: I also tried using date and datetime but got errors... though varchar should at least yield something, right?
create external table [parquet_userdata]
(
registration_time [varchar] (100) null,
id [varchar] (10) null,
first_name [varchar] (100) null,
last_name [varchar] (100) null,
email [varchar] (100) null,
gender [varchar] (100) null,
ip_address [varchar] (100) null,
cc [varchar] (100) null,
country [varchar] (100) null,
birthday [varchar] (100) null,
salary [float] null,
title [varchar] (100) null,
comments [varchar] (2000) null
)
with(
-- specify the file in container
location= 'userdata.parquet',
data_source = parquet_location,
file_format = parquetformat
)
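For completeness: the statement assumes the external data source (parquet_location) and file format (parquetformat) already exist. Roughly, I created them along these lines (the storage URL below is a placeholder, not my actual account; private storage may additionally need a database scoped credential):
create external file format parquetformat
with (format_type = parquet);
create external data source parquet_location
with (
    -- placeholder URL; point this at your own storage account/container
    location = 'https://<storage-account>.dfs.core.windows.net/<container>'
);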
Here is the data that I used (I took userdata1.parquet):
https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet
In case of doubt about the usability of the data, the file content can be quickly validated here to see what it should look like:
https://www.parquet-viewer.com/#parquet-online
Here, the external table you're creating over Parquet files is sensitive to column names: you need to use the exact names from the file.
Below is the result I got for this command.
create external table [parquet_userdata_c]
(
c1 [varchar] (1000) null,
c2 [varchar] (10) null,
c3 [varchar] (100) null,
c4 [varchar] (100) null,
c5 [varchar] (100) null,
c6 [varchar] (100) null,
c7 [varchar] (100) null,
c8 [varchar] (100) null,
c9 [varchar] (100) null,
c10 [varchar] (1000) null,
c11 [float] null,
c12 [varchar] (100) null,
c13 [varchar] (2000) null
)
with(
location='/data/Databricks/userdata1.parquet',
data_source = parquet_location,
file_format = user_file_format
);
Output:
So, after adding the exact column names:
create external table [userdata]
(
registration_dttm DATETIME2(7) null,
id [varchar] (10) null,
first_name [varchar] (100) null,
last_name [varchar] (100) null,
email [varchar] (100) null,
gender [varchar] (100) null,
ip_address [varchar] (100) null,
cc [varchar] (100) null,
country [varchar] (100) null,
birthdate [varchar] (1000) null,
salary [float] null,
title [varchar] (100) null,
comments [varchar] (2000) null
)
with(
location='/data/Databricks/userdata1.parquet',
data_source = parquet_location,
file_format = user_file_format
);
Output:
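As a quick sanity check you can select a few rows; registration_dttm should now come back as a typed datetime value rather than as text (the column list here is just an example):
select top 5 registration_dttm, id, birthdate
from [userdata];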
If you don't know the column names, you can check them with OPENROWSET as below.
SELECT TOP 5 *
FROM OPENROWSET(
    BULK 'https://jgsblob.blob.core.windows.net/data/Databricks/userdata1.parquet',
    FORMAT = 'PARQUET'
) AS [result]
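If you want OPENROWSET itself to return typed columns instead of relying on type inference, you can also add a WITH clause that maps columns by name. A small sketch (the types here are my assumption about the file; adjust as needed):
SELECT TOP 5 *
FROM OPENROWSET(
    BULK 'https://jgsblob.blob.core.windows.net/data/Databricks/userdata1.parquet',
    FORMAT = 'PARQUET'
)
WITH (
    -- map only the columns you care about; names must match the parquet schema
    registration_dttm DATETIME2(7),
    id INT,
    birthdate VARCHAR(100)
) AS [result]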