I've been playing around with lake databases in Azure Synapse, trying to export data from my dedicated SQL pool to the data lake using the CETAS method, then using a lake database to map/read the parquet files that were created.
The problem comes when I try to query the table created in the lake database: it returns no results, along with the error 'no dataset found for myexampletable'.
I've checked, and the files are in the correct location in the data lake, matching the location specified for the table in the lake database. The compression is set to 'snappy' when exporting from the dedicated SQL pool, and I've also set the compression to 'snappy' in the lake database table settings. Note that I've tried the uncompressed parquet format in both the SQL pool and the lake database; same error...
The only thing I've noticed is that the files in the data lake are created with an extension like '????.parq.snappy' or '????.parq', as opposed to the usual '????.parquet'. Not sure if this is just the dedicated SQL pool's way of naming the files?
I tried renaming them to '????.parquet' and that fixed the problem. The question is: how do we get the lake database to read '.parq' or '.parq.snappy' files? Or is there a way to get the Synapse dedicated SQL pool to export using the 'normal' file extension?
Any ideas how to fix this?
Edit: added example code I'm using (names changed of course)
CREATE EXTERNAL DATA SOURCE [SomeExternalDataSource] WITH (
    TYPE = HADOOP
    ,LOCATION = N'abfss://somecontainer@somestorageaccount.dfs.core.windows.net/SomeFolder'
);

CREATE EXTERNAL FILE FORMAT [CompressedParquetFormat] WITH (
    FORMAT_TYPE = PARQUET
    ,DATA_COMPRESSION = N'org.apache.hadoop.io.compress.SnappyCodec'
);

CREATE EXTERNAL TABLE [dbo].[SomeExternalTable] WITH (
    LOCATION = 'FolderOne/FolderTwo/Etc'
    ,DATA_SOURCE = [SomeExternalDataSource]
    ,FILE_FORMAT = [CompressedParquetFormat]
) AS
SELECT
    [Field1]
    ,[Field2]
    ,[and_so_on]
FROM
    [SomeSchema].[SomeTable];
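If anyone just needs to read the '.parq' files without renaming them, one thing worth trying is querying them ad hoc from the serverless SQL pool with OPENROWSET and an explicit wildcard, since OPENROWSET matches whatever path pattern you give it rather than filtering on '.parquet'. This is only a sketch reusing the placeholder names from my example above, and it doesn't fix the lake database table itself:

```sql
-- Serverless SQL pool: read the '.parq' files directly via an explicit wildcard.
-- '*.parq*' catches both '.parq' and '.parq.snappy'.
SELECT
    [Field1]
    ,[Field2]
FROM OPENROWSET(
    BULK 'abfss://somecontainer@somestorageaccount.dfs.core.windows.net/SomeFolder/FolderOne/FolderTwo/Etc/*.parq*',
    FORMAT = 'PARQUET'
) AS [rows];
```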
Just in case anyone else runs into this problem...
The problem was that the dedicated SQL pool exports the parquet files with the '.parq' file extension instead of the more common/standard '.parquet' extension. The lake database was of course looking for '.parquet' files (not '.parq' files), so even though the files were there it found none, hence the empty tables.
In short, lake databases and the serverless pool are built to expect the '.parquet' file extension, but the dedicated SQL pool writes '.parq' files. Why on Earth would anyone expect file extensions to be consistent within the same 'product', lol!?! Insert facepalm here...
I've reported this to Microsoft and they've confirmed you cannot change the file extension to the normal/standard '.parquet' extension; you're stuck with the file names and extension it generates. I've asked whether they will update (or fix ;-)) the dedicated SQL pools regarding this 'issue' in the future, but I'm still awaiting a reply...
As a workaround I'm still exporting from the dedicated SQL pool, but then I have to change the file extensions afterwards (i.e. rename the files using the storage API). Once that's done they get picked up by the lake database... doh!
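Here's a rough sketch of that renaming step in Python. The `fix_extension` helper is just my illustration of the mapping ('.parq' or '.parq.snappy' back to '.parquet'); `rename_parq_files` assumes the azure-storage-file-datalake package and its DataLakeServiceClient / rename_file API, so treat those details as assumptions to check against the SDK docs rather than a tested implementation:

```python
def fix_extension(name: str) -> str:
    """Map the dedicated pool's '.parq' / '.parq.snappy' names back to '.parquet'.

    Returns the name unchanged if it doesn't end in one of those extensions.
    """
    if name.endswith(".parq.snappy"):
        return name[: -len(".parq.snappy")] + ".parquet"
    if name.endswith(".parq"):
        return name[: -len(".parq")] + ".parquet"
    return name


def rename_parq_files(account_url: str, file_system: str, folder: str, credential) -> None:
    """Rename every '.parq'/'.parq.snappy' file under `folder` to '.parquet'.

    Sketch only: assumes the azure-storage-file-datalake package is installed
    and hasn't been run against a live storage account.
    """
    # pip install azure-storage-file-datalake
    from azure.storage.filedatalake import DataLakeServiceClient

    fs = DataLakeServiceClient(account_url, credential=credential).get_file_system_client(file_system)
    for path in fs.get_paths(path=folder, recursive=True):
        new_name = fix_extension(path.name)
        if not path.is_directory and new_name != path.name:
            # rename_file expects the new path prefixed with the file system name
            fs.get_file_client(path.name).rename_file(f"{file_system}/{new_name}")
```

Directories and files that already end in '.parquet' are left alone, so it should be safe to re-run on the same folder.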
Here's the link to my post reporting this to Microsoft and their replies