We have an Azure Time Series Insights Preview instance connected to an event hub. The incoming events are written to the associated cold storage account as parquet files. When I try to open one of these parquet files with various readers (such as the parquet-head, parquet-cat, etc. command-line tools), I get errors.
Output of parquet-head:
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:20200123140854700_c8876d10_01.parquet
Here is the issue in more detail. This is the output of parquet-dump:
$ parquet-dump 20200123140854700_c8876d10_01.parquet
row group 0
--------------------------------------------------------------------------------
timestamp:                            INT64 SNAPPY DO:0 FPO:4 SZ:100/850/8.50 VC:100 ENC:PLAIN,RLE ST:[min: 2020-01-23T14:08:52.583+0000, max: 2020-01-23T14:08:52.583+0000, num_nulls: 0]
id_string:                            BINARY SNAPPY DO:167 FPO:194 SZ:80/76/0.95 VC:100 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[min: dabas96, max: dabas96, num_nulls: 0]
dabasuploader_time_string:            BINARY SNAPPY DO:313 FPO:855 SZ:705/2177/3.09 VC:100 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[num_nulls: 0, min/max not defined]
dabasuploader_prod_kwh_string:        BINARY SNAPPY DO:1118 FPO:1139 SZ:62/58/0.94 VC:100 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[min: 0, max: 0, num_nulls: 0]
dabasuploader_pred_nxd_kwh_string:    BINARY SNAPPY DO:1252 FPO:1488 SZ:319/390/1.22 VC:100 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[num_nulls: 0, min/max not defined]
dabasuploader_pred_today_kwh_string:  BINARY SNAPPY DO:1650 FPO:1903 SZ:336/404/1.20 VC:100 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[num_nulls: 0, min/max not defined]

java.lang.IllegalArgumentException: [solpos_altitude_double] optional double solpos_altitude_double is not in the store:
  [[dabasuploader_time_string] optional binary dabasuploader_time_string (STRING),
   [dabasuploader_pred_nxd_kwh_string] optional binary dabasuploader_pred_nxd_kwh_string (STRING),
   [id_string] optional binary id_string (STRING),
   [timestamp] optional int64 timestamp (TIMESTAMP(MILLIS,true)),
   [dabasuploader_pred_today_kwh_string] optional binary dabasuploader_pred_today_kwh_string (STRING),
   [dabasuploader_prod_kwh_string] optional binary dabasuploader_prod_kwh_string (STRING)] 100
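For what it's worth, the mismatch is already visible in the footer metadata alone, without reading any data pages. Here is a minimal sketch, assuming pyarrow is available, that prints the footer schema next to the columns each row group actually carries (the file name is the same one as above):

import pyarrow.parquet as pq

# Opening the file only reads the footer; no data pages are touched here.
md = pq.ParquetFile("20200123140854700_c8876d10_01.parquet").metadata

# The footer schema lists solpos_altitude_double ...
print(md.schema)

# ... but the column chunks of row group 0 do not include it.
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    cols = [rg.column(j).path_in_schema for j in range(rg.num_columns)]
    print(f"row group {i}: {cols}")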
The solpos_altitude_double column comes from the events we upload to the event hub; in the events themselves the property is called solpos_altitude. The _double suffix is added by TSI, according to the docs.
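For context, this is roughly how we publish the events. A minimal sketch using the azure-eventhub Python SDK, with placeholder connection string and hub name (our real uploader differs in the details):

import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    "<event-hub-connection-string>", eventhub_name="<hub-name>")

# The event carries plain property names; TSI appends a type suffix when
# storing them, so the float below ends up as solpos_altitude_double.
event = {"id": "dabas96", "solpos_altitude": 23.7}

batch = producer.create_batch()
batch.add(EventData(json.dumps(event)))
producer.send_batch(batch)
producer.close()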
According to all the MS Azure documentation I could find, reading these parquet files should be possible without issues.
Does anybody know what went wrong? If more info is needed, I am more than happy to provide it.
I believe this is a known issue caused by the schema of the incoming data changing over time (schema drift): the file's footer schema ends up listing a column (here solpos_altitude_double) that some row groups were written without, which is exactly what the exception above reports. We're currently working on a fix for it.
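Until the fix ships, a possible workaround is to read the file one row group at a time and request only the columns that row group actually contains, so the reader never asks for the drifted column. A minimal sketch with pyarrow; note this is untested against TSI output and may still fail if the reader validates the full footer schema up front:

import pyarrow.parquet as pq

pf = pq.ParquetFile("20200123140854700_c8876d10_01.parquet")

tables = []
for i in range(pf.metadata.num_row_groups):
    rg = pf.metadata.row_group(i)
    # Columns physically present in this row group (possibly a subset
    # of the footer schema when the schema has drifted).
    present = [rg.column(j).path_in_schema for j in range(rg.num_columns)]
    tables.append(pf.read_row_group(i, columns=present))

for t in tables:
    print(t.num_rows, t.column_names)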