Tags: azure, parquet, azure-timeseries-insights

Parquet file written by Azure Time Series Insights Preview is not readable


We have an Azure Time Series Insights Preview instance connected to an event hub. The incoming events are written to the associated cold storage account as parquet files. When I try to open a parquet file with various readers (such as the parquet-[head|cat|etc.] command-line tools), I get errors.

Output of parquet-head

org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:20200123140854700_c8876d10_01.parquet

Here is the issue in more detail; this is the output of parquet-dump:

$ parquet-dump 20200123140854700_c8876d10_01.parquet
row group 0
--------------------------------------------------------------------------------
timestamp:                            INT64 SNAPPY DO:0 FPO:4 SZ:100/850/8.50 VC:100 ENC:PLAIN,RLE ST:[min: 2020-01-23T14:08:52.583+0000, max: 2020-01-23T14:08:52.583+0000, num_nulls: 0]
id_string:                            BINARY SNAPPY DO:167 FPO:194 SZ:80/76/0.95 VC:100 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[min: dabas96, max: dabas96, num_nulls: 0]
dabasuploader_time_string:            BINARY SNAPPY DO:313 FPO:855 SZ:705/2177/3.09 VC:100 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[num_nulls: 0, min/max not defined]
dabasuploader_prod_kwh_string:        BINARY SNAPPY DO:1118 FPO:1139 SZ:62/58/0.94 VC:100 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[min: 0, max: 0, num_nulls: 0]
dabasuploader_pred_nxd_kwh_string:    BINARY SNAPPY DO:1252 FPO:1488 SZ:319/390/1.22 VC:100 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[num_nulls: 0, min/max not defined]
dabasuploader_pred_today_kwh_string:  BINARY SNAPPY DO:1650 FPO:1903 SZ:336/404/1.20 VC:100 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[num_nulls: 0, min/max not defined]

java.lang.IllegalArgumentException: [solpos_altitude_double] optional double solpos_altitude_double is not in the store: [[dabasuploader_time_string] optional binary dabasuploader_time_string (STRING), [dabasuploader_pred_nxd_kwh_string] optional binary dabasuploader_pred_nxd_kwh_string (STRING), [id_string] optional binary id_string (STRING), [timestamp] optional int64 timestamp (TIMESTAMP(MILLIS,true)), [dabasuploader_pred_today_kwh_string] optional binary dabasuploader_pred_today_kwh_string (STRING), [dabasuploader_prod_kwh_string] optional binary dabasuploader_prod_kwh_string (STRING)] 100
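For completeness, here is a minimal sketch of how the same file could be inspected programmatically. It assumes Python with pyarrow installed; only the file name is real, the rest is illustrative, and I have not confirmed whether pyarrow fails in exactly the same way as the Java tools.

import pyarrow.parquet as pq

path = "20200123140854700_c8876d10_01.parquet"
pf = pq.ParquetFile(path)

# The footer schema includes solpos_altitude_double, even though the
# row group in the dump above does not store that column.
print(pf.schema_arrow)
print(pf.metadata.num_row_groups)

# Reading the whole file is where the readers above fail; pyarrow may
# hit a comparable error when a row group lacks an advertised column.
table = pf.read()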

The solpos_altitude_double column comes from the events we upload to the event hub; on our side the field is called solpos_altitude, and the _double suffix is added by TSI, according to the docs.
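To illustrate the naming (the payload below is hypothetical; only the field names and the id value come from the dump above):

# Hypothetical event payload sent to the event hub (illustrative values only).
event = {
    "id": "dabas96",
    "solpos_altitude": 23.4,   # a double on our side
}
# TSI appends a type suffix per property when flattening to parquet,
# so this field is stored as the column "solpos_altitude_double".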

According to all the Microsoft Azure documentation I could find, reading the parquet files should work without issues.

Does anybody know what went wrong? If more info is needed, I am more than happy to provide it.


Solution

  • I believe this is a known issue caused by changing the schema of the data (drifting schema). We're currently working on a fix for it.
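Until that fix is available, one possible interim workaround (a sketch only, assuming Python with pyarrow, and not verified against these exact files) is to read just the columns a given row group actually stores, skipping the ones the footer schema advertises but the row group never wrote:

import pyarrow.parquet as pq

path = "20200123140854700_c8876d10_01.parquet"
pf = pq.ParquetFile(path)

# Column paths physically present in the first row group.
rg = pf.metadata.row_group(0)
present = {rg.column(i).path_in_schema for i in range(rg.num_columns)}

# Read only those columns; solpos_altitude_double (advertised in the
# footer schema but missing from the row group in the dump) is left out.
table = pf.read(columns=sorted(present))
print(table.schema)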