Tags: google-bigquery, parquet, parquet-mr

INT32 type error when scanning parquet federated table. Bug or Expected behavior?


I am using BigQuery to query an external data source (also known as a federated table), where the source data is a Hive-partitioned Parquet table stored in Google Cloud Storage. I used this guide to define the table.

My first query to test this table looks like the following

SELECT * FROM my_dataset.my_table WHERE year=2019 AND day = "2019-01-01" LIMIT 10

This query fails with the following error

Column visitor_partition has type INT64, which differs from the expected type of INT32

I want to know why this error occurs and how to work around it or solve it. I have performed a bit of investigation using the parquet-tools library to peek inside the internals of my parquet data.

When I run java -jar ./parquet-tools-1.10.0.jar meta test.c000.gz.parquet | grep visitor_partition over one of my parquet files, the following relevant lines are returned

visitor_partition:          OPTIONAL INT64 R:0 D:1
visitor_partition:           INT64 GZIP DO:0 FPO:59420041 SZ:54561/537912/9.86 VC:633590 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 0, max: 99, num_nulls: 0]

When I run the schema command, the following relevant schema info comes up: optional int64 visitor_partition;
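If parquet-tools is not handy, another way to double-check the physical type is to read just the footer with pyarrow. This is only a sketch; the file name is the one from the example above:

import pyarrow.parquet as pq

# Reads only the footer metadata; no row data is loaded.
schema = pq.read_schema("test.c000.gz.parquet")
print(schema.field("visitor_partition"))  # e.g. visitor_partition: int64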

So it's clear that in this Parquet file, the data for the visitor_partition field is represented as an INT64. So why does BigQuery expect to see an INT32 type here?


Solution

  • Note that the schema of an external table is inferred from the last file, sorted lexicographically by file name, among all the files that match the table's source URI. There is a good chance that this particular Parquet file has a different schema from the one you described, e.g., an INT32 column with a DATE logical type for the visitor_partition field, which BigQuery would infer as a DATE type. See the sketch below for a way to check this.
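One way to confirm this is to list the objects that match the table's source URI, take the lexicographically last one (the file BigQuery uses for schema inference), and inspect its visitor_partition column. A rough sketch, assuming the google-cloud-storage and pyarrow packages and placeholder bucket/prefix names:

import tempfile

import pyarrow.parquet as pq
from google.cloud import storage

# Placeholders: substitute your real bucket and the prefix from the table's source URI.
BUCKET = "my-bucket"
PREFIX = "path/to/table/"

client = storage.Client()
blobs = [b for b in client.list_blobs(BUCKET, prefix=PREFIX) if b.name.endswith(".parquet")]

# BigQuery infers the external table schema from the lexicographically last matching file.
last_blob = max(blobs, key=lambda b: b.name)
print("Schema is inferred from:", last_blob.name)

with tempfile.NamedTemporaryFile(suffix=".parquet") as tmp:
    last_blob.download_to_filename(tmp.name)
    schema = pq.read_schema(tmp.name)
    # If this prints int32 / date32 instead of int64, that file explains the error.
    print(schema.field("visitor_partition"))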