google-cloud-platform, pyspark, google-cloud-storage, gzip, google-cloud-dataproc

PySpark on GCP Dataproc - Partial reading of data for gzip-encoded Cloud Storage files


I have a workflow template in Google Dataproc that reads gzip-compressed JSON files from Google Cloud Storage with a schema. The files carry the following headers (and are therefore eligible for decompressive transcoding):

Content-Encoding:       gzip
Content-Type:           application/json
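
For reference, I can inspect the object metadata with the google-cloud-storage Python client. This is only a minimal sketch; the bucket and object names are placeholders:

# Sketch: check the headers on one of the objects (names are placeholders).
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-bucket").get_blob("path/to/file.json")

print(blob.content_encoding)  # expected: "gzip"
print(blob.content_type)      # expected: "application/json"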

I added the following option, found here in the Hadoop connector, to make sure that my workflow accepts gzip-encoded files:

gcloud dataproc workflow-templates set-managed-cluster $WORKFLOW_NAME \
        ...
        --properties=core:fs.gs.inputstream.support.gzip.encoding.enable=true
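
For completeness, when running outside a workflow template the same Hadoop property can, as far as I know, also be passed through Spark's spark.hadoop. prefix when building the session. A sketch (the app name is a placeholder):

from pyspark.sql import SparkSession

# Sketch: forward the GCS connector property to the Hadoop configuration
# via Spark's "spark.hadoop." prefix.
spark = (
    SparkSession.builder
    .appName("gzip-transcoding-read")  # placeholder app name
    .config("spark.hadoop.fs.gs.inputstream.support.gzip.encoding.enable", "true")
    .getOrCreate()
)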

I then read the GCS files with my schema using this PySpark line:

decodedDF = spark.read.schema(globalSchema).option("header", True).json(<list-of-GCS-folders-containing-json-files>)

So far so good: the workflow runs fine without error logs.

However, the output of my PySpark job is different from what I get when I run exactly the same workflow on uncompressed GCS files.

It seems (although I cannot confirm it) that only part of the file contents is processed: roughly 10% of the data, judging by the sum of values, yet the averages look coherent. Changing the gzip compression level of the JSON files also changes the results.
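
This is roughly how I compare the aggregates between the compressed and uncompressed copies of the same data (a sketch; the paths and the numeric column "value" are placeholders):

from pyspark.sql import functions as F

# Sketch: compare row count, sum and average between the two copies of the data.
# "value" is a placeholder numeric column; the GCS paths are placeholders too.
compressedDF = spark.read.schema(globalSchema).json("gs://my-bucket/compressed/")
uncompressedDF = spark.read.schema(globalSchema).json("gs://my-bucket/uncompressed/")

for label, df in [("compressed", compressedDF), ("uncompressed", uncompressedDF)]:
    print(label)
    df.agg(
        F.count("*").alias("rows"),
        F.sum("value").alias("sum_value"),
        F.avg("value").alias("avg_value"),
    ).show()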

How is that possible, and how can I fix the Dataproc workflow template to get correct results?


Solution

  • The issue is related to this GitHub issue.

    To sum up the thread, automatic decompression based on the Content-Encoding HTTP header is not compatible with the Hadoop filesystem layer: for such objects GCS reports the compressed size, which does not match the decompressed stream that Hadoop actually reads, so only part of the data ends up being processed.

    The solution is to remove the Content-Encoding header from the GCS files, suffix them with .gz, and set Content-Type: application/gzip, so that Spark/Hadoop handle the decompression themselves based on the file extension.
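
    A sketch of that cleanup with the google-cloud-storage Python client, under the assumption that the bucket name and prefix below are placeholders; it patches metadata and renames objects in place, so it is worth testing on a copy of the data first:

    # Sketch of the fix: clear Content-Encoding, set Content-Type to application/gzip,
    # then rename the object with a .gz suffix so that Spark/Hadoop decompress it
    # based on the file extension instead of relying on transcoding.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-bucket")

    for blob in client.list_blobs(bucket, prefix="path/to/json-folder/"):
        if blob.content_encoding == "gzip" and not blob.name.endswith(".gz"):
            blob.content_encoding = None                 # stop decompressive transcoding
            blob.content_type = "application/gzip"
            blob.patch()                                 # push the metadata change
            bucket.rename_blob(blob, blob.name + ".gz")  # server-side copy + delete

    The same change can also be made from the command line with gsutil setmeta (to adjust the headers) and gsutil mv (to add the .gz suffix).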