apache-spark, pyspark, aws-glue

PySpark fails to read a gzip of multiple JSON files


I have no issue reading a standalone JSON file consisting of a single line in the format {xxx...}. However, when I compress it with tar -zcvf into one-file.json.gz and attempt to read it, I get back a single column named

root
 |-- _corrupt_record: string (nullable = true)

This is the code I use to read that gzip file:

df = (
    spark.read.option("recursiveFileLookup", "true")
    .json("../../one-file.json.gz")
)

When I try AWS Glue instead, it raises an exception like:

"Failure Reason": "Unable to parse file: one-file.json.gz\n"

I want to know what's wrong here.
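The _corrupt_record column means Spark could not parse the file's bytes as JSON. One way to see why: tar -zcvf produces a tar archive, not a plain gzip file, and Python's standard tarfile module can detect that. A minimal, self-contained reproduction of the situation (file names mirror the question; the temporary directory is illustrative):

```python
import json
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()

# A single-line JSON file, like the one in the question.
json_path = os.path.join(workdir, "one-file.json")
with open(json_path, "w") as f:
    json.dump({"id": 1, "name": "example"}, f)

# Pack it the way `tar -zcvf one-file.json.gz one-file.json` would.
archive_path = os.path.join(workdir, "one-file.json.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(json_path, arcname="one-file.json")

# Despite the .json.gz extension, this is a tar archive,
# which is why Spark's JSON reader cannot parse it.
print(tarfile.is_tarfile(archive_path))  # → True
```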


Solution

  • The reason is simple: tar -zcvf does not produce a plain gzip file, it produces a tar archive, and Spark cannot parse a tar archive even when it is named .json.gz. You have to untar the file before it is read by Spark. You can use the tarfile module to do it like:

    import tarfile

    # Extract the tar archive (despite its .json.gz name) into a directory
    with tarfile.open("one-file.json.gz", "r:gz") as tar_file:
        tar_file.extractall(path="extracted")

    # Read the extracted plain JSON file(s) from that directory
    df = spark.read.option("recursiveFileLookup", "true").json("extracted")
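Alternatively, skip tar entirely: Spark's JSON reader decompresses plain gzip files (produced by gzip alone, not tar) transparently, so a real one-file.json.gz needs no extraction step. A sketch of writing such a file with the standard library (the path and record are illustrative; the commented spark.read.json call assumes an existing SparkSession):

```python
import gzip
import json
import os
import tempfile

workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "one-file.json.gz")

# Compress with gzip alone, equivalent to `gzip one-file.json`.
record = {"id": 1, "name": "example"}
with gzip.open(gz_path, "wt") as f:
    f.write(json.dumps(record))

# Spark reads such a file directly, no untarring needed:
#   df = spark.read.json(gz_path)

# Verify the file round-trips as plain gzipped JSON:
with gzip.open(gz_path, "rt") as f:
    print(json.loads(f.read()) == record)  # → True
```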