I have no trouble reading a standalone JSON file that consists of a single line in the format {xxx...}. However, when I compress it using tar -zcvf into one-file.json.gz and attempt to read it, I get a single column, as the schema shows:
root
|-- _corrupt_record: string (nullable = true)
This is the code I use to read that gzip file:
df = (
    spark.read.option("recursiveFileLookup", "true")
    .json("../../one-file.json.gz")
)
When I try to use AWS Glue, it raises an exception like:
"Failure Reason": "Unable to parse file: one-file.json.gz\n"
I want to know what's wrong here.
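The reason is simple: the file is not what its name suggests. tar -zcvf creates a gzip-compressed tar archive, not a gzip-compressed JSON file. Spark decompresses the gzip layer based on the .gz extension, but what it finds underneath is a tar archive (tar headers plus the file contents), which is not valid JSON, so every line ends up in _corrupt_record. You have to extract the archive before Spark reads it. You can use the tarfile module to do it like: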
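import tarfile

# Despite its name, the file is a gzip-compressed tar archive, so open it with tarfile.
with tarfile.open("../../one-file.json.gz", "r:gz") as tar_file:
    # "extracted" is an example output directory; extract only archives you trust.
    tar_file.extractall("extracted")

# Pass Spark the directory of extracted JSON files, not the TarFile object.
df = spark.read.option("recursiveFileLookup", "true").json("extracted")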
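If you are not sure whether a given .gz file is a tar archive or a plain gzip file, a small check before handing it to Spark can help. The following is a minimal sketch using only the standard library; the helper names build_demo_files and extract_if_tar, the demo file names, and the sample record are made up for illustration and are not part of Spark or Glue.

```python
import gzip
import json
import os
import shutil
import tarfile


def build_demo_files(workdir):
    """Create a tar+gzip archive and a plain gzip file from the same JSON line."""
    line = json.dumps({"id": 1, "name": "demo"}) + "\n"  # hypothetical record
    json_path = os.path.join(workdir, "one-file.json")
    with open(json_path, "w") as f:
        f.write(line)

    # Roughly what `tar -zcvf tarred.json.gz one-file.json` produces.
    tar_gz_path = os.path.join(workdir, "tarred.json.gz")
    with tarfile.open(tar_gz_path, "w:gz") as tar:
        tar.add(json_path, arcname="one-file.json")

    # Roughly what plain gzip compression of the same file produces.
    plain_gz_path = os.path.join(workdir, "plain.json.gz")
    with gzip.open(plain_gz_path, "wt") as f:
        f.write(line)
    return tar_gz_path, plain_gz_path


def extract_if_tar(path, dest_dir):
    """Extract regular files from a tar archive and return their new paths.

    Returns an empty list when `path` is not a tar archive (for example a
    plain gzip file). Member names are flattened to their base name, so
    members with the same file name would overwrite each other.
    """
    if not tarfile.is_tarfile(path):
        return []
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with tarfile.open(path, "r:*") as tar:  # "r:*" auto-detects the compression
        for member in tar.getmembers():
            if not member.isfile():
                continue
            target = os.path.join(dest_dir, os.path.basename(member.name))
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return extracted
```

If extract_if_tar returns a non-empty list, those paths can be passed to spark.read.json, which accepts a list of paths. If it returns an empty list, the file is probably a plain gzip file that Spark can read directly. That also points to the simpler fix when you control the compression step: compress with gzip rather than tar -zcvf, and no extraction is needed.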