I am using AWS Glue for the first time to crawl a large JSON file in an S3 bucket and create a new table schema. I created a new crawler and ran it manually. The crawler run completes without error, but when I check the logs, I see the following EOFException:
ERROR : Error java.io.EOFException retrieving file at s3://insurance-transparency-data/2022-09-05_796b7d27-c275-4e37-b4c8-be2e4c0c6eda_Aetna-Life-Insurance-Company.json.gz. Tables created did not infer schemas from this file.
I tried uploading a simple test JSON file to the same S3 bucket and ran the crawler against it, and it inferred the schema perfectly. So I don't think it is a problem with the permissions or the crawler config.
Any suggestions on how to debug further?
It turns out the EOFException had something to do with the file being gzipped. Saving the uncompressed file to S3 and running the crawler against it worked fine.
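For anyone debugging the same error, here is a minimal Python sketch for checking a downloaded copy of the .json.gz file locally. It assumes (the above does not confirm this) that the EOFException comes from a truncated or otherwise unreadable gzip stream. The function names `gzip_is_complete` and `decompress_to` are hypothetical helpers written for illustration, not part of Glue or any AWS library.

```python
import gzip
import shutil
import zlib


def gzip_is_complete(path, chunk_size=1024 * 1024):
    """Return True if the gzip file at `path` decompresses cleanly to the end.

    Reads in chunks so a large file is never held in memory at once.
    """
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        # EOFError: the stream ended before the end-of-stream marker (truncated).
        # OSError (includes gzip.BadGzipFile) / zlib.error: not gzip, or corrupt data.
        return False


def decompress_to(src_path, dst_path):
    """Write an uncompressed copy of `src_path` to `dst_path` (hypothetical helper)."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

If `gzip_is_complete` returns False, the upload was probably cut short or the file is not valid gzip. If it returns True, the file itself is readable, and uploading the output of `decompress_to` to S3 is the workaround described above.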