Tags: json, amazon-web-services, amazon-s3, aws-glue, amazon-athena

AWS Glue Crawler cannot parse large files (classification UNKNOWN)


I've been trying to use an AWS Glue crawler to obtain the columns and other schema features of a particular JSON file.

I prepared the JSON file locally by converting it to UTF-8, then used boto3 to move it into an S3 bucket, which the crawler reads from.
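For reference, the upload step was roughly the following (file names, bucket, and key are placeholders, and the source encoding here is just an assumption):

    # Re-encode the file as UTF-8, then upload it to S3 with boto3.
    # File names, bucket, key, and the source encoding are hypothetical.
    import boto3

    with open("data_raw.json", "r", encoding="latin-1") as src:
        text = src.read()
    with open("data_utf8.json", "w", encoding="utf-8") as dst:
        dst.write(text)

    s3 = boto3.client("s3")
    s3.upload_file("data_utf8.json", "my-glue-bucket", "input/data_utf8.json")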

I created a custom JSON classifier with the JSON path $[*] and set up a crawler with otherwise default settings.
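In boto3 terms, the setup is roughly equivalent to the following (classifier and crawler names, the role ARN, the database, and the S3 path are placeholders):

    # Rough boto3 equivalent of the classifier and crawler setup;
    # names, role, database, and S3 path are placeholders.
    import boto3

    glue = boto3.client("glue")

    glue.create_classifier(
        JsonClassifier={"Name": "json-array", "JsonPath": "$[*]"}
    )

    glue.create_crawler(
        Name="json-crawler",
        Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
        DatabaseName="my_database",
        Targets={"S3Targets": [{"Path": "s3://my-glue-bucket/input/"}]},
        Classifiers=["json-array"],
    )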

When I do this with a relatively small file (under 50 KB), the crawler correctly identifies the columns as well as the schema of the nested JSON layers inside the main JSON. However, with the file I actually need to process (around 1 GB), the crawler reports the classification as "UNKNOWN", cannot identify any columns, and so I cannot query it.

Any ideas what the issue is, or some kind of workaround?

I am ultimately trying to convert the data to Parquet and query it with Athena.
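If it matters, the Parquet conversion I have in mind is along the lines of an Athena CTAS query once the table is crawled (database, table names, and S3 locations are placeholders):

    # Hypothetical Athena CTAS query that rewrites the crawled table as Parquet;
    # database, table names, and S3 locations are placeholders.
    import boto3

    athena = boto3.client("athena")

    ctas = """
    CREATE TABLE my_database.my_table_parquet
    WITH (format = 'PARQUET', external_location = 's3://my-glue-bucket/parquet/')
    AS SELECT * FROM my_database.my_table
    """

    athena.start_query_execution(
        QueryString=ctas,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-glue-bucket/athena-results/"},
    )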

I've looked at the following post, but that solution did not work. I've already tried rewriting my classifier and crawler. I also presume these are not the core problem, because I used $[*] as my custom classifier and practically identical settings when crawling the smaller file, and that worked.

I'm beginning to think the cause is simply the large file size.


Solution

  • The following is the fix that I ended up using.

    I found that the AWS Glue crawler wants JSON objects separated by commas, with no outer array brackets.

    For example, if you had a large file in the following format:

    [
      {},
      {},
      {},...
    ]
    

    You can remove the first and last characters with something like str[1:-1], giving you:

    {},
    {},
    {}...
    

    I ended up splitting the file into smaller pieces (between 10 and 50 MB each, from the original 1 GB file), and the crawler seemed to be fine with that; a sketch of the conversion follows.
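    Roughly, the conversion and splitting looked like this, assuming the outer array still fits in memory (file names and chunk size are placeholders; chunk_size should be tuned so each piece lands around 10-50 MB):

    # Load the outer JSON array, then write it back out in pieces, each piece
    # holding comma-separated objects with no outer brackets.
    # File names and chunk size are placeholders.
    import json

    with open("data_utf8.json", "r", encoding="utf-8") as f:
        records = json.load(f)  # the whole outer array

    chunk_size = 50_000
    for i in range(0, len(records), chunk_size):
        piece = records[i:i + chunk_size]
        with open(f"part_{i // chunk_size:04d}.json", "w", encoding="utf-8") as out:
            out.write(",\n".join(json.dumps(obj) for obj in piece))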