Search code examples
jsonamazon-web-servicesetlaws-glue

AWS Glue classifier for extracting JSON array values


I have files in S3 with inline JSON per line of the structure:

{ "resources": [{"resourceType":"A","id":"A",...},{...}] }

If I run glue over it, I get "resource: array" as the top level element. I want however the elements of the array to be inspected and used as the top level table elements. All the elements per resources array will have the same schema. So I expect

resourceType: string
id: string
....

Theoretically, a custom JSON classifier should handle this:

$.resources[*]

However, the path is not picked up. So I still get the resources:array as the top level element.

I could now run some pre-processing to extract the array elements myself and write them line per line. However, I want to understand why my path is not working.

UPDATE 1:

It might be something with the JSON that I do not understand (its valid JSON produced via JAVA Jackson). If I remove the outer object with the resources attribute and change the structure to

[{"resourceType":"A","id":"A",...},{...}]

the classifier $[*] should pick the sub-objects up. But I still get array:array as top level element.

UPDATE 2:

Its indeed a formatting issue. If I change the JSON files to

[
  {"resourceType":"A","id":"A",...},{...}
]

$[*] starts to work.

UPDATE 3:

Its however not fixing the issue with $.resources[*] to reformat to

{
    "resources": [
        {"resourceType":"A","id":"A",...},{...}
    ]
}

UPDATE 4:

If I take my file and run it through a Intellij re-format, hence produce a JSON object where all nested elements have line breaks, it also starts working with $.resources[*]. Basically, like in UPDATE 3 just applied down the structure.

{
    "resources": [
        {
          "resourceType":"A",
          "id":"A"
        },
        {
          ...
        }
    ]
}

What bothers me is, that the requirements regarding the structure are still not clear to me, since UPDATE 2 worked, but not UPDATE 3. I also find nowhere in the documentation a formal requirement regarding the JSON structure.

In this sense, I think I got to the conclusion of my own question, but the systematics stay a bit unclear.


Solution

  • To conclude here:

    The issue is related to unclear documented JSON formatting requirements of Glue.

    A normalisation via json.dumps(my_json, separators=(',',':')) produces compact JSON that works for my use case.

    I normalised now the content via a lambda.

    Lambda code as reference for whomever it may help:

        s3 = boto3.client('s3')
    
        paginator = s3.get_paginator('list_objects_v2')
    
        pages = paginator.paginate(Bucket=my_bucket)
        
        for page in pages:
            try:
                contents = page["Contents"]
            except KeyError:
                break
    
            for obj in contents:
                key = obj["Key"]
                obj = s3.get_object(Bucket=my_bucket, Key=key)
                j = json.loads(obj['Body'].read().decode('utf-8'))
                
                new_json = json.dumps(j, separators=(',',':'))
                    
                target = 'nrmlzd/' + key
                    
                s3.put_object(
                    Body=new_json,
                    Bucket=my_bucket,
                    Key= target
                )