Search code examples
amazon-web-servicesaws-glue

AWS glue: ignoring spaces in JSON properties


I have a dataset with JSON files in it. Some of the entries of these JSONs have spaces in the entries like

{
    'propertyOne': 'something',
    'property Two': 'something'
}

I've had this data set crawled by several different crawlers to try and get the schema I want. For some reason on one of my crawls, the spaces were removed, but on trying to replicate the process, I cannot get the spaces to be removed and when querying in Athena I get this error

HIVE_METASTORE_ERROR: : expected at position x in 'some string' but ' ' found instead.

Position x is the position of the space between 'property' and 'Two' in the JSON entry.

I would like to just be able to exclude this field or have the space removed when crawled, but I'm not sure how. I can't change the JSON format. Any help is appreiated


Solution

  • I believe your only option, in this case, would be to create your own custom JSON classifier to select only those attributes you want the Crawler to add to the Data Catalog.

    I.e. if you want to only retrieve propertyOne you can use specify the JSONPath expression as $.propertyOne.

    Note also that your JSON should be in double quotes, the single quotes could also be causing issues when parsing the data.