Search code examples
openrefinegoogle-refine

How to save only specific JSON elements in a new OpenRefine column


{
    "business_id": "SQ0j7bgSTazkVQlF5AnqyQ",
    "full_address": "214 E Main St\nCarnegie\nCarnegie, PA 15106",
    "hours": {},
    "open": true,
    ** "categories": ["Chinese", "Restaurants"] ** ,
    "city": "Carnegie",
    "review_count": 9,
    "name": "Don Don Chinese Restaurant",
    "neighborhoods": ["Carnegie"],
    "longitude": -80.0849615,
    "state": "PA",
    "stars": 2.5,
    "latitude": 40.4083473,
    "attributes": {
        "Take-out": true,
        "Alcohol": "none",
        "Noise Level": "quiet",
        "Parking": {
            "garage": false,
            "street": false,
            "validated": false,
            "lot": false,
            "valet": false
        },
        "Delivery": true,
        "Has TV": true,
        "Outdoor Seating": false,
        "Attire": "casual",
        "Waiter Service": false,
        "Accepts Credit Cards": true,
        "Good for Kids": true,
        "Good For Groups": false,
        "Price Range": 1
    },
    "type": "business"
}

value.parseJson()['categories'] will create a new column called 'categories' in OpenRefine, but is it possible to filter and keep 'chinese' as the only value and remove any other values?


Solution

  • In the example above, the GREL expression:

    value.parseJson()['categories']
    

    results in an array containing two values:

    ["Chinese", "Restaurants"]
    

    You can manipulate this with GREL expressions that act on arrays. For example, to select the first value in the array you could use:

    value.parseJson()['categories'][0]
    

    Which would select the first entry in the array (increase the number in square brackets at the end of the expression to select other entries in the array)

    If you want to filter on a specific value in the array you could use the 'filter' expression:

    filter(value.parseJson()['categories'],v,v=="Chinese")
    

    This would result in a new array with only the word "Chinese" in it (in the above example). To store this in the new column, you need to convert the array into a string:

    filter(value.parseJson()['categories'],v,v=="Chinese").join("")
    

    To avoid issues with case-sensitivity, and the possibility of 'Chinese' appearing more than once in the 'categories' array, I'd convert the values to lowercase first and de-duplicate the array before converting to a string - so you end up with:

    filter(forEach(value.parseJson()["categories"],v,v.toLowercase()),w,w=="chinese").uniques().join("")