Search code examples
jsoncsvhiveaws-glueamazon-athena

JSON / struct column type in AWS GLUE + AWS Athena / Hive?


I have a CSV file created from nested JSON. It has both regular type columns (e.g. int, string), as well as JSON columns, created from nested JSONs:

attributes;business_id;categories;city;days_open;latitude;longitude;name;review_count;stars;state
{"AcceptsInsurance": False, "AgesAllowed": "allages", "Alcohol": "beer_and_wine", "Ambience": {"casual": True, "classy": False, "divey": False, "hipster": False, "intimate": False, "romantic": False, "touristy": False, "trendy": False, "upscale": False}, "BYOB": False, "BikeParking": True, "BusinessAcceptsBitcoin": False, "BusinessAcceptsCreditCards": True, "BusinessParking": {"garage": False, "lot": False, "street": True, "valet": False, "validated": False}, "ByAppointmentOnly": False, "Caters": True, "CoatCheck": False, "Corkage": False, "DogsAllowed": False, "DriveThru": False, "GoodForDancing": False, "GoodForKids": False, "GoodForMeal": {"breakfast": False, "brunch": False, "dessert": False, "dinner": False, "latenight": False, "lunch": False}, "HappyHour": True, "HasTV": True, "Music": None, "NoiseLevel": "average", "Open24Hours": False, "OutdoorSeating": True, "RestaurantsAttire": "casual", "RestaurantsCounterService": False, "RestaurantsDelivery": False, "RestaurantsGoodForGroups": True, "RestaurantsPriceRange": 2, "RestaurantsReservations": False, "RestaurantsTableService": True, "RestaurantsTakeOut": True, "Smoking": "no", "WheelchairAccessible": True, "WiFi": "free"};6iYb2HFDywm3zjuRg0shjw;["Gastropubs", "Food", "Beer Gardens", "Restaurants", "Bars", "American (Traditional)", "Beer Bar", "Nightlife", "Breweries"];Boulder;["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"];40.0175444;-105.2833481;Oskar Blues Taproom;86;4.0;CO
{"AcceptsInsurance": False, "AgesAllowed": "allages", "Alcohol": "beer_and_wine", "Ambience": {"casual": True, "classy": False, "divey": False, "hipster": False, "intimate": False, "romantic": False, "touristy": False, "trendy": False, "upscale": False}, "BYOB": False, "BikeParking": False, "BusinessAcceptsBitcoin": False, "BusinessAcceptsCreditCards": True, "BusinessParking": {"garage": True, "lot": False, "street": False, "valet": False, "validated": False}, "ByAppointmentOnly": False, "Caters": True, "CoatCheck": False, "Corkage": False, "DogsAllowed": False, "DriveThru": False, "GoodForDancing": False, "GoodForKids": True, "GoodForMeal": {"breakfast": True, "brunch": False, "dessert": False, "dinner": False, "latenight": False, "lunch": True}, "HappyHour": False, "HasTV": False, "Music": None, "NoiseLevel": "average", "Open24Hours": False, "OutdoorSeating": False, "RestaurantsAttire": "casual", "RestaurantsCounterService": False, "RestaurantsDelivery": False, "RestaurantsGoodForGroups": False, "RestaurantsPriceRange": 2, "RestaurantsReservations": False, "RestaurantsTableService": True, "RestaurantsTakeOut": True, "Smoking": "no", "WheelchairAccessible": False, "WiFi": "free"};tCbdrRPZA0oiIYSmHG3J0w;["Salad", "Soup", "Sandwiches", "Delis", "Restaurants", "Cafes", "Vegetarian"];Portland;["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"];45.5889058992;-122.5933307507;Flying Elephants at PDX;126;4.0;OR
{"AcceptsInsurance": False, "AgesAllowed": "allages", "Alcohol": "none", "Ambience": None, "BYOB": False, "BikeParking": False, "BusinessAcceptsBitcoin": False, "BusinessAcceptsCreditCards": True, "BusinessParking": {"garage": False, "lot": False, "street": True, "valet": False, "validated": False}, "ByAppointmentOnly": False, "Caters": False, "CoatCheck": False, "Corkage": False, "DogsAllowed": False, "DriveThru": False, "GoodForDancing": False, "GoodForKids": False, "GoodForMeal": None, "HappyHour": False, "HasTV": False, "Music": None, "NoiseLevel": "average", "Open24Hours": False, "OutdoorSeating": False, "RestaurantsAttire": "casual", "RestaurantsCounterService": False, "RestaurantsDelivery": False, "RestaurantsGoodForGroups": False, "RestaurantsPriceRange": 2, "RestaurantsReservations": True, "RestaurantsTableService": True, "RestaurantsTakeOut": False, "Smoking": "no", "WheelchairAccessible": False, "WiFi": "no"};bvN78flM8NLprQ1a1y5dRg;["Antiques", "Fashion", "Used", "Vintage & Consignment", "Shopping", "Furniture Stores", "Home & Garden"];Portland;["Thursday", "Friday", "Saturday", "Sunday"];45.5119069956;-122.6136928797;The Reclaimory;13;4.5;OR

Can this file be processed with AWS Glue to be input for AWS Athena / Hive (which is used internally in Athena)? In particular, how can I specify the data type for JSON columns? Do I have to do this by hand? Are the JSONs written OK, or should they be reformatted?


Solution

  • Will try to answer all of your questions.
    Can this file be processed with AWS Glue to be input for AWS Athena / Hive (which is used internally in Athena)?
    Should be. if you structure hive table properly, any csv file can be uploaded there.

    how can I specify the data type for JSON columns?
    in hive you can store as string. Looking at your json structure, you can easily access elements using expression like this - get_json_object(json_col_str,'$.BusinessParking.garage').

    Do I have to do this by hand?
    I guess so unless you have some auto DDL creation utility. You can put sample rows in xl and find out the table structure easily.

    Are the JSONs written OK, or should they be reformatted?
    From the example you gave, i checked last row and json object seems fine to me. I also checked using https://jsonformatter.curiousconcept.com/ which validates and formats it in a pretty format. You can use it in case of any discrepancies.