Search code examples
jsonhadoopapache-pigelephantbird

Parsing complex nested JSON in Pig


I want to parse a Billionaires JSON dataset into Pig.The JSON file can be found here.

Here is what each entry has:

{
    "wealth": {
        "worth in billions": 1.2,
        "how": {
             "category": "Resource Related",
             "from emerging": true,
             "industry": "Mining and metals",
             "was political": false,
             "inherited": true,
             "was founder": true
         },
         "type": "privatized and resources"
    },
    "company": {
        "sector": "aluminum",
        "founded": 1993,
        "type": "privatization",
        "name": "Guangdong Dongyangguang Aluminum",
        "relationship": "owner"
        },
    "rank": 1372,
    "location": {
          "gdp": 0.0,
          "region": "East Asia",
          "citizenship": "China",
          "country code": "CHN"
              },
    "year": 2014,
    "demographics": {
              "gender": "male",
              "age": 50
              },
    "name": "Zhang Zhongneng"
}

Attempt 1

I tried loading this data using the following command in grunt :

billionaires = LOAD 'billionaires.json' USING JsonLoader('wealth: (worth in billions:double, how: (category:chararray, from emerging:chararray, industry:chararray, was political:chararray, inherited:chararray, was founder:chararray), type:chararray), company: (sector:chararray,founded:int,type:chararray,name:chararray,relationship:chararray),rank:int,location:(gdp:double,region:chararray,citizenship:chararray,country code:chararray), year:int, demographics: (gender:chararray,age:int), name:chararray');

This however gives me the error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'in' expecting RIGHT_PAREN

Attempt 2

Next I tried using Twitter's elephantbird project's loader called com.twitter.elephantbird.pig.load.JsonLoader. Here is the code for this UDF. This is what I did:

billionaires = LOAD 'billionaires.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
names = foreach billionaires generate json#'name' AS name;
dump names;

Now this runs and I get no errors! But nothing gets displayed. I get an output like:

Input(s): Successfully read 0 records (1445335 bytes) from: "hdfs://localhost:9000/user/purak/billionaires.json"

Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/tmp/temp-1399280624/tmp-477607570"

Counters: Total records written : 0 Total bytes written : 0 Spillable Memory Manager spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0

Job DAG: job_1478889184960_0005

What am I doing wrong here?


Solution

  • This is probably not the best way of doing this but this is what I end up doing :

    1. Remove spaces from the field names : I replaced fields like "worth in billions", "from emerging" etc with "worth_in_billions", "from_emerging" etc in the json dataset. (I did a simple 'find and replace' for this)

    2. Comma delimited json to newline delimited json : The json file that I had was of the form [{"_comment":"first entry" ...},{"_comment":"second entry" ...}]. But JsonLoader in Pig takes each newline to be a new entry. To make the json file newline delimited instead of comma I used js which is a command-line JSON processor. Install it using sudo apt-get install js and run cat billionaires.json | jq -c ".[]" > newBillionaires.json.

    3. The newBillionaires.json file now has each entry on new line. Now load this file into Pig using:

      copyFromLocal /home/purak/Desktop/newBillionaires.json /user/purak

    billionaires = LOAD 'newBillionaires.json' USING JsonLoader('name:chararray, demographics: (age:int,gender:chararray),year:int,location:(country_code:chararray,citizenship:chararray,region:chararray,gdp:double),rank:int,company: (relationship:chararray,name:chararray,type:chararray,founded:int,sector:chararray), wealth:(type:chararray,how:(was_founder:chararray,inherited:chararray,was_political:chararray,industry:chararray, from_emerging:chararray,category:chararray),worth_in_biilions:double)');

    Note: Using js reversed the order of fields in each entry. Hence in the load command all fields are in a reversed order compared to the load command in the question.

    1. You can now unnest each tuples using :

    billionairesFinal = foreach billionaires generate name, demographics.age as age, demographics.gender as gender, year, location.country_code as countryCode, location.citizenship as citizenship, location.region as region, location.gdp as gdp, rank, company.relationship as companyRelationship, company.name as companyName, company.type as companyType, company.founded as companyFounded, company.sector as companySector, wealth.type as wealthType, wealth.how.was_founder as wasFounder, wealth.how.inherited as inherited, wealth.how.was_political as wasPolitical, wealth.how.industry as industry, wealth.how.from_emerging as fromEmerging, wealth.how.category as category, wealth.worth_in_biilions as worthInBillions;

    1. Check the structure once using describe billionairesFinal; :

    billionairesFinal: {name: chararray,age: int,gender: chararray,year: int,countryCode: chararray,citizenship: chararray,region: chararray,gdp: double,rank: int,companyRelationship: chararray,companyName: chararray,companyType: chararray,companyFounded: int,companySector: chararray,wealthType: chararray,wasFounder: chararray,inherited: chararray,wasPolitical: chararray,industry: chararray,fromEmerging: chararray,category: chararray,worthInBillions: double}

    This was the intended structure of data that I wanted in Pig! Now I can ahead on and analyse the dataset :)