Search code examples
pythonjsondataframejsonparserjson-normalize

How to normalise JSON data into seperate dataframe as rows and column in python


I normalised one big nested JSON file but still getting JSON data in column refer to below example. My JSON data is as follows.

    data = 
    [{
  "task" : "jump",
  "taskid" : "task-001",
  "task-req" : [ {
    "taskid" : "task-001",
    "requestID" : "task-001-r1",
    }, {
    "taskid" : "task-001",
    "requestID" : "task-001-r2",
    }, {
    "taskid" : "task-001",
    "requestID" : "task-001-r3",
    } ],
  "task-desc" : {
    "task-001-r1" : [{
      "taskid" : "task-001",
      "task" : "jump",
      "details" : "long jump",
    },
    "task-001-r2" : {
      "taskid" : "task-001",
      "task" : "jump",
      "details" : "short jump",
    },
    "task-001-r3" : {
      "taskid" : "task-001",
      "task" : "jump",
      "details" : "high jump",
    }]
  }
    }]

I want output like below:

task    taskid    requistID     details
jump    task-001  task-001-r1   long jump
jump    task-001  task-001-r2   short jump
jump    task-001  task-001-r3   high jump

I tried using json.normalise() but it does not convert into the format I need. It just adding a topmost column and puts JSON data as the value in the respective row. I am getting output as below using json.normalise()

task    taskid    task-req                                                  task-desc
jump    task-001  [{"taskid" : "task-002","requestID" : "task-002-r1"}  {"task-001-r1" : [{"taskid" : "task-001","task" : "jump","details" : "long jump"}]

do I have any way to do this?


Solution

  • As for me main problem can be that it is not normal nested JSON. task-req doesn't have nested desc but it has separated item task-desc. So it can't do it using only json_normalize. It may needs extra code to convert dataframe to expected format.

    But I think it can be much simpler if you use only for-loops (without json_normalize) to create normal list and later convert it to dataframe

    Your example data is incorrect. It seems you forgot some [ ] in "task-desc" so I added it. If you have different data then you should show them correctly in question.

    data = [
    {
      "task" : "jump",
      "taskid" : "task-001",
      "task-req" : [
        {
          "taskid" : "task-001",
          "requestID" : "task-001-r1",
        },
        {
          "taskid" : "task-001",
          "requestID" : "task-001-r2",
        },
        {
          "taskid" : "task-001",
          "requestID" : "task-001-r3",
        }
      ],
      "task-desc" :
      {
        "task-001-r1" : [
          {
            "taskid" : "task-001",
            "task" : "jump",
            "details" : "long jump",
          },
        ],
        "task-001-r2" : [
          {
            "taskid" : "task-001",
            "task" : "jump",
            "details" : "short jump",
          },
        ],
        "task-001-r3" : [
          {
            "taskid" : "task-001",
            "task" : "jump",
            "details" : "high jump",
          }
        ]
      }
    }
    ]
    
    # --- 
    
    print('--- rows ---')
    
    rows = []
    
    for item in data:
        task_name = item['task']
        task_id   = item['taskid'] 
    
        for req in item['task-req']:
            req_id = req['requestID']
    
            for req_desc in item['task-desc'][req_id]: 
                details = req_desc['details']
                rows.append( [name, task_id, req_id, details] )
    
                print(f'{name} | {task_id} | {req_id} | {details}')
      
    # ---
    
    import pandas as pd
    
    df = pd.DataFrame(rows, columns=['task', 'taskid', 'requistID', 'details'])
    
    print('--- dataframe ---')
    print(df)
    

    Result:

    --- rows ---
    jump | task-001 | task-001-r1 | long jump
    jump | task-001 | task-001-r2 | short jump
    jump | task-001 | task-001-r3 | high jump
    --- dataframe ---
       task    taskid    requistID     details
    0  jump  task-001  task-001-r1   long jump
    1  jump  task-001  task-001-r2  short jump
    2  jump  task-001  task-001-r3   high jump