Search code examples
pythonjsonpandasdictionarypython-itertools

How can I convert nested dictionary to pd.dataframe faster?


I have a json file which looks like this

{
    "file": "name",
    "main": [{
        "question_no": "Q.1",
        "question": "what is ?",
        "answer": [{
                "user": "John",
                "comment": "It is defined as",
                "value": [
                          {
                            "my_value": 5,
                            "value_2": 10
                          },
                          {
                            "my_value": 24,
                            "value_2": 30
                          }
                          ]
            },
            {
                "user": "Sam",
                "comment": "as John said above it simply means",
                "value": [
                          {
                            "my_value": 9,
                            "value_2": 10
                          },
                          {
                            "my_value": 54,
                            "value_2": 19
                          }
                          ]
            }
        ],
        "closed": "no"
    }]
}

desired result:

Question_no      question  my_value_sum      value_2_sum       user      comment
Q.1             what is ?      29                40            john    It is defined as
Q.1             what is ?      63                29            Sam     as John said above it simply means

What I have tried is data = json_normalize(file_json, "main") and then using a for loop like

for ans, row in data.iterrows():
    ....
    ....
    df = df.append(the data)

But the issue using this is that it is taking a lot of time that my client would refuse the solution. there is around 1200 items in the main list and there are 450 json files like this to convert. So this intermediate process of conversion would take almost an hour to complete.

EDIT: is it possible to get the sum of the my_value and value_2 as a column? (updated the desired result also)


Solution

  • Select dictionary by main with parameter record_path and meta:

    data = pd.json_normalize(file_json["main"], 
                             record_path='answer', 
                             meta=['question_no', 'question'])
    print (data)
       user                             comment question_no   question
    0  John                    It is defined as         Q.1  what is ?
    1   Sam  as John said above it simply means         Q.1  what is ?
    

    Then if order is important convert last N columns to first positions:

    N = 2
    data = data[data.columns[-N:].tolist() + data.columns[:-N].tolist()]
    print (data)
      question_no   question  user                             comment
    0         Q.1  what is ?  John                    It is defined as
    1         Q.1  what is ?   Sam  as John said above it simply means