Search code examples
pythonjsonpandasjson-normalize

parse a quite nested Json file with Pandas/Python, the json thing is now in one column of a dataframe


I have retrieved data I need in one dataframe, one of the column has this list of dict

[{'date': '2023-02-03T00:00:00', 'groups': [{'periodType': 'm', 
                                              'projections': [{'identifier': 'TD3BALMO', 'period': 'Feb 23', 'value': 54.621, 'validFrom': '2023-02-01', 'validTo': '2023-02-28', 'nextRolloverDate': '2023-02-28', 'archiveDate': '2023-02-03'}, 
                                                              {'identifier': 'TD3CURMON', 'period': 'Feb 23', 'value': 53.855, 'validFrom': '2023-02-01', 'validTo': '2023-02-28', 'nextRolloverDate': '2023-02-28', 'archiveDate': '2023-02-03'}, 
                                                              {'identifier': 'TD3+1_M', 'period': 'Mar 23', 'value': 55.387, 'validFrom': '2023-03-01', 'validTo': '2023-03-31', 'nextRolloverDate': '2023-02-28', 'archiveDate': '2023-02-03'}, 
                                                              {'identifier': 'TD3+2_M', 'period': 'Apr 23', 'value': 55.174, 'validFrom': '2023-04-01', 'validTo': '2023-04-28', 'nextRolloverDate': '2023-02-28', 'archiveDate': '2023-02-03'}, 
                                                              {'identifier': 'TD3+3_M', 'period': 'May 23', 'value': 55.748, 'validFrom': '2023-05-01', 'validTo': '2023-05-31', 'nextRolloverDate': '2023-02-28', 'archiveDate': '2023-02-03'}, 
                                                              {'identifier': 'TD3+4_M', 'period': 'Jun 23', 'value': 55.608, 'validFrom': '2023-06-01', 'validTo': '2023-06-30', 'nextRolloverDate': '2023-02-28', 'archiveDate': '2023-02-03'}, 
                                                              {'identifier': 'TD3+5_M', 'period': 'Jul 23', 'value': 52.548, 'validFrom': '2023-07-01', 'validTo': '2023-07-31', 'nextRolloverDate': '2023-02-28', 'archiveDate': '2023-02-03'}]}, 
                                            {'periodType': 'q', 
                                              'projections': [{'identifier': 'TD3CURQ', 'period': 'Q1 23', 'value': 52.638, 'validFrom': '2023-01-01', 'validTo': '2023-03-31', 'nextRolloverDate': '2023-03-31', 'archiveDate': '2023-02-03'},
                                                              {'identifier': 'TD3+1Q', 'period': 'Q2 23', 'value': 55.51, 'validFrom': '2023-04-01', 'validTo': '2023-06-30', 'nextRolloverDate': '2023-03-31', 'archiveDate': '2023-02-03'}, 
                                                              {'identifier': 'TD3+2Q', 'period': 'Q3 23', 'value': 51.729, 'validFrom': '2023-07-01', 'validTo': '2023-09-29', 'nextRolloverDate': '2023-03-31', 'archiveDate': '2023-02-03'}, 
                                                              {'identifier': 'TD3+3Q', 'period': 'Q4 23', 'value': 62.63, 'validFrom': '2023-10-01', 'validTo': '2023-12-22', 'nextRolloverDate': '2023-03-31', 'archiveDate': '2023-02-03'}]
                                              }
                                            ]
  }]

What's the easiest way to convert it to below? (sorry the numbers are not the same but you get the idea) I tried json_normalize but haven't found an efficient way to convert to below? in fact I only need data from the first 3 columns: identifier, period, value

identifier  period   value  ...     validTo nextRolloverDate archiveDate
0      TD3BALMO  Feb 23  68.464  ...  2023-02-28       2023-02-28  2023-02-21
1     TD3CURMON  Feb 23  60.955  ...  2023-02-28       2023-02-28  2023-02-21
2       TD3+1_M  Mar 23  67.128  ...  2023-03-31       2023-02-28  2023-02-21
3       TD3+2_M  Apr 23  63.499  ...  2023-04-28       2023-02-28  2023-02-21
4       TD3+3_M  May 23  59.734  ...  2023-05-31       2023-02-28  2023-02-21

Solution

  • Assuming you already loaded the JSON into python object data:

    from itertools import chain
    import pandas as pd
    
    data = [{'date': '2023-02-03T00:00:00', 'groups': [{'periodType': 'm', 
                                                  'projections': [{'identifier': 'TD3BALMO', 'period': 'Feb 23', 'value': 54.621, 'validFrom': '2023-02-01', 'validTo': '2023-02-28', 'nextRolloverDate': '2023-02-28', 'archiveDate': '2023-02-03'}, 
                                                                  {'identifier': 'TD3CURMON', 'period': 'Feb 23', 'value': 53.855, 'validFrom': '2023-02-01', 'validTo': '2023-02-28', 'nextRolloverDate': '2023-02-28', 'archiveDate': '2023-02-03'}, 
                                                                  {'identifier': 'TD3+1_M', 'period': 'Mar 23', 'value': 55.387, 'validFrom': '2023-03-01', 'validTo': '2023-03-31', 'nextRolloverDate': '2023-02-28', 'archiveDate': '2023-02-03'}, 
                                                                  {'identifier': 'TD3+2_M', 'period': 'Apr 23', 'value': 55.174, 'validFrom': '2023-04-01', 'validTo': '2023-04-28', 'nextRolloverDate': '2023-02-28', 'archiveDate': '2023-02-03'}, 
                                                                  {'identifier': 'TD3+3_M', 'period': 'May 23', 'value': 55.748, 'validFrom': '2023-05-01', 'validTo': '2023-05-31', 'nextRolloverDate': '2023-02-28', 'archiveDate': '2023-02-03'}, 
                                                                  {'identifier': 'TD3+4_M', 'period': 'Jun 23', 'value': 55.608, 'validFrom': '2023-06-01', 'validTo': '2023-06-30', 'nextRolloverDate': '2023-02-28', 'archiveDate': '2023-02-03'}, 
                                                                  {'identifier': 'TD3+5_M', 'period': 'Jul 23', 'value': 52.548, 'validFrom': '2023-07-01', 'validTo': '2023-07-31', 'nextRolloverDate': '2023-02-28', 'archiveDate': '2023-02-03'}]}, 
                                                {'periodType': 'q', 
                                                  'projections': [{'identifier': 'TD3CURQ', 'period': 'Q1 23', 'value': 52.638, 'validFrom': '2023-01-01', 'validTo': '2023-03-31', 'nextRolloverDate': '2023-03-31', 'archiveDate': '2023-02-03'},
                                                                  {'identifier': 'TD3+1Q', 'period': 'Q2 23', 'value': 55.51, 'validFrom': '2023-04-01', 'validTo': '2023-06-30', 'nextRolloverDate': '2023-03-31', 'archiveDate': '2023-02-03'}, 
                                                                  {'identifier': 'TD3+2Q', 'period': 'Q3 23', 'value': 51.729, 'validFrom': '2023-07-01', 'validTo': '2023-09-29', 'nextRolloverDate': '2023-03-31', 'archiveDate': '2023-02-03'}, 
                                                                  {'identifier': 'TD3+3Q', 'period': 'Q4 23', 'value': 62.63, 'validFrom': '2023-10-01', 'validTo': '2023-12-22', 'nextRolloverDate': '2023-03-31', 'archiveDate': '2023-02-03'}]
                                                  }
                                                ]
      }]
    
    
    proj = [item.get('projections', []) for dt in data for item in dt['groups']]
    df = pd.DataFrame(chain(*proj))
    print(df)
    

    output

       identifier  period   value   validFrom     validTo nextRolloverDate archiveDate
    0    TD3BALMO  Feb 23  54.621  2023-02-01  2023-02-28       2023-02-28  2023-02-03
    1   TD3CURMON  Feb 23  53.855  2023-02-01  2023-02-28       2023-02-28  2023-02-03
    2     TD3+1_M  Mar 23  55.387  2023-03-01  2023-03-31       2023-02-28  2023-02-03
    3     TD3+2_M  Apr 23  55.174  2023-04-01  2023-04-28       2023-02-28  2023-02-03
    4     TD3+3_M  May 23  55.748  2023-05-01  2023-05-31       2023-02-28  2023-02-03
    5     TD3+4_M  Jun 23  55.608  2023-06-01  2023-06-30       2023-02-28  2023-02-03
    6     TD3+5_M  Jul 23  52.548  2023-07-01  2023-07-31       2023-02-28  2023-02-03
    7     TD3CURQ   Q1 23  52.638  2023-01-01  2023-03-31       2023-03-31  2023-02-03
    8      TD3+1Q   Q2 23  55.510  2023-04-01  2023-06-30       2023-03-31  2023-02-03
    9      TD3+2Q   Q3 23  51.729  2023-07-01  2023-09-29       2023-03-31  2023-02-03
    10     TD3+3Q   Q4 23  62.630  2023-10-01  2023-12-22       2023-03-31  2023-02-03