Search code examples
pythonjsonpandasjson-normalize

how to move values from pandas columns with dicts into new columns containing just the value


I had a JSON file that was heavily nested that I normalized into a Pandas dataframe. The result was that all of the keys were turned into columns and the values into rows. The problem is it appears I have some columns that still contain dicts and were not properly normalized. The dataframe has over 8000 rows and 3000 columns so this can't be done by hand.

Here is the JSON file: https://justpaste.it/9nfke

For example:

I have the following column named:

Return.ReturnData.IRS990PF.AnalysisIncomeProducingActyGrp.OtherRevenueDescribedGrp

which contains dicts, like the following:

[{'Desc': 'MISCELLANEOUS', 'ExclusionCd': '01', 'ExclusionAmt': '13'}, {'Desc': 'GRANT REFUNDS', 'RelatedOrExemptFunctionIncmAmt': '159502'}]

As you can see there's a Desc ExclusionCd ExclusionAmt etc.

I already have populated columns in my dataframes named for these:

Return.ReturnData.IRS990PF.AnalysisIncomeProducingActyGrp.OtherRevenueDescribedGrp.BusinessCd Return.ReturnData.IRS990PF.AnalysisIncomeProducingActyGrp.OtherRevenueDescribedGrp.Desc Return.ReturnData.IRS990PF.AnalysisIncomeProducingActyGrp.OtherRevenueDescribedGrp.ExclusionAmt Return.ReturnData.IRS990PF.AnalysisIncomeProducingActyGrp.OtherRevenueDescribedGrp.ExclusionCd Return.ReturnData.IRS990PF.AnalysisIncomeProducingActyGrp.OtherRevenueDescribedGrp.RelatedOrExemptFunctionIncmAmt Return.ReturnData.IRS990PF.AnalysisIncomeProducingActyGrp.OtherRevenueDescribedGrp.UnrelatedBusinessTaxblIncmAmt

How can I move those values into their respective columns? Keep in mind that I probably have hundreds of these and I can't do it by hand. It needs to be automated. Thanks!


Solution

    • The posted data shows one record, and it's not clear if the file contains a list of these records.
    • With pandas, the best method to use to parse a list of JSON records in pd.json_normalize.
      • The sample is to long to post in the answer, but note that all the keys are parsed out and there are no dicts as values in a row, when you run this locally.
    • This key Return.ReturnData.IRS990PF is not available in the PasteBin data sample.
      • Of the IRS990 keys, only IRS990 and IRS990Schedule... are available.
    • Another way to unpack values that are lists of dicts, is with the meta parameter of json_normalize
      • See this answer for a good example
    import pandas as pd
    import json
    from pathlib import Path
    
    # path to file
    p = Path('c:\some_path\test.json')
    
    # read in the JSON file
    with p.open('r', encoding='utf-8') as f:
        data = json.loads(f.read())
    
    # parse with pandas
    df = pd.json_normalize(data)
    
    # if there's a list of dictionaries in the resulting dataframe, they can be unpacked with something like
    df['Return.ReturnData.IRS990ScheduleO.SupplementalInformationDetail'].apply(pd.json_normalize)