Search code examples
pandasflatten

Pandas df data flattenting


I have a pandas df with 3 columns. contact as col1, profile as col2 and client as col3 Input df as below-

I have tried the answer by mozway but came across 1 more scenario where institute is None.Added that scenario too in my input


                                           contact                                               profile                                                                                                                                                           client
    0  {'id': '1', 'name': 'abc', 'country': 'US'}          {'email': 'abc@gmail.com', 'status': 'good'}                                                 {'type': '1', 'code': 'aa', 'institutes': [{'institueid': '1', 'institutename': 'mock1'}], 'sourcecode': '1111'}
    1  {'id': '2', 'name': 'xyz', 'country': 'US'}  {'email': 'hello@gmail.com', 'status': 'inprogress'}  {'type': '2', 'code': 'bb', 'institutes': [{'institueid': '2', 'institutename': 'mock2'}, {'institueid': '4', 'institutename': 'mock3'}], 'sourcecode': '2222'}
    3 {'id': '3', 'name': 'xxx', 'country': 'US'}  {'email': 'hello1@gmail.com', 'status': 'inprogress'}  {'type': '3', 'code': 'bb', 'institutes': None, 'sourcecode': '2222'}

expected output

    ```
    
      id name country            email      status type code institueid institutename sourcecode
    0  1  abc      US    abc@gmail.com        good    1   aa          1         mock1       1111
    1  2  xyz      US  hello@gmail.com  inprogress    2   bb          2         mock2       2222
    2  2  xyz      US  hello@gmail.com  inprogress    2   bb          4         mock3       2222
    3  3  xxx      US  hello@gmail.com  inprogress    3   bb          None         None      2222

```

Reproducible input:

    ```
    df = pd.DataFrame(
        {
            "contact": [
                {"id": "1", "name": "abc", "country":"US"},
                {"id": "2", "name": "xyz", "country":"US"},
                {"id": "3", "name": "xxx", "country":"US"}
            ],
            "profile": [
                {"email": "abc@gmail.com", "status": "good"},
                {"email": "hello@gmail.com", "status": "inprogress"},
                {"email": "hello1@gmail.com", "status": "inprogress"}
            ],
            "client": [
                {
                    "type": "1",
                    "code": "aa",
                    "institutes": [{"institueid": "1", "institutename": "mock1"}],
                    "sourcecode": "1111"
                },
                {
                    "type": "2",
                    "code": "bb",
                    "institutes": [
                        {"institueid": "2", "institutename": "mock2"},
                        {"institueid": "4", "institutename": "mock3"},
                    ],
                    "sourcecode": "2222"
                },
    {
                    "type": "3",
                    "code": "bb",
                    "institutes": None,
                    "sourcecode": "2222"
                }
            ],
        }
    )

```

Solution

  • Using ChainMap and pandas.json_normalize:

    from collections import ChainMap
    
    s = df.agg(lambda x: dict(ChainMap(*x)), axis=1)
    meta = list(set(s.iloc[0].keys())-{'institutes'})
    # ['code', 'country', 'email', 'id', 'name', 'sourcecode', 'status', 'type']
    
    # add missing records
    for d in s.loc[s.str['institutes'].isna()]:
        d['institutes'] = [{}]
    
    out = pd.json_normalize(s, record_path='institutes', meta=meta)
    

    Output:

      institueid institutename type sourcecode country             email code id      status name
    0          1         mock1    1       1111      US     abc@gmail.com   aa  1        good  abc
    1          2         mock2    2       2222      US   hello@gmail.com   bb  2  inprogress  xyz
    2          4         mock3    2       2222      US   hello@gmail.com   bb  2  inprogress  xyz
    3        NaN           NaN    3       2222      US  hello1@gmail.com   bb  3  inprogress  xxx