Search code examples
pythonpandasdataframefeather

Dataframe with Array Column to New Dataframe


Actually i have an feather file that im loading to an Dataframe.

And this Dataframe have a column with languages in each row. Like the abone one:

student_id name created_at languages
1 Foo 2019-01-03 14:30:32.146000+00:00 [{'language_id': 1, 'name': 'English', 'optin_...
2 Bar 2019-01-03 14:30:32.146000+00:00 [{'language_id': 1, 'name': 'English', 'optin_...

My question is: How can i generate an new Dataframe only with student_id column and the rest of languages array?

For example the above one:

student_id language_id language_name optin_at
1 1 English 2019-01-03T14:30:32.148Z
2 1 English 2021-05-30T00:33:02.915Z
2 2 Portuguese 2022-03-07T07:42:07.082Z

// EDIT:

Exported Dataframe as JSON (orient='records') for testing purposes:

[{"student_id":"1","name":"Foo","created_at":"2019-01-03T14:30:32.146Z","languages":[{"language_id":1,"name":"English","optin_at":"2019-01-03T14:30:32.148Z"}]},{"student_id":"2","name":"Bar","created_at":"2019-01-03T14:30:32.146Z","languages":[{"language_id":1,"name":"English","optin_at":"2021-05-30T00:33:02.915Z"},{"language_id":2,"name":"Portuguese","optin_at":"2022-03-07T07:42:07.082Z"}]}]

Solution

  • You can use explode, then convert to columns with json_normalize:

    out = (df
      .explode('languages', ignore_index=True)
      .pipe(lambda d: d.join(pd.json_normalize(d.pop('languages'))
                               .rename(columns={'name': 'language_name'})
                            ))
    )
    

    Output:

      student_id name                created_at  language_id language_name                  optin_at
    0          1  Foo  2019-01-03T14:30:32.146Z            1       English  2019-01-03T14:30:32.148Z
    1          2  Bar  2019-01-03T14:30:32.146Z            1       English  2021-05-30T00:33:02.915Z
    2          2  Bar  2019-01-03T14:30:32.146Z            2    Portuguese  2022-03-07T07:42:07.082Z