Search code examples
pythonpandasdataframemachine-learningdata-cleaning

Explode List containing many dictionaries in Pandas dataframe


I am having a dataset which look like follows(in dataframe):

**_id** **paper_title**   **references**                                                                  **full_text**
 1         XYZ              [{'abc':'something','def':'something'},{'def':'something'},...many others]       something
 2         XYZ              [{'abc':'something','def':'something'},{'def':'something'},...many others]       something
 3         XYZ              [{'abc':'something'},{'def':'something'},...many others]                         something

Expected:

**_id** **paper_title**   **abc**    **def**                               **full_text**
   1         XYZ          something  something                               something               
                          something  something
                          .    
                          .
                         (all the dic in list with respect to_id column)
   2         XYZ          something  something                               something               
                          something  something
                          .    
                          .
                         (all the dic in list with respect to_id column)

I have tried df['column_name'].apply(pd.Series).apply(pd.Series) to split the list and dictionaries into columns of dataframe but doesn't help as it didn't split dictionaries.

First row of my dataframe: df.head(1)


Solution

  • After a lot of Documentation reading of pandas, I found the explode method applying with apply(pd.Series) is the easiest of what I was looking for in the question.

    Here is the Code:

    df = df.explode('reference')
    

    # It explodes the lists to rows of the subset columns

    df = df['reference'].apply(pd.Series).merge(df, left_index=True, right_index=True, how ='outer')
    

    # split a list inside a Dataframe cell into rows and merge with original dataframe like (AUB) in set theory

    Sidenote: while merging look for unique values in column as there will many columns with duplicated values

    I hope this helps someone with dataframe/Series with columns having list containing multiple dictionaries and want to split list of multiple dictionaries key to new column with values as their rows.