
pandas combine nested dataframes into one single dataframe


I have a dataframe in which one column (we'll call it info) holds another dataframe in every cell. I want to loop through all the rows in this column and stack the nested dataframes on top of each other, since they all have the same columns.

How would I go about this?


Solution

  • You could try as follows:

    import pandas as pd
    
    length = 5
    
    # build some example dataframes to nest
    nested_dfs = [pd.DataFrame({'a': list(range(length)),
                                'b': list(range(length))}) for _ in range(length)]
    
    print(nested_dfs[0])
    
       a  b
    0  0  0
    1  1  1
    2  2  2
    3  3  3
    4  4  4
    
    # df with nested_dfs in info
    df = pd.DataFrame({'info_col': nested_dfs})
    
    # collect the nested dataframes into a list and stack them row-wise
    lst_dfs = df['info_col'].values.tolist()
    df_final = pd.concat(lst_dfs, axis=0, ignore_index=True)
    
    print(df_final.tail())
    
        a  b
    20  0  0
    21  1  1
    22  2  2
    23  3  3
    24  4  4
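
The same steps can be wrapped in a small helper. This is only a sketch: the name `stack_nested` and its `check_columns` parameter are hypothetical (not part of pandas), and the column check is an added assumption based on the question's statement that all nested dataframes share the same columns.

```python
import pandas as pd


def stack_nested(df, col, check_columns=True):
    """Stack the DataFrames stored in df[col] into a single DataFrame.

    Hypothetical helper for illustration; not a pandas function.
    """
    frames = df[col].tolist()
    if check_columns and frames:
        expected = list(frames[0].columns)
        for pos, frame in enumerate(frames):
            # Fail early if a nested frame does not match the first one
            if list(frame.columns) != expected:
                raise ValueError(
                    f"nested frame at position {pos} has columns "
                    f"{list(frame.columns)}, expected {expected}"
                )
    # Same call as in the answer: stack row-wise and renumber the index
    return pd.concat(frames, axis=0, ignore_index=True)


# Example data (made up): 4 nested frames with 3 rows each
nested = [pd.DataFrame({'a': range(3), 'b': range(3)}) for _ in range(4)]
outer = pd.DataFrame({'info_col': nested})

stacked = stack_nested(outer, 'info_col')
print(stacked.shape)
```

Raising on mismatched columns is a design choice for this sketch; without the check, pd.concat would still run and fill the missing columns with NaN.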
    

    This method should be a bit faster than the solution offered by nandoquintana, which also works.


    Incidentally, it is ill-advised to name a df column info, because df.info is already a method of the dataframe. Normally, df['col_name'].values.tolist() can also be written as df.col_name.values.tolist(). With a column named info, however, df.info returns the method rather than the column, so df.info.values.tolist() raises an error:

    AttributeError: 'function' object has no attribute 'values'
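
To make the clash concrete, here is a minimal sketch in which the column really is named info. The frame contents are made up for illustration. Attribute access finds the method, while bracket access returns the column and can be used for the concatenation as before.

```python
import pandas as pd

# Example data (made up): three one-row frames nested in a column named 'info'
nested = [pd.DataFrame({'a': [i], 'b': [i * 10]}) for i in range(3)]
df = pd.DataFrame({'info': nested})

# Attribute access resolves to the DataFrame.info method, not the column
print(callable(df.info))

# Bracket access returns the column as a Series
col = df['info']
print(type(col))

# Stacking works as long as the column is reached with brackets
df_final = pd.concat(col.tolist(), ignore_index=True)
print(df_final)
```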
    

    You also run the risk of overwriting the method if you try to assign values to your column through attribute access, which is probably not what you want. E.g.:

    print(type(df.info))
    <class 'method'>
    
    df.info = 1
    
    # no column is affected; the method is simply replaced by an int attribute
    print(type(df.info))
    <class 'int'>
    
    # but:
    df['info']=1
    
    # your column now has all 1's
    print(type(df['info']))
    <class 'pandas.core.series.Series'>
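
One way to spot such names before they cause trouble is to compare the column labels with the attributes that DataFrame already defines. This is only a sketch; `shadowed_columns` is a hypothetical helper, not a pandas function, and the example frame is made up.

```python
import pandas as pd


def shadowed_columns(df):
    """Return column labels that collide with existing DataFrame attributes.

    Hypothetical helper for illustration; not part of pandas.
    """
    return [c for c in df.columns
            if isinstance(c, str) and hasattr(pd.DataFrame, c)]


# 'info' and 'count' are DataFrame methods; 'info_col' is not an attribute
df = pd.DataFrame({'info': [1, 2], 'count': [3, 4], 'info_col': [5, 6]})
print(shadowed_columns(df))
```

Columns reported by such a check can still be reached with df['name'], but not reliably with df.name.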