Search code examples
pandasmulti-index

Dataframe of lists to multiIndex dataframe


I have a dataframe of lists, each value in the list represents the mean, std, and number of values of a larger dataset. I would like to create a subindex for three values in that list.

An example dataframe is:

np.random.seed(2)
d={i: {j:[np.random.randint(10) for i in range(0,3)] for j in ['x','y','z']} for i in ['a','b','c']}
pd.DataFrame.from_dict(d,orient='index')

Which gives:

    x   y   z
a   [1, 4, 5]   [7, 4, 4]   [0, 6, 3]
b   [7, 1, 9]   [1, 3, 8]   [3, 6, 2]
c   [1, 6, 6]   [6, 5, 0]   [6, 5, 9]

I would like:

    x              y              z
    mean std count mean std count mean std count
a   1    4   5     7    4   4     0    6   3
b   7    1   9     1    3   8     3    6   2
c   1    6   6     6    5   0     6    5   9

Solution

  • You can concatenate the inner lists with numpy concatenate and numpy vstack, build the MultiIndex columns, then generate a new dataframe:

    np.random.seed(2)
    d = {
        i: {j: [np.random.randint(10) for i in range(0, 3)] for j in ["x", "y", "z"]}
        for i in ["a", "b", "c"]
    }
    df = pd.DataFrame.from_dict(d, orient="index")
    df
    
            x          y            z
    a   [8, 8, 6]   [2, 8, 7]   [2, 1, 5]
    b   [4, 4, 5]   [7, 3, 6]   [4, 3, 7]
    c   [6, 1, 3]   [5, 8, 4]   [6, 3, 9]
    
    data = np.vstack([np.concatenate(entry) for entry in df.to_numpy()])
    columns = pd.MultiIndex.from_product([df.columns, ["mean", "std", "count"]])
    pd.DataFrame(data, columns=columns, index = df.index)
    
    
                       x                 y                    z
        mean    std count   mean    std count   mean    std count
    a      8    8   6        2      8   7       2       1   5
    b      4    4   5        7      3   6       4       3   7
    c      6    1   3        5      8   4       6       3   9
    

    UPDATE : October 5, 2021

    Another option is to convert the initial dataframe to a dictionary and concatenate with pd.concat :

    outcome = {k:pd.DataFrame([*v], 
                              columns = ['mean', 'std', 'count'], 
                              index = v.index) 
               for k,v in df.items()}
    
    pd.concat(outcome, axis = 1)
     
         x              y              z          
      mean std count mean std count mean std count
    a    8   8     6    2   8     7    2   1     5
    b    4   4     5    7   3     6    4   3     7
    c    6   1     3    5   8     4    6   3     9