pandas dataframe numpy scikit-learn sklearn-pandas

Flatten all cells from float64 arrays to int in a Pandas dataframe


I have a pandas DataFrame with 6 rows and 11 columns in which every cell contains a float64 NumPy array holding a single value.

And this is what I get after transforming the dataframe to a dictionary:

{'AO': {"W": [-0.09898120815033484],
 "X": [0.025084149326805416],
 "Y": [-0.043670609717370634],
 "Z": [-0.07389705882352943],
 "A": [-0.018586460390565218],
 "B": [-0.11756766854090006]},
'DR': {"W": [0.8163265306122449],
 "X": [1.0814940577249577],
 "Y": [0.8759551706571573],
 "Z": [0.8828522920203735],
 "A": [0.9473403118991668],
 "B": [0.7733390301217689]},
'DP': {"W": [-0.14516129032258063],
 "X": [0.05955334987593053],
 "Y": [-0.10348491287717809],
 "Z": [-0.0856079404466501],
 "A": [-0.043931563001247564],
 "B": [-0.1890928533238282]},
'PD': {"W": [-0.1255102040816326],
 "X": [0.09129967776584313],
 "Y": [-0.13698152666434293],
 "Z": [-0.03421052631578947],
 "A": [-0.0456818488984998],
 "B": [-0.1711920529801324]}}

The row index labels are W, X, Y, Z, A, and B. I want to get rid of the NumPy array wrapper in each cell and flatten the DataFrame so that every cell holds only the plain int/float value. How can I do this?


Solution

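  • Use applymap to take the first element of each cell (in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which is called the same way):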

    df = df.applymap(lambda x: x[0])
    

    df:

             AO        DR        DP        PD
    W -0.098981  0.816327 -0.145161 -0.125510
    X  0.025084  1.081494  0.059553  0.091300
    Y -0.043671  0.875955 -0.103485 -0.136982
    Z -0.073897  0.882852 -0.085608 -0.034211
    A -0.018586  0.947340 -0.043932 -0.045682
    B -0.117568  0.773339 -0.189093 -0.171192
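
As a self-contained illustration of the step above, here is a minimal sketch that rebuilds a small frame from two rows and two columns of the question's values, wraps each cell in a one-element array, and unwraps it again. The `elementwise` helper is hypothetical (it is not part of pandas or of the original answer); it only picks `DataFrame.map` when the installed pandas has it and falls back to `applymap` otherwise. The trailing `astype(float)` is an extra step added here to make the resulting dtype explicit.

```python
import numpy as np
import pandas as pd

# Two rows and two columns taken from the dictionary in the question.
values = pd.DataFrame(
    {
        "AO": [-0.09898120815033484, 0.025084149326805416],
        "DR": [0.8163265306122449, 1.0814940577249577],
    },
    index=["W", "X"],
)


def elementwise(frame, func):
    """Hypothetical helper: apply func to every cell of frame.

    Uses DataFrame.map where it exists (pandas >= 2.1) and falls back
    to DataFrame.applymap on older versions.
    """
    method = getattr(frame, "map", None) or frame.applymap
    return method(func)


# Wrap each scalar in a one-element array to mimic the question's frame.
nested = elementwise(values, lambda x: np.array([x]))

# Unwrap: take element 0 of every cell, then make the dtype explicit.
flat = elementwise(nested, lambda x: x[0]).astype(float)
print(flat)
```

If some cells could hold empty or multi-element arrays, `x[0]` would raise or silently drop values, so this only fits frames shaped like the one in the question.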
    

    Timing information via perfplot:

    [Figure: perfplot of timings]

    from itertools import chain
    
    import numpy as np
    import pandas as pd
    import perfplot
    
    np.random.seed(5)
    
    
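    def gen_data(n):
        # Build an n x 4 frame of random floats, then wrap every cell in a
        # one-element array to mimic the frame from the question.
        return pd.DataFrame(np.random.random(size=(n, 4)),
                            columns=['AO', 'DR', 'DP', 'PD']) \
            .applymap(lambda x: np.array([x]))
    
    
    def chain_comprehension(df):
        # Chain the one-element arrays of each row into a flat list of scalars.
        return pd.DataFrame([list(chain(*i)) for i in df.values], index=df.index,
                            columns=df.columns)
    
    
    def apply_map(df):
        # Replace each cell with its first (and only) element.
        return df.applymap(lambda x: x[0])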
    
    
    if __name__ == '__main__':
        out = perfplot.bench(
            setup=gen_data,
            kernels=[
                chain_comprehension,
                apply_map
            ],
            labels=[
                'chain_comprehension',
                'apply_map'
            ],
            n_range=[2 ** k for k in range(25)],
            equality_check=None
        )
        out.save('perfplot_results.png', transparent=False)
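
For readers without perfplot installed, the same two kernels can be compared roughly with the standard-library `timeit` module. This is only a sketch under stated assumptions: the frame size (1,000 rows), the repeat count, and the `elementwise` helper are choices made for illustration and are not part of the original benchmark, so the numbers it prints are not comparable to the plot above.

```python
import timeit
from itertools import chain

import numpy as np
import pandas as pd


def elementwise(frame, func):
    """Hypothetical helper: DataFrame.map if available, else applymap."""
    method = getattr(frame, "map", None) or frame.applymap
    return method(func)


# Small random frame (size chosen arbitrarily for a quick local check).
rng = np.random.default_rng(5)
plain = pd.DataFrame(rng.random((1_000, 4)), columns=["AO", "DR", "DP", "PD"])

# Wrap every cell in a one-element array, as in gen_data above.
nested = elementwise(plain, lambda x: np.array([x]))


def chain_comprehension(df):
    # Chain the one-element arrays of each row into a flat list of scalars.
    return pd.DataFrame([list(chain(*row)) for row in df.values],
                        index=df.index, columns=df.columns)


def apply_map(df):
    # Replace each cell with its first (and only) element.
    return elementwise(df, lambda x: x[0])


repeats = 5
for kernel in (chain_comprehension, apply_map):
    seconds = timeit.timeit(lambda: kernel(nested), number=repeats)
    print(f"{kernel.__name__}: {seconds / repeats:.4f} s per call")
```

Timings from a single small size say little about scaling, which is what the perfplot sweep over `n_range` is for.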