Tags: python, pandas, numpy, performance, apply

Flatten a numpy array and generate pandas columns based on array position


Apologies for cross-posting across topics, but I'm looking to do in Python a variation of something I did in R: Counting unique characters between delimiters across a single Cell multiple times and then applying it to multiple rows

My original data looks something like this:

ID  Binary  Continuous  date1   date2   character   numpy_array
101 0   0   20-01-01    21-01-01    CHAR    array([[0.012110, 0.11211022, 0.110101, 0.102020102], [0.10201011202, 0.220, 0.02010, 0.10220]])
102 0   2   21-01-01    22-01-01    CHAR    array([[0.0232030302, 0.22, 0.0210, 0.32101011013, 0.2011230, 0.301110, 0.10, 0.012133], [0.1032230, 0.0323101, 0.10132120, 0.102033, 0.312010, 0.11310221201, 0.2, 0.3302], [0.310330, 0.20301020, 0.210303, 0.213011030, 0.0302030332, 0.1322222, 0.1202302301, 0.33220220], [0.01010303012, 0.3010330220, 0.10303230, 0.11030, 0.130102, 0.313121, 0.2113101020, 0.121010]])
103 0   5   22-01-01    23-01-01    CHAR    array([[0.11012, 0.11303013202, 0.1012012010213, 0.303320301, 0.1312, 0.0203021], [0.02310, 0.111, 0.2023, 0.01120, 0.03201120, 0.101211202], [0.22123020, 0.21010312, 0.013032020, 0.20301303233, 0.3010231323, 0.033303]])

Due to library conflicts between reticulate, R/Studio, and RHEL7, I'm going to have to convert the numpy array in my data into pandas columns with positionally aware variable names.

Something like the following:

ID  Binary  Continuous  date1   date2   character   array_dim   array_lengths   array_1_value_1 array_1_value_2 array_1_value_3 array_1_value_4 array_1_value_5 array_1_value_6 array_1_value_7 array_1_value_8 array_2_value_1 array_2_value_2 array_2_value_3 array_2_value_4 array_2_value_5 array_2_value_6 array_2_value_7 array_2_value_8 array_3_value_1 array_3_value_2 array_3_value_3 array_3_value_4 array_3_value_5 array_3_value_6 array_3_value_7 array_3_value_8 array_4_value_1 array_4_value_2 array_4_value_3 array_4_value_4 array_4_value_5 array_4_value_6 array_4_value_7 array_4_value_8
101 0   0   20-01-01    21-01-01    CHAR    2   [4,4]   0.012110    0.11211022  0.110101    0.102020102 NA  NA  NA  NA  0.10201011202   0.220   0.02010 0.10220 NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
102 0   2   21-01-01    22-01-01    CHAR    4   [8,8,8,8]   0.0232030302    0.22    0.0210  0.32101011013   0.2011230   0.301110    0.10    0.012133    0.1032230   0.0323101   0.10132120  0.102033    0.312010    0.11310221201   0.2 0.3302  0.310330    0.20301020  0.210303    0.213011030 0.0302030332    0.1322222   0.1202302301    0.33220220  0.01010303012   0.301033022 0.1030323   0.1103  0.130102    0.313121    0.211310102 0.12101
103 0   5   22-01-01    23-01-01    CHAR    3   [6,6,6] 0.11012 0.11303013202   0.1012012010213 0.303320301 0.1312  0.0203021   NA  NA  0.02310 0.111   0.2023  0.01120 0.03201120  0.101211202 NA  NA  0.22123020  0.21010312  0.013032020 0.20301303233   0.3010231323    0.033303    NA  NA  NA  NA  NA  NA  NA  NA  NA  NA

Where array_dim is the number of inner [] vectors within the numpy array for that row, array_lengths is the length of each of those vectors, and the remaining pandas columns hold every value from the numpy array, named by its position within the array and padded out to the size of the largest array in the data.
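To make the two bookkeeping columns concrete, here is a small sketch (using row 101's array from the table above) of what array_dim and array_lengths would hold:

```python
import numpy as np

# Row 101's array from the table above
arr = np.array([[0.012110, 0.11211022, 0.110101, 0.102020102],
                [0.10201011202, 0.220, 0.02010, 0.10220]])

array_dim = len(arr)                   # number of inner vectors -> 2
array_lengths = [len(v) for v in arr]  # length of each inner vector -> [4, 4]
```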

Trying to convert the array directly into a dataframe throws an error:

out=pd.Dataframe(df['numpy_array'].iloc[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'numpy.ndarray' object is not callable

Converting it to a pd.Series does seem to work, but it requires additional transposing and flattening, which I suspect could be done more efficiently (ideally without a loop, using apply or something similar) than the way I'm doing it.
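For reference, the Series route I'm describing looks roughly like this (a sketch on row 101's array; the extra reshaping is what I'd like to avoid):

```python
import numpy as np
import pandas as pd

# Row 101's array
arr = np.array([[0.012110, 0.11211022, 0.110101, 0.102020102],
                [0.10201011202, 0.220, 0.02010, 0.10220]])

# Flatten row-by-row, then transpose the Series into a single wide row
wide = pd.Series(arr.ravel()).to_frame().T
```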

Some code to generate the sample dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':  [101,102,103],
        'Binary': [0,0,0],
        'Continuous': [0,2,5],
        'date1': ['20-01-01', '21-01-01', '22-01-01'],
        'date2': ['21-01-01', '22-01-01', '23-01-01'],
        'character': ['CHAR', 'CHAR', 'CHAR'],
        'numpy_array': [np.array([[0.012110, 0.11211022, 0.110101, 0.102020102], [0.10201011202, 0.220, 0.02010, 0.10220]]),np.array([[0.0232030302, 0.22, 0.0210, 0.32101011013, 0.2011230, 0.301110, 0.10, 0.012133], [0.1032230, 0.0323101, 0.10132120, 0.102033, 0.312010, 0.11310221201, 0.2, 0.3302], [0.310330, 0.20301020, 0.210303, 0.213011030, 0.0302030332, 0.1322222, 0.1202302301, 0.33220220], [0.01010303012, 0.3010330220, 0.10303230, 0.11030, 0.130102, 0.313121, 0.2113101020, 0.121010]]),np.array([[0.11012, 0.11303013202, 0.1012012010213, 0.303320301, 0.1312, 0.0203021], [0.02310, 0.111, 0.2023, 0.01120, 0.03201120, 0.101211202], [0.22123020, 0.21010312, 0.013032020, 0.20301303233, 0.3010231323, 0.033303]])]
        })

Solution

  • The easy part

    df[['array_dim',
        'array_lengths']] = df['numpy_array'].apply([len, 
                                                     lambda x: x.shape])
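Note that x.shape gives the overall shape tuple, e.g. (2, 4), rather than the per-vector list [4, 4] shown in the desired output. If the latter is wanted, a small variation would be (a sketch with toy arrays, keeping the same column names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'numpy_array': [np.zeros((2, 4)), np.zeros((3, 6))]})

df['array_dim'] = df['numpy_array'].apply(len)
# One length per inner vector, e.g. [4, 4] for a (2, 4) array
df['array_lengths'] = df['numpy_array'].apply(lambda a: [len(v) for v in a])
```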
    

The array expansion

import itertools

# Row functions
    def new_cols(row):
        '''Generate the relevant columns names'''
        return ['row_{}_col_{}'.format(r,c) for r,c in itertools.product(range(row['numpy_array'].shape[0]), 
                                                                         range(row['numpy_array'].shape[1]))]
    def new_vals(row):
        '''Flatten the numpy array'''
        return row['numpy_array'].ravel()
    
    expansion = pd.concat([pd.DataFrame(index   = pd.Index(data=[i]),
                                        columns = new_cols(df.loc[i]),
                                        data    = [new_vals(df.loc[i])]) 
                           for i in df.index], axis=0)
    
       row_0_col_0  row_0_col_1  row_0_col_2  row_0_col_3  row_1_col_0  row_1_col_1  row_1_col_2  row_1_col_3  row_0_col_4  row_0_col_5  ...  row_2_col_6  row_2_col_7  row_3_col_0  row_3_col_1  row_3_col_2  row_3_col_3  row_3_col_4  row_3_col_5  row_3_col_6  row_3_col_7
    0     0.012110      0.11211     0.110101      0.10202     0.102010      0.22000     0.020100     0.102200          NaN          NaN  ...          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN
    1     0.023203      0.22000     0.021000      0.32101     0.103223      0.03231     0.101321     0.102033     0.201123     0.301110  ...      0.12023     0.332202     0.010103     0.301033     0.103032       0.1103     0.130102     0.313121      0.21131      0.12101
    2     0.110120      0.11303     0.101201      0.30332     0.023100      0.11100     0.202300     0.011200     0.131200     0.020302  ...          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN
    

    Variants of the above code with iterrows or itertuples are possible.
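For example, an itertuples variant of the same expansion (identical logic, different iteration API; the toy arrays here are placeholders) could look like:

```python
import itertools

import numpy as np
import pandas as pd

df = pd.DataFrame({'numpy_array': [np.arange(8.0).reshape(2, 4),
                                   np.arange(6.0).reshape(3, 2)]})

frames = []
for row in df.itertuples():
    a = row.numpy_array
    # Column names match the array's own (row, col) positions
    cols = ['row_{}_col_{}'.format(r, c)
            for r, c in itertools.product(range(a.shape[0]),
                                          range(a.shape[1]))]
    frames.append(pd.DataFrame([a.ravel()], index=[row.Index], columns=cols))

# concat unions the per-row column indexes, filling gaps with NaN
expansion = pd.concat(frames, axis=0)
```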

    Finally assemble both parts:

    df_final = pd.concat([df.drop(columns = 'numpy_array'), expansion], axis=1)
    
        ID  Binary  Continuous     date1     date2 character  array_dim array_lengths  row_0_col_0  row_0_col_1  ...  row_2_col_6  row_2_col_7  row_3_col_0  row_3_col_1  row_3_col_2  row_3_col_3  row_3_col_4  row_3_col_5  row_3_col_6  row_3_col_7
    0  101       0           0  20-01-01  21-01-01      CHAR          2        (2, 4)     0.012110      0.11211  ...          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN
    1  102       0           2  21-01-01  22-01-01      CHAR          4        (4, 8)     0.023203      0.22000  ...      0.12023     0.332202     0.010103     0.301033     0.103032       0.1103     0.130102     0.313121      0.21131      0.12101
    2  103       0           5  22-01-01  23-01-01      CHAR          3        (3, 6)     0.110120      0.11303  ...          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN
    

For now I'm giving up on a loop-free, apply-based version of the dynamic column generation. The dynamic number of columns is killing my apply attempts: each row expansion requires its own column index.
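One way around the per-row column index (a sketch with toy arrays, not tested against the real data) is to go through a long format once and let unstack build the union column index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'numpy_array': [np.arange(8.0).reshape(2, 4),
                                   np.arange(6.0).reshape(3, 2)]})

# One record per scalar, keyed by (dataframe index, row, col)
records = [(i, r, c, a[r, c])
           for i, a in df['numpy_array'].items()
           for r in range(a.shape[0])
           for c in range(a.shape[1])]
long = pd.DataFrame(records, columns=['idx', 'row', 'col', 'value'])

# Pivot once; pandas builds the union column index and the NaN padding
wide = long.set_index(['idx', 'row', 'col'])['value'].unstack(['row', 'col'])
wide.columns = ['row_{}_col_{}'.format(r, c) for r, c in wide.columns]
```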