Apologies for cross-posting on topics, but I'm looking to do a variation of something I did in R, this time in Python: counting unique characters between delimiters within a single cell multiple times and then applying it to multiple rows.
My original data looks something like this:
ID Binary Continuous date1 date2 character numpy_array
101 0 0 20-01-01 21-01-01 CHAR array([[0.012110, 0.11211022, 0.110101, 0.102020102], [0.10201011202, 0.220, 0.02010, 0.10220]])
102 0 2 21-01-01 22-01-01 CHAR array([[0.0232030302, 0.22, 0.0210, 0.32101011013, 0.2011230, 0.301110, 0.10, 0.012133], [0.1032230, 0.0323101, 0.10132120, 0.102033, 0.312010, 0.11310221201, 0.2, 0.3302], [0.310330, 0.20301020, 0.210303, 0.213011030, 0.0302030332, 0.1322222, 0.1202302301, 0.33220220], [0.01010303012, 0.3010330220, 0.10303230, 0.11030, 0.130102, 0.313121, 0.2113101020, 0.121010]])
103 0 5 22-01-01 23-01-01 CHAR array([[0.11012, 0.11303013202, 0.1012012010213, 0.303320301, 0.1312, 0.0203021], [0.02310, 0.111, 0.2023, 0.01120, 0.03201120, 0.101211202], [0.22123020, 0.21010312, 0.013032020, 0.20301303233, 0.3010231323, 0.033303]])
Due to library conflicts between reticulate, RStudio, and RHEL7, I'm going to have to convert the numpy array in my data to pandas columns with positionally-aware variable names.
Something like the following:
ID Binary Continuous date1 date2 character array_dim array_lengths array_1_value_1 array_1_value_2 array_1_value_3 array_1_value_4 array_1_value_5 array_1_value_6 array_1_value_7 array_1_value_8 array_2_value_1 array_2_value_2 array_2_value_3 array_2_value_4 array_2_value_5 array_2_value_6 array_2_value_7 array_2_value_8 array_3_value_1 array_3_value_2 array_3_value_3 array_3_value_4 array_3_value_5 array_3_value_6 array_3_value_7 array_3_value_8 array_4_value_1 array_4_value_2 array_4_value_3 array_4_value_4 array_4_value_5 array_4_value_6 array_4_value_7 array_4_value_8
101 0 0 20-01-01 21-01-01 CHAR 2 [4,4] 0.012110 0.11211022 0.110101 0.102020102 NA NA NA NA 0.10201011202 0.220 0.02010 0.10220 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
102 0 2 21-01-01 22-01-01 CHAR 4 [8,8,8,8] 0.0232030302 0.22 0.0210 0.32101011013 0.2011230 0.301110 0.10 0.012133 0.1032230 0.0323101 0.10132120 0.102033 0.312010 0.11310221201 0.2 0.3302 0.310330 0.20301020 0.210303 0.213011030 0.0302030332 0.1322222 0.1202302301 0.33220220 0.01010303012 0.301033022 0.1030323 0.1103 0.130102 0.313121 0.211310102 0.12101
103 0 5 22-01-01 23-01-01 CHAR 3 [6,6,6] 0.11012 0.11303013202 0.1012012010213 0.303320301 0.1312 0.0203021 NA NA 0.02310 0.111 0.2023 0.01120 0.03201120 0.101211202 NA NA 0.22123020 0.21010312 0.013032020 0.20301303233 0.3010231323 0.033303 NA NA NA NA NA NA NA NA NA NA
Here array_dim is the number of bracketed vectors within the numpy array for that row, array_lengths is the length of each of those vectors, and the remaining pandas columns hold every value from the numpy array, positioned by its vector and its index within that vector, padded out to the widest array in the data.
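For a single row's array, that bookkeeping amounts to something like this minimal sketch (the variable names are mine, purely illustrative):

import numpy as np

arr = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
array_dim = len(arr)                   # number of bracketed vectors: 2
array_lengths = [len(v) for v in arr]  # length of each vector: [3, 3]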
Trying to flip it directly throws an error:
out=pd.Dataframe(df['numpy_array'].iloc[0])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'numpy.ndarray' object is not callable
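(For what it's worth, pandas spells the constructor pd.DataFrame, not pd.Dataframe, and it accepts a 2-D ndarray directly, so the corrected single-row conversion should be:

out = pd.DataFrame(df['numpy_array'].iloc[0])

though that still only handles one row at a time.)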
Converting it to a pd.Series does seem to work, but it requires additional transposing and flattening, which I feel could be done more efficiently (and without a loop, using apply or something) than the way I'm doing it.
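By the pd.Series route I mean something like this minimal sketch, still one row at a time, which is what I'd like to avoid:

# One row's array flattened into a long Series; it still needs transposing
# and column naming afterwards.
s = pd.Series(df['numpy_array'].iloc[0].ravel())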
Some code to generate the sample dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [101, 102, 103],
                   'Binary': [0, 0, 0],
                   'Continuous': [0, 2, 5],
                   'date1': ['20-01-01', '21-01-01', '22-01-01'],
                   'date2': ['21-01-01', '22-01-01', '23-01-01'],
                   'character': ['CHAR', 'CHAR', 'CHAR'],
                   'numpy_array': [np.array([[0.012110, 0.11211022, 0.110101, 0.102020102], [0.10201011202, 0.220, 0.02010, 0.10220]]),
                                   np.array([[0.0232030302, 0.22, 0.0210, 0.32101011013, 0.2011230, 0.301110, 0.10, 0.012133],
                                             [0.1032230, 0.0323101, 0.10132120, 0.102033, 0.312010, 0.11310221201, 0.2, 0.3302],
                                             [0.310330, 0.20301020, 0.210303, 0.213011030, 0.0302030332, 0.1322222, 0.1202302301, 0.33220220],
                                             [0.01010303012, 0.3010330220, 0.10303230, 0.11030, 0.130102, 0.313121, 0.2113101020, 0.121010]]),
                                   np.array([[0.11012, 0.11303013202, 0.1012012010213, 0.303320301, 0.1312, 0.0203021],
                                             [0.02310, 0.111, 0.2023, 0.01120, 0.03201120, 0.101211202],
                                             [0.22123020, 0.21010312, 0.013032020, 0.20301303233, 0.3010231323, 0.033303]])]})
The easy part
# Series.apply with a list of functions acts like .agg, returning one
# column per function: the sub-vector count and the array's overall shape.
df[['array_dim', 'array_lengths']] = df['numpy_array'].apply([len, lambda x: x.shape])
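Note this stores the shape tuple rather than the [4,4]-style per-vector lengths from the target layout; if you want the latter, a small variation should work (a sketch, untested on ragged inputs):

# Per-vector lengths instead of the overall shape tuple.
df['array_lengths'] = df['numpy_array'].apply(lambda x: [len(v) for v in x])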
The array expansion
# Row functions
import itertools

def new_cols(row):
    '''Generate the relevant column names for one row's array.'''
    return ['row_{}_col_{}'.format(r, c)
            for r, c in itertools.product(range(row['numpy_array'].shape[0]),
                                          range(row['numpy_array'].shape[1]))]

def new_vals(row):
    '''Flatten the numpy array.'''
    return row['numpy_array'].ravel()

# One single-row DataFrame per row, each with its own column index;
# concat takes the union of the columns and pads the gaps with NaN.
expansion = pd.concat([pd.DataFrame(index=pd.Index(data=[i]),
                                    columns=new_cols(df.loc[i]),
                                    data=[new_vals(df.loc[i])])
                       for i in df.index], axis=0)
row_0_col_0 row_0_col_1 row_0_col_2 row_0_col_3 row_1_col_0 row_1_col_1 row_1_col_2 row_1_col_3 row_0_col_4 row_0_col_5 ... row_2_col_6 row_2_col_7 row_3_col_0 row_3_col_1 row_3_col_2 row_3_col_3 row_3_col_4 row_3_col_5 row_3_col_6 row_3_col_7
0 0.012110 0.11211 0.110101 0.10202 0.102010 0.22000 0.020100 0.102200 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 0.023203 0.22000 0.021000 0.32101 0.103223 0.03231 0.101321 0.102033 0.201123 0.301110 ... 0.12023 0.332202 0.010103 0.301033 0.103032 0.1103 0.130102 0.313121 0.21131 0.12101
2 0.110120 0.11303 0.101201 0.30332 0.023100 0.11100 0.202300 0.011200 0.131200 0.020302 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Variants of the above code with iterrows or itertuples are possible.
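For example, an itertuples variant might look like this (same logic, just a different row iterator):

# Same expansion via itertuples; t.Index is the row label and
# t.numpy_array the array stored in that row.
expansion = pd.concat([pd.DataFrame(index=pd.Index(data=[t.Index]),
                                    columns=['row_{}_col_{}'.format(r, c)
                                             for r, c in itertools.product(*map(range, t.numpy_array.shape))],
                                    data=[t.numpy_array.ravel()])
                       for t in df.itertuples()], axis=0)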
Finally assemble both parts:
df_final = pd.concat([df.drop(columns='numpy_array'), expansion], axis=1)
ID Binary Continuous date1 date2 character array_dim array_lengths row_0_col_0 row_0_col_1 ... row_2_col_6 row_2_col_7 row_3_col_0 row_3_col_1 row_3_col_2 row_3_col_3 row_3_col_4 row_3_col_5 row_3_col_6 row_3_col_7
0 101 0 0 20-01-01 21-01-01 CHAR 2 (2, 4) 0.012110 0.11211 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 102 0 2 21-01-01 22-01-01 CHAR 4 (4, 8) 0.023203 0.22000 ... 0.12023 0.332202 0.010103 0.301033 0.103032 0.1103 0.130102 0.313121 0.21131 0.12101
2 103 0 5 22-01-01 23-01-01 CHAR 3 (3, 6) 0.110120 0.11303 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Giving up for now on vectorizing the dynamic column generation. The dynamic number of columns is killing my apply attempts: each row's expansion requires its own column index.
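If flat positional names are acceptable (giving up the per-sub-vector row_r_col_c scheme), one way to avoid per-row column indexes is to let the DataFrame constructor pad the ragged flattened rows, so only a single global column index is ever built. A sketch under that naming compromise:

# Flatten every array once; the constructor pads short rows with NaN,
# so one global column index covers all rows.
flat = pd.DataFrame([a.ravel().tolist() for a in df['numpy_array']], index=df.index)
flat.columns = ['value_{}'.format(i) for i in flat.columns]
df_flat = pd.concat([df.drop(columns='numpy_array'), flat], axis=1)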