I'm searching for an easier solution to categorize my data. Based on two columns i'm trying to find the best combination of the two which will include most cases. For both of the two columns i've set an deviation of .25 above and below the value.
To illustrate my problem you could run the following script. The script does exactly what it should do, but it feels unclean and rubbish.
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
'col2':[1,1,1,1.2,1.2,1.3,1.5,2,2,2.3,2.5,2.8,3,3,3],
'col3':[2,2.2,2.3,4.2,1.6,1.6,1.3,1.4,1.5,1.7,2.4,2.8,2.9,3,4]})
df_combinations = pd.DataFrame()
for i,r in df.iterrows():
range_min_c2,range_max_c2 = r['col2'] -.25,r['col2'] +.25
range_min_c3,range_max_c3 = r['col3'] -.25,r['col3'] +.25
dft = df.copy(deep=True)
dft['test'] = dft.apply(lambda x: 1 if range_min_c2 < x['col2'] < range_max_c2 and
range_min_c3 < x['col3'] < range_max_c3 else 0,axis=1)
sum_test = sum(dft['test'])
df_combinations = df_combinations.append(pd.DataFrame({'min_c2':[range_min_c2],'max_c2':[range_max_c2],
'min_c3':[range_min_c3],'max_c3':[range_max_c3],
'sum_test':[sum_test]}))
My question is pretty simple: Is there an easier (more pretty) way to get this output? Perhaps a predefined function? Anyway, thanks in advance!
In order to create the dataframe df_combinations
with the columns min_
and max_
for col2
and col3
, one way to do it is using pandas.DataFrame
as follows
deviation = .25
df_combinations = pd.DataFrame({'min_c2':df['col2'] -deviation,'max_c2':df['col2'] +deviation,
'min_c3':df['col3'] -deviation,'max_c3':df['col3'] +deviation})
[Out]:
min_c2 max_c2 min_c3 max_c3
0 0.75 1.25 1.75 2.25
1 0.75 1.25 1.95 2.45
2 0.75 1.25 2.05 2.55
3 0.95 1.45 3.95 4.45
4 0.95 1.45 1.35 1.85
5 1.05 1.55 1.35 1.85
6 1.25 1.75 1.05 1.55
7 1.75 2.25 1.15 1.65
8 1.75 2.25 1.25 1.75
9 2.05 2.55 1.45 1.95
10 2.25 2.75 2.15 2.65
11 2.55 3.05 2.55 3.05
12 2.75 3.25 2.65 3.15
13 2.75 3.25 2.75 3.25
14 2.75 3.25 3.75 4.25
Then, for the column sum_test
, one has a variety of options.
Using pandas.DataFrame.apply
with a custom lambda function
df_combinations['sum_test'] = df_combinations.apply(lambda x: sum((df['col2'] > x['min_c2']) &
(df['col2'] < x['max_c2']) &
(df['col3'] > x['min_c3']) &
(df['col3'] < x['max_c3'])),axis=1)
[Out]:
min_c2 max_c2 min_c3 max_c3 sum_test
0 0.75 1.25 1.75 2.25 2
1 0.75 1.25 1.95 2.45 3
2 0.75 1.25 2.05 2.55 2
3 0.95 1.45 3.95 4.45 1
4 0.95 1.45 1.35 1.85 2
5 1.05 1.55 1.35 1.85 2
6 1.25 1.75 1.05 1.55 1
7 1.75 2.25 1.15 1.65 2
8 1.75 2.25 1.25 1.75 2
9 2.05 2.55 1.45 1.95 1
10 2.25 2.75 2.15 2.65 1
11 2.55 3.05 2.55 3.05 3
12 2.75 3.25 2.65 3.15 3
13 2.75 3.25 2.75 3.25 3
14 2.75 3.25 3.75 4.25 1
Using a list comprehension:
df_combinations['sum_test'] = [sum([1 if x['min_c2'] < y['col2'] < x['max_c2'] and
x['min_c3'] < y['col3'] < x['max_c3'] else 0 for i,y in df.iterrows()]) for i,x in df_combinations.iterrows()]
[Out]:
min_c2 max_c2 min_c3 max_c3 sum_test
0 0.75 1.25 1.75 2.25 2
1 0.75 1.25 1.95 2.45 3
2 0.75 1.25 2.05 2.55 2
3 0.95 1.45 3.95 4.45 1
4 0.95 1.45 1.35 1.85 2
5 1.05 1.55 1.35 1.85 2
6 1.25 1.75 1.05 1.55 1
7 1.75 2.25 1.15 1.65 2
8 1.75 2.25 1.25 1.75 2
9 2.05 2.55 1.45 1.95 1
10 2.25 2.75 2.15 2.65 1
11 2.55 3.05 2.55 3.05 3
12 2.75 3.25 2.65 3.15 3
13 2.75 3.25 2.75 3.25 3
14 2.75 3.25 3.75 4.25 1
Notes:
.apply()
, so one might want to read this.