I have a target variable x and some additional variables A and B. I want to calculate averages (and other statistics) of x when certain conditions for A and B are met. A real-world example would be to calculate the average air temperature (x) from a long series of measurements when solar radiation (A) and wind speed (B) fall into certain pre-defined bin ranges.

I have been able to accomplish this with loops (see example below), but I've learned that I should avoid looping over dataframes. From my research on this site I feel there is probably a much more elegant / vectorized solution using either pd.cut or np.select, but I frankly couldn't figure out how to do it.
import pandas as pd
import numpy as np
n = 100
df = pd.DataFrame(
    {
        "x": np.random.randn(n),
        "A": np.random.randn(n) + 5,
        "B": np.random.randn(n) + 10,
    }
)
df.head()
output:
x A B
0 -0.585313 6.038620 9.909762
1 0.412323 3.991826 8.836848
2 0.211713 5.019520 9.667349
3 0.710699 5.353677 9.757903
4 0.681418 4.452754 10.647738
# define bin ranges
bins_A = np.arange(3, 8)
bins_B = np.arange(8, 13)
# prepare output lists
A_mins = []
A_maxs = []
B_mins = []
B_maxs = []
x_means = []
x_stds = []
x_counts = []

# loop over bins
for i_A in range(len(bins_A) - 1):
    A_min = bins_A[i_A]
    A_max = bins_A[i_A + 1]
    for i_B in range(len(bins_B) - 1):
        B_min = bins_B[i_B]
        B_max = bins_B[i_B + 1]
        # binning conditions for current step
        conditions = np.logical_and.reduce(
            [
                df["A"] > A_min,
                df["A"] < A_max,
                df["B"] > B_min,
                df["B"] < B_max,
            ]
        )
        # calculate statistics for x and store values in lists
        x_values = df.loc[conditions, "x"]
        x_means.append(x_values.mean())
        x_stds.append(x_values.std())
        x_counts.append(x_values.count())
        A_mins.append(A_min)
        A_maxs.append(A_max)
        B_mins.append(B_min)
        B_maxs.append(B_max)
binned = pd.DataFrame(
    data={
        "A_min": A_mins,
        "A_max": A_maxs,
        "B_min": B_mins,
        "B_max": B_maxs,
        "x_mean": x_means,
        "x_std": x_stds,
        "x_count": x_counts,
    }
)
binned.head()
output:
A_min A_max B_min B_max x_mean x_std x_count
0 3 4 8 9 0.971624 0.790972 2
1 3 4 9 10 0.302795 0.380102 3
2 3 4 10 11 0.447398 1.787659 5
3 3 4 11 12 0.462149 1.195844 2
4 4 5 8 9 0.379431 0.983965 4
We will try to keep this to pandas/NumPy so that we can leverage dataframe methods, array methods, and ufuncs, vectorizing at their level. That makes it easier to extend the functionality when more complex problems need solving or more statistics need generating, as seems to be the case here.
Now, to solve the problem while keeping it close to pandas, we generate intermediate IDs or tags that jointly track where A and B fall within the given bins bins_A and bins_B respectively. One way to do so is to use searchsorted on each of the two columns separately -
tagsA = np.searchsorted(bins_A,df.A)
tagsB = np.searchsorted(bins_B,df.B)
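To see what these tags encode, here is a tiny hand-made illustration (the probe values are made up for this example): searchsorted reports the insertion position, so a tag of 0 means below the first bin edge, and a tag of len(bins) means at or beyond the last edge -

np.searchsorted(np.arange(3, 8), [2.5, 3.5, 4.2, 7.5])
# -> array([0, 1, 2, 5])
# 0 : below the first edge          -> out of bounds
# 1 : inside the first bin (3, 4)
# 2 : inside the second bin (4, 5)
# 5 : len(bins), past the last edge -> out of bounds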
Now, we are interested in the within-the-bounds cases only, hence masking is needed -
vm = (tagsB>0) & (tagsB<len(bins_B)) & (tagsA>0) & (tagsA<len(bins_A))
Let's apply this mask on the original dataframe -
dfm = df.iloc[vm].copy()   # copy, so adding the tag columns below doesn't warn
Add in the tags for the valid ones; these represent the A_min and B_min equivalents and hence will show up in the final output -
dfm['TA'] = bins_A[(tagsA-1)[vm]]
dfm['TB'] = bins_B[(tagsB-1)[vm]]
So, our tagged dataframe is ready, which can then be describe-d to get the common stats after grouping on these two tags -
df_out = dfm.groupby(['TA','TB'])['x'].describe()
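If the goal is to mirror the exact column layout of binned from the question, one way (a sketch, not the only option; note that unlike the loop version, bin combinations with no samples simply do not appear here, since the groups come from the data) -

binned_out = (
    dfm.groupby(['TA','TB'])['x']
       .agg(['mean','std','count'])
       .reset_index()
       .rename(columns={'TA':'A_min', 'TB':'B_min', 'mean':'x_mean',
                        'std':'x_std', 'count':'x_count'})
)
# right bin edges : look up the next edge in the original bin arrays
binned_out['A_max'] = bins_A[np.searchsorted(bins_A, binned_out['A_min']) + 1]
binned_out['B_max'] = bins_B[np.searchsorted(bins_B, binned_out['B_min']) + 1]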
Sample run to make things clearer, comparing against the solution posted in the question -
In [46]: np.random.seed(0)
...: n = 100
...: df = pd.DataFrame(
...: {
...: "x": np.random.randn(n),
...: "A": np.random.randn(n)+5,
...: "B": np.random.randn(n)+10
...: }
...: )
In [47]: binned
Out[47]:
A_min A_max B_min B_max x_mean x_std x_count
0 3 4 8 9 0.400199 0.719007 5
1 3 4 9 10 -0.268252 0.914784 6
2 3 4 10 11 0.458746 1.499419 5
3 3 4 11 12 0.939782 0.055092 2
4 4 5 8 9 0.238318 1.173704 5
5 4 5 9 10 -0.263020 0.815974 8
6 4 5 10 11 -0.449831 0.682148 12
7 4 5 11 12 -0.273111 1.385483 2
8 5 6 8 9 -0.438074 NaN 1
9 5 6 9 10 -0.009721 1.401260 16
10 5 6 10 11 0.467934 1.221720 11
11 5 6 11 12 0.729922 0.789260 3
12 6 7 8 9 -0.977278 NaN 1
13 6 7 9 10 0.211842 0.825401 7
14 6 7 10 11 -0.097307 0.427639 5
15 6 7 11 12 0.915971 0.195841 2
In [48]: df_out
Out[48]:
count mean std ... 50% 75% max
TA TB ...
3 8 5.0 0.400199 0.719007 ... 0.302472 0.976639 1.178780
9 6.0 -0.268252 0.914784 ... -0.001510 0.401796 0.653619
10 5.0 0.458746 1.499419 ... 0.462782 1.867558 1.895889
11 2.0 0.939782 0.055092 ... 0.939782 0.959260 0.978738
4 8 5.0 0.238318 1.173704 ... -0.212740 0.154947 2.269755
9 8.0 -0.263020 0.815974 ... -0.365103 0.449313 0.950088
10 12.0 -0.449831 0.682148 ... -0.436773 -0.009697 0.761038
11 2.0 -0.273111 1.385483 ... -0.273111 0.216731 0.706573
5 8 1.0 -0.438074 NaN ... -0.438074 -0.438074 -0.438074
9 16.0 -0.009721 1.401260 ... 0.345020 1.284173 1.950775
10 11.0 0.467934 1.221720 ... 0.156349 1.471263 2.240893
11 3.0 0.729922 0.789260 ... 1.139401 1.184846 1.230291
6 8 1.0 -0.977278 NaN ... -0.977278 -0.977278 -0.977278
9 7.0 0.211842 0.825401 ... 0.121675 0.398750 1.764052
10 5.0 -0.097307 0.427639 ... -0.103219 0.144044 0.401989
11 2.0 0.915971 0.195841 ... 0.915971 0.985211 1.054452
So, as mentioned earlier, we have our A_min and B_min equivalents in TA and TB, while the relevant statistics are captured in the other headers. Note that this is a multi-index dataframe. If we need to capture the equivalent array data, simply do df_out.loc[:,['count','mean','std']].values for the stats, and np.vstack(df_out.loc[:,['count','mean','std']].index) for the bin interval-starts.
Alternatively, to capture the equivalent stat data without describe, but using dataframe methods, we can do something like this -
dfmg = dfm.groupby(['TA','TB'])['x']
dfmg.size().unstack().values
dfmg.std().unstack().values
dfmg.mean().unstack().values
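For orientation, each of those unstack calls yields a 2D layout with the TA level down the rows and the TB level across the columns, so individual bins can be read off by their edge labels (a small usage sketch) -

mean_grid = dfmg.mean().unstack()
# rows are labeled by the A bin starts (TA), columns by the B bin starts (TB)
mean_grid.loc[3, 8]   # mean of x for the A-bin starting at 3, B-bin starting at 8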
Alternative #1 : Using pd.cut
We can also use pd.cut, as suggested in the question, in place of searchsorted for a more compact solution, since the out-of-bounds cases are handled automatically, keeping the basic idea the same -
df['TA'] = pd.cut(df['A'],bins=bins_A, labels=range(len(bins_A)-1))
df['TB'] = pd.cut(df['B'],bins=bins_B, labels=range(len(bins_B)-1))
df_out = df.groupby(['TA','TB'])['x'].describe()
So, this gives us the stats. Since the group keys are categorical, pandas includes even the empty bin combinations in the output (on newer pandas versions you may need to pass observed=False to groupby to keep that behaviour). For the A_min and B_min equivalents, simply use the index levels -
A_min = bins_A[df_out.index.get_level_values(0)]
B_min = bins_B[df_out.index.get_level_values(1)]
Or use some meshgrid method -
mA,mB = np.meshgrid(bins_A[:-1],bins_B[:-1])
A_min,B_min = mA.ravel('F'),mB.ravel('F')
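As yet another route (a sketch): if pd.cut is left with its default labels instead of the integer codes used above, the group keys become Interval objects, and the edges can be read directly off the index without touching bins_A or bins_B again -

df['TA'] = pd.cut(df['A'], bins=bins_A)   # default labels are Interval objects
df['TB'] = pd.cut(df['B'], bins=bins_B)
df_out = df.groupby(['TA','TB'])['x'].describe()
A_min = [iv.left  for iv in df_out.index.get_level_values(0)]
A_max = [iv.right for iv in df_out.index.get_level_values(0)]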
Alternative #2 : Using np.bincount
We can leverage np.bincount to get all three of those stat metric values, including the standard deviation, again in a vectorized manner -
lA,lB = len(bins_A),len(bins_B)
n = lA+1
x,A,B = df.x.values,df.A.values,df.B.values

tagsA = np.searchsorted(bins_A,A)   # 0..lA, where 0 and lA are out-of-bounds
tagsB = np.searchsorted(bins_B,B)   # 0..lB, where 0 and lB are out-of-bounds
t = tagsB*n + tagsA                 # combined 2D tag, flattened

L = n*(lB+1)                        # covers the out-of-bounds row tagsB==lB too

countT = np.bincount(t, minlength=L)
countT_x = np.bincount(t,x, minlength=L)
avg_all = countT_x/countT           # NaN (with a warning) for empty bins
# drop the out-of-bounds border rows/cols, then flatten A-major
count = countT.reshape(-1,n)[1:-1,1:-1].ravel('F')
avg = avg_all.reshape(-1,n)[1:-1,1:-1].ravel('F')

# Using numpy std definition for ddof case
ddof = 1.0 # default one for pandas std
grp_diffs = (x-avg_all[t])**2
std_all = np.sqrt(np.bincount(t,grp_diffs, minlength=L)/(countT-ddof))
stds = std_all.reshape(-1,n)[1:-1,1:-1].ravel('F')
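As a sanity check (a sketch that assumes the binned frame from the question's loop was built on the same df), the bincount route flattens in the same A-major order as the loop, so the numbers should agree up to floating-point noise -

assert np.allclose(count, binned['x_count'])
assert np.allclose(avg, binned['x_mean'], equal_nan=True)
# note: for a bin with zero samples the two disagree on std (0.0 vs NaN),
# so this last check assumes every bin has at least one sample
assert np.allclose(stds, binned['x_std'], equal_nan=True)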
Alternative #3 : Using sorting to leverage reduceat methods -

x,A,B = df.x.values,df.A.values,df.B.values
vm = (A>bins_A[0]) & (A<bins_A[-1]) & (B>bins_B[0]) & (B<bins_B[-1])

xm = x[vm]
tagsA = np.searchsorted(bins_A,A)
tagsB = np.searchsorted(bins_B,B)
tagsAB = tagsB*(tagsA.max()+1) + tagsA   # combined tag, B-major
tagsABm = tagsAB[vm]

sidx = tagsABm.argsort()
tagsAB_s = tagsABm[sidx]   # group tags, sorted
xms = xm[sidx]             # x re-ordered to match

# start index of each group in the sorted array (plus the end sentinel)
cut_idx = np.flatnonzero(np.r_[True,tagsAB_s[:-1]!=tagsAB_s[1:],True])
N = (len(bins_A)-1)*(len(bins_B)-1)   # assumes every bin is non-empty

count = np.diff(cut_idx)
avg = np.add.reduceat(xms,cut_idx[:-1])/count
stds = np.empty(N)
for ii,(s0,s1) in enumerate(zip(cut_idx[:-1],cut_idx[1:])):
    stds[ii] = np.std(xms[s0:s1], ddof=1)
To get the output in the same or a similar format as the pandas dataframe styled output, we need to reshape. Hence, use avg.reshape(-1,len(bins_A)-1).T and so on.
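For instance, a labeled 2D table of the means could be put together like so (a sketch; like the code above, it assumes every bin is non-empty) -

nA, nB = len(bins_A)-1, len(bins_B)-1
avg_grid = pd.DataFrame(avg.reshape(nB, nA).T,
                        index=bins_A[:-1], columns=bins_B[:-1])
# rows : A bin starts, columns : B bin starts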