I have a DataFrame with three columns:
df.groupby('Category')
to group by these values.At each time instance, two values are recorded: one has category "True", and the other has category "False".
Within each category group, I want to compute a number and store it in column Result for each time. Result is the percentage of values between time t-60
and t
that fall between 1 and 3.
The easiest way to accomplish this is probably to calculate the total number of values in that time interval via rolling_count
, then execute rolling_apply
to count only the values from that interval that fall between 1 and 3.
Here is my code so far:
groups = df.groupby(['Category'])
for key, grp in groups:
grp = grp.reindex(grp['Time']) # reindex by time so we can count with rolling windows
grp['total'] = pd.rolling_count(grp['Value'], window=60) # count number of values in the last 60 seconds
grp['in_interval'] = ? ## Need to count number of values where 1<v<3 in the last 60 seconds
grp['Result'] = grp['in_interval'] / grp['total'] # percentage of values between 1 and 3 in the last 60 seconds
What is the proper rolling_apply()
call to find grp['in_interval']
?
Let's work through an example:
import pandas as pd
import numpy as np
np.random.seed(1)
def setup(regular=True):
N = 10
x = np.arange(N)
a = np.arange(N)
b = np.arange(N)
if regular:
timestamps = np.linspace(0, 120, N)
else:
timestamps = np.random.uniform(0, 120, N)
df = pd.DataFrame({
'Category': [True]*N + [False]*N,
'Time': np.hstack((timestamps, timestamps)),
'Value': np.hstack((a,b))
})
return df
df = setup(regular=False)
df.sort(['Category', 'Time'], inplace=True)
So the DataFrame, df
, looks like this:
In [4]: df
Out[4]:
Category Time Value Result
12 False 0.013725 2 1.000000
15 False 11.080631 5 0.500000
14 False 17.610707 4 0.333333
16 False 22.351225 6 0.250000
13 False 36.279909 3 0.400000
17 False 41.467287 7 0.333333
18 False 47.612097 8 0.285714
10 False 50.042641 0 0.250000
19 False 64.658008 9 0.125000
11 False 86.438939 1 0.333333
2 True 0.013725 2 1.000000
5 True 11.080631 5 0.500000
4 True 17.610707 4 0.333333
6 True 22.351225 6 0.250000
3 True 36.279909 3 0.400000
7 True 41.467287 7 0.333333
8 True 47.612097 8 0.285714
0 True 50.042641 0 0.250000
9 True 64.658008 9 0.125000
1 True 86.438939 1 0.333333
Now, copying @herrfz, let's define
def between(a, b):
def between_percentage(series):
return float(len(series[(a <= series) & (series < b)])) / float(len(series))
return between_percentage
between(1,3)
is a function which takes a Series as input and returns the fraction of its elements which lie in the half-open interval [1,3)
. For example,
In [9]: series = pd.Series([1,2,3,4,5])
In [10]: between(1,3)(series)
Out[10]: 0.4
Now we are going to take our DataFrame, df
, and group by Category
:
df.groupby(['Category'])
For each group in the groupby object, we will want to apply a function:
df['Result'] = df.groupby(['Category']).apply(toeach_category)
The function, toeach_category
, will take a (sub)DataFrame as input, and return a DataFrame as output. The entire result will be assigned to a new column of df
called Result
.
Now what exactly must toeach_category
do? If we write toeach_category
like this:
def toeach_category(subf):
print(subf)
then we see each subf
is a DataFrame such as this one (when Category
is False):
Category Time Value Result
12 False 0.013725 2 1.000000
15 False 11.080631 5 0.500000
14 False 17.610707 4 0.333333
16 False 22.351225 6 0.250000
13 False 36.279909 3 0.400000
17 False 41.467287 7 0.333333
18 False 47.612097 8 0.285714
10 False 50.042641 0 0.250000
19 False 64.658008 9 0.125000
11 False 86.438939 1 0.333333
We want to take the Times column, and for each time, apply a function. That's done with applymap
:
def toeach_category(subf):
result = subf[['Time']].applymap(percentage)
The function percentage
will take a time value as input, and return a value as output. The value will be the fraction of rows with values between 1 and 3. applymap
is very strict: percentage
can not take any other arguments.
Given a time t
, we can select the Value
s from subf
whose times are in the half-open interval (t-60, t]
using the ix
method:
subf.ix[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value']
And so we can find the percentage of those Values
between 1 and 3 by applying between(1,3)
:
between(1,3)(subf.ix[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
Now remember that we want a function percentage
which takes t
as input and returns the above expression as output:
def percentage(t):
return between(1,3)(subf.ix[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
But notice that percentage
depends on subf
, and we are not allowed to pass subf
to percentage
as an argument (again, because applymap
is very strict).
So how do we get out of this jam? The solution is to define percentage
inside toeach_category
. Python's scoping rules say that a bare name like subf
is first looked for in the Local scope, then the Enclosing scope, the the Global scope, and lastly in the Builtin scope. When percentage(t)
is called, and Python encounters subf
, Python first looks in the Local scope for the value of subf
. Since subf
is not a local variable in percentage
, Python looks for it in the Enclosing scope of the function toeach_category
. It finds subf
there. Perfect. That is just what we need.
So now we have our function toeach_category
:
def toeach_category(subf):
def percentage(t):
return between(1, 3)(
subf.ix[(t - 60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
result = subf[['Time']].applymap(percentage)
return result
Putting it all together,
import pandas as pd
import numpy as np
np.random.seed(1)
def setup(regular=True):
N = 10
x = np.arange(N)
a = np.arange(N)
b = np.arange(N)
if regular:
timestamps = np.linspace(0, 120, N)
else:
timestamps = np.random.uniform(0, 120, N)
df = pd.DataFrame({
'Category': [True] * N + [False] * N,
'Time': np.hstack((timestamps, timestamps)),
'Value': np.hstack((a, b))
})
return df
def between(a, b):
def between_percentage(series):
return float(len(series[(a <= series) & (series < b)])) / float(len(series))
return between_percentage
def toeach_category(subf):
def percentage(t):
return between(1, 3)(
subf.ix[(t - 60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
result = subf[['Time']].applymap(percentage)
return result
df = setup(regular=False)
df.sort(['Category', 'Time'], inplace=True)
df['Result'] = df.groupby(['Category']).apply(toeach_category)
print(df)
yields
Category Time Value Result
12 False 0.013725 2 1.000000
15 False 11.080631 5 0.500000
14 False 17.610707 4 0.333333
16 False 22.351225 6 0.250000
13 False 36.279909 3 0.200000
17 False 41.467287 7 0.166667
18 False 47.612097 8 0.142857
10 False 50.042641 0 0.125000
19 False 64.658008 9 0.000000
11 False 86.438939 1 0.166667
2 True 0.013725 2 1.000000
5 True 11.080631 5 0.500000
4 True 17.610707 4 0.333333
6 True 22.351225 6 0.250000
3 True 36.279909 3 0.200000
7 True 41.467287 7 0.166667
8 True 47.612097 8 0.142857
0 True 50.042641 0 0.125000
9 True 64.658008 9 0.000000
1 True 86.438939 1 0.166667