I have the following question. I'm working on a pandas time series that has a histogram column whose values are sometimes positive and sometimes negative. I need to fill a new column with the local max/min value for each time window, where a window runs from the row where the sign last flipped (positive to negative, or vice versa) to the row where it flips again, so the window length is constantly changing. I also need to use a pandas or numpy method for efficiency.
I've been experimenting by making an auxiliary column with df.loc, filled with the row position of the last time the sign changed from positive to negative or vice versa, like this:
df.loc[(df.Histogram.shift(1) > 0) & (df.Histogram < 0), 'LOC'] = df.index.get_loc(df.Histogram)
df.LOC.fillna(method='ffill')
(which resulted in an error), intending to then calculate the differences between these row positions to define the current max/min window and apply df.Histogram.rolling(loc_differences).max(). But I failed to make it work: .rolling only accepts a fixed window size, and I couldn't fill the column with locations in the first place. I suspect there must be a simple solution for this. This is an example of what I'm looking for:
Date Histogram Max/Min Value
01/02/2021 0.2 0.7
02/02/2021 0.3 0.7
03/02/2021 0.7 0.7
04/02/2021 0.2 0.7
05/02/2021 -0.2 -0.5
06/02/2021 -0.5 -0.5
07/02/2021 -0.1 -0.5
08/02/2021 0.4 0.4
09/02/2021 0.3 0.4
10/02/2021 -0.2 -0.2
11/02/2021 0.2 0.7
12/02/2021 0.7 0.7
13/02/2021 0.2 0.7
14/02/2021 0.3 0.7
15/02/2021 0.6 0.7
16/02/2021 0.2 0.7
17/02/2021 -0.2 -0.5
18/02/2021 -0.5 -0.5
19/02/2021 -0.1 -0.5
20/02/2021 0.4 0.4
21/02/2021 0.3 0.4
22/02/2021 -0.2 -0.3
23/02/2021 -0.1 -0.3
24/02/2021 -0.3 -0.3
25/02/2021 -0.1 -0.3
26/02/2021 0.2 0.3
27/02/2021 0.1 0.3
28/02/2021 0.3 0.3
Is there a solution for this? Thanks in advance.
Here's a handy way to split your histogram data into groups of positive/negative values. Each time the grp column increments, the histogram column has changed sign, and all rows with the same grp value belong to the same interval between two sign changes.
df['grp'] = (df.Histogram > 0).astype(int).diff().abs().cumsum().fillna(0)
df.head(10)
Date Histogram grp
0 01/02/2021 0.2 0.0
1 02/02/2021 0.3 0.0
2 03/02/2021 0.7 0.0
3 04/02/2021 0.2 0.0
4 05/02/2021 -0.2 1.0
5 06/02/2021 -0.5 1.0
6 07/02/2021 -0.1 1.0
7 08/02/2021 0.4 2.0
8 09/02/2021 0.3 2.0
9 10/02/2021 -0.2 3.0
The choice of (df.Histogram > 0) is arbitrary and counts zeroes as negative values. A histogram sequence of 0.2, 0, 0.4, 0.3 would yield groups [0, 1, 2, 2], while a sequence of -0.2, 0, -0.4, -0.3 would yield a single group. You'll have to decide whether that is fine for your problem.
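If zeros should instead stay with the preceding run rather than break it, one option is to build the groups from np.sign with zeros forward-filled. A minimal sketch (the hist series here is made up for illustration):

```python
import numpy as np
import pandas as pd

hist = pd.Series([0.2, 0.0, 0.4, 0.3])

# (hist > 0) treats the zero as negative, so the zero starts its own group:
grp = (hist > 0).astype(int).diff().abs().cumsum().fillna(0)
print(grp.tolist())  # [0.0, 1.0, 2.0, 2.0]

# Alternative: take the sign, turn zeros into NaN and forward-fill,
# so a zero inherits the sign of the previous row:
sign = np.sign(hist).replace(0, np.nan).ffill()
grp_alt = sign.diff().ne(0).cumsum()
print(grp_alt.tolist())  # [1, 1, 1, 1] -- one group, the zero joins the run
```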
The .fillna(0) fills the NaN that .diff() produces for the first row. Zero is the right replacement value: if there was a change of signs from the first to the second row, grp would be 1 on row 2, correctly putting row 1 into its own group; if there was no change of signs, grp would be 0 on row 2, correctly grouping it with row 1.
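To see that first-row behaviour concretely, here is a tiny sketch with a sign change right after the first row (the series is made up for illustration):

```python
import pandas as pd

h = pd.Series([0.2, -0.3, -0.1])

# .diff() leaves a NaN on row 0; .fillna(0) assigns that row to group 0,
# and the sign change puts row 1 into group 1
grp = (h > 0).astype(int).diff().abs().cumsum().fillna(0)
print(grp.tolist())  # [0.0, 1.0, 1.0]
```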
You can now calculate minimum/maximum values per group using groupby with named aggregation (passing a dict of output names to a SeriesGroupBy.agg was deprecated and now raises an error):
minmax = df.groupby('grp', as_index=False)['Histogram'].agg(
    hist_min='min', hist_max='max')
df = df.merge(minmax, on='grp')
df.head(10)
Date Histogram grp hist_min hist_max
0 01/02/2021 0.2 0.0 0.2 0.7
1 02/02/2021 0.3 0.0 0.2 0.7
2 03/02/2021 0.7 0.0 0.2 0.7
3 04/02/2021 0.2 0.0 0.2 0.7
4 05/02/2021 -0.2 1.0 -0.5 -0.1
5 06/02/2021 -0.5 1.0 -0.5 -0.1
6 07/02/2021 -0.1 1.0 -0.5 -0.1
7 08/02/2021 0.4 2.0 0.3 0.4
8 09/02/2021 0.3 2.0 0.3 0.4
9 10/02/2021 -0.2 3.0 -0.2 -0.2
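As an aside, the same per-group minima/maxima can be broadcast back onto the original rows with groupby().transform, which skips the separate merge step. A sketch on a made-up frame with the grp column already computed:

```python
import pandas as pd

df = pd.DataFrame({
    'Histogram': [0.2, 0.3, 0.7, 0.2, -0.2, -0.5, -0.1],
    'grp':       [0,   0,   0,   0,   1,    1,    1],
})

# transform returns a Series aligned with df's index: one value per row,
# equal to that row's group aggregate
df['hist_min'] = df.groupby('grp')['Histogram'].transform('min')
df['hist_max'] = df.groupby('grp')['Histogram'].transform('max')
print(df['hist_max'].tolist())  # [0.7, 0.7, 0.7, 0.7, -0.1, -0.1, -0.1]
```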
Finally, you can assemble your desired values using boolean indexing:
df['minmax'] = df.hist_min
df.loc[df.Histogram > 0, 'minmax'] = df.hist_max[df.Histogram > 0]
df.head(10)
Date Histogram grp hist_min hist_max minmax
0 01/02/2021 0.2 0.0 0.2 0.7 0.7
1 02/02/2021 0.3 0.0 0.2 0.7 0.7
2 03/02/2021 0.7 0.0 0.2 0.7 0.7
3 04/02/2021 0.2 0.0 0.2 0.7 0.7
4 05/02/2021 -0.2 1.0 -0.5 -0.1 -0.5
5 06/02/2021 -0.5 1.0 -0.5 -0.1 -0.5
6 07/02/2021 -0.1 1.0 -0.5 -0.1 -0.5
7 08/02/2021 0.4 2.0 0.3 0.4 0.4
8 09/02/2021 0.3 2.0 0.3 0.4 0.4
9 10/02/2021 -0.2 3.0 -0.2 -0.2 -0.2
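The same selection can also be written in one line with np.where, picking hist_max for positive rows and hist_min otherwise. A sketch, again on a made-up frame with the helper columns already in place:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Histogram': [0.2, -0.2, 0.4],
    'hist_min':  [0.2, -0.5, 0.3],
    'hist_max':  [0.7, -0.1, 0.4],
})

# np.where(cond, a, b) selects elementwise from a where cond holds, else b
df['minmax'] = np.where(df.Histogram > 0, df.hist_max, df.hist_min)
print(df['minmax'].tolist())  # [0.7, -0.5, 0.4]
```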
The entire process is vectorized as far as possible, so performance should be decent.