I have a timeseries dataframe with data from multiple sites that looks like this:
Site Date Variable
1 01/01/2021 -1
1 02/01/2021 0
1 03/01/2021 1
1 04/01/2021 0
1 05/01/2021 -1
1 06/01/2021 0
1 07/01/2021 1
1 08/01/2021 2
1 09/01/2021 1
1 10/01/2021 0
2 01/01/2021 -5
2 02/01/2021 3
2 03/01/2021 2
2 04/01/2021 6
2 05/01/2021 -3
2 06/01/2021 3
2 07/01/2021 1
2 08/01/2021 -4
2 09/01/2021 -5
2 10/01/2021 -1
The plotted data looks like this, with certain sites having a high range and others with a low range:
I'd like to find a way to categorize the data into groups of 'high' and 'low' range, for example Site 1 would be put into a category that has a range from -2 to 2. I would imagine these would have to be set manually by me, that's fine.
I've had a try with bins and dynamic bins, but from what I can tell these only categorize the individual variables, whereas I need the [Site] as a whole to be looked at and split into categories based on the full range of data within each site. In the end I need something that looks like this:
Site Date Variable Type
1 01/01/2021 -1 LOW
1 02/01/2021 0 LOW
1 03/01/2021 1 LOW
1 04/01/2021 0 LOW
1 05/01/2021 -1 LOW
1 06/01/2021 0 LOW
1 07/01/2021 1 LOW
1 08/01/2021 2 LOW
1 09/01/2021 1 LOW
1 10/01/2021 0 LOW
2 01/01/2021 -5 HIGH
2 02/01/2021 3 HIGH
2 03/01/2021 2 HIGH
2 04/01/2021 6 HIGH
2 05/01/2021 -3 HIGH
2 06/01/2021 3 HIGH
2 07/01/2021 1 HIGH
2 08/01/2021 -4 HIGH
2 09/01/2021 -5 HIGH
2 10/01/2021 -1 HIGH
You can calculate the range per group (=max-min) and define HIGH/LOW based on a threshold (I used 3 here):
df['Type'] = (df.groupby('Site')
['Variable']
.transform(lambda g: 'HIGH' if g.max()-g.min() > 3 else 'LOW')
)
output:
Site Date Variable Type
0 1 01/01/2021 -1 LOW
1 1 02/01/2021 0 LOW
2 1 03/01/2021 1 LOW
3 1 04/01/2021 0 LOW
4 1 05/01/2021 -1 LOW
5 1 06/01/2021 0 LOW
6 1 07/01/2021 1 LOW
7 1 08/01/2021 2 LOW
8 1 09/01/2021 1 LOW
9 1 10/01/2021 0 LOW
10 2 01/01/2021 -5 HIGH
11 2 02/01/2021 3 HIGH
12 2 03/01/2021 2 HIGH
13 2 04/01/2021 6 HIGH
14 2 05/01/2021 -3 HIGH
15 2 06/01/2021 3 HIGH
16 2 07/01/2021 1 HIGH
17 2 08/01/2021 -4 HIGH
18 2 09/01/2021 -5 HIGH
19 2 10/01/2021 -1 HIGH
For an arbitrary number of categories, use pandas.cut
:
df['range'] = (df.groupby('Site')['Variable']
.transform(lambda g: g.max()-g.min())
)
# group_name: upper bound
groups = {'LOW': 0, 'MEDIUM': 3, 'HIGH': 12}
df['Type'] = pd.cut(df['range'],
bins=list(groups.values())+[float('inf')],
labels=list(groups)
)
output:
Site Date Variable Type range
0 1 01/01/2021 -1 LOW 3
1 1 02/01/2021 0 LOW 3
2 1 03/01/2021 1 LOW 3
3 1 04/01/2021 0 LOW 3
4 1 05/01/2021 -1 LOW 3
5 1 06/01/2021 0 LOW 3
6 1 07/01/2021 1 LOW 3
7 1 08/01/2021 2 LOW 3
8 1 09/01/2021 1 LOW 3
9 1 10/01/2021 0 LOW 3
10 2 01/01/2021 -5 MEDIUM 11
11 2 02/01/2021 3 MEDIUM 11
12 2 03/01/2021 2 MEDIUM 11
13 2 04/01/2021 6 MEDIUM 11
14 2 05/01/2021 -3 MEDIUM 11
15 2 06/01/2021 3 MEDIUM 11
16 2 07/01/2021 1 MEDIUM 11
17 2 08/01/2021 -4 MEDIUM 11
18 2 09/01/2021 -5 MEDIUM 11
19 2 10/01/2021 -1 MEDIUM 11