Search code examples
pythonpandasrangecategories

Categorizing ranges of data in a dataframe using Pandas


I have a timeseries dataframe with data from multiple sites that looks like this:

Site    Date        Variable
1       01/01/2021  -1
1       02/01/2021  0
1       03/01/2021  1
1       04/01/2021  0
1       05/01/2021  -1
1       06/01/2021  0
1       07/01/2021  1
1       08/01/2021  2
1       09/01/2021  1
1       10/01/2021  0
2       01/01/2021  -5
2       02/01/2021  3
2       03/01/2021  2
2       04/01/2021  6
2       05/01/2021  -3
2       06/01/2021  3
2       07/01/2021  1
2       08/01/2021  -4
2       09/01/2021  -5
2       10/01/2021  -1

The plotted data looks like this, with certain sites having a high range and others with a low range: Plotted Data

I'd like to find a way to categorize the data into groups of 'high' and 'low' range, for example Site 1 would be put into a category that has a range from -2 to 2. I would imagine these would have to be set manually by me, that's fine.

I've had a try with bins and dynamic bins, but from what I can tell these only categorize the individual variables, whereas I need the [Site] as a whole to be looked at and split into categories based on the full range of data within each site. In the end I need something that looks like this:

Site    Date        Variable    Type
1       01/01/2021  -1          LOW
1       02/01/2021  0           LOW
1       03/01/2021  1           LOW
1       04/01/2021  0           LOW
1       05/01/2021  -1          LOW
1       06/01/2021  0           LOW
1       07/01/2021  1           LOW
1       08/01/2021  2           LOW
1       09/01/2021  1           LOW
1       10/01/2021  0           LOW
2       01/01/2021  -5          HIGH
2       02/01/2021  3           HIGH
2       03/01/2021  2           HIGH
2       04/01/2021  6           HIGH
2       05/01/2021  -3          HIGH
2       06/01/2021  3           HIGH
2       07/01/2021  1           HIGH
2       08/01/2021  -4          HIGH
2       09/01/2021  -5          HIGH
2       10/01/2021  -1          HIGH

Solution

  • You can calculate the range per group (=max-min) and define HIGH/LOW based on a threshold (I used 3 here):

    df['Type'] = (df.groupby('Site')
                    ['Variable']
                    .transform(lambda g: 'HIGH' if g.max()-g.min() > 3 else 'LOW')
                 )
    

    output:

        Site        Date  Variable  Type
    0      1  01/01/2021        -1   LOW
    1      1  02/01/2021         0   LOW
    2      1  03/01/2021         1   LOW
    3      1  04/01/2021         0   LOW
    4      1  05/01/2021        -1   LOW
    5      1  06/01/2021         0   LOW
    6      1  07/01/2021         1   LOW
    7      1  08/01/2021         2   LOW
    8      1  09/01/2021         1   LOW
    9      1  10/01/2021         0   LOW
    10     2  01/01/2021        -5  HIGH
    11     2  02/01/2021         3  HIGH
    12     2  03/01/2021         2  HIGH
    13     2  04/01/2021         6  HIGH
    14     2  05/01/2021        -3  HIGH
    15     2  06/01/2021         3  HIGH
    16     2  07/01/2021         1  HIGH
    17     2  08/01/2021        -4  HIGH
    18     2  09/01/2021        -5  HIGH
    19     2  10/01/2021        -1  HIGH
    

    For an arbitrary number of categories, use pandas.cut:

    df['range'] = (df.groupby('Site')['Variable']
                    .transform(lambda g: g.max()-g.min())
                 )
    
    # group_name: upper bound
    groups = {'LOW': 0, 'MEDIUM': 3, 'HIGH': 12}
    df['Type'] = pd.cut(df['range'],
                        bins=list(groups.values())+[float('inf')],
                        labels=list(groups)
                       )
    

    output:

        Site        Date  Variable    Type  range
    0      1  01/01/2021        -1     LOW      3
    1      1  02/01/2021         0     LOW      3
    2      1  03/01/2021         1     LOW      3
    3      1  04/01/2021         0     LOW      3
    4      1  05/01/2021        -1     LOW      3
    5      1  06/01/2021         0     LOW      3
    6      1  07/01/2021         1     LOW      3
    7      1  08/01/2021         2     LOW      3
    8      1  09/01/2021         1     LOW      3
    9      1  10/01/2021         0     LOW      3
    10     2  01/01/2021        -5  MEDIUM     11
    11     2  02/01/2021         3  MEDIUM     11
    12     2  03/01/2021         2  MEDIUM     11
    13     2  04/01/2021         6  MEDIUM     11
    14     2  05/01/2021        -3  MEDIUM     11
    15     2  06/01/2021         3  MEDIUM     11
    16     2  07/01/2021         1  MEDIUM     11
    17     2  08/01/2021        -4  MEDIUM     11
    18     2  09/01/2021        -5  MEDIUM     11
    19     2  10/01/2021        -1  MEDIUM     11