Binning and categorising rows by a subset in pandas

I am relatively new to pandas and have a dataset which requires categorising the values by a subset (the year) of the dataset in pandas.

The dataset is one where there are individual rows for each reported date and I need to bin and categorise each of them into eight categories.

The current project is here: Scottish Crime Pandas Project where you can find the current notebook and parquet of where the dataset is currently.

Here is a sample of the dataset:

code	area	area_type	crime_category	count	count_per_10k	date_start	date_end	count_category_historic
S12000005	Clackmannanshire	Council Area	Offences: Group 8: Speeding	277	54	2013	2014	Minimal
S12000006	Dumfries and Galloway	Council Area	Offences: Group 8: Speeding	5474	364	2013	2014	Extremely High
S12000042	Dundee City	Council Area	Offences: Group 8: Speeding	1487	100	2013	2014	Very Low
S12000005	Clackmannanshire	Council Area	Offences: Group 8: Speeding	209	41	2014	2015	Minimal
S12000006	Dumfries and Galloway	Council Area	Offences: Group 8: Speeding	5478	365	2014	2015	Extremely High
S12000042	Dundee City	Council Area	Offences: Group 8: Speeding	857	58	2014	2015	Minimal

The date_start column is to be used to bin and categorise the data and we are looking to bin based on the count_per_10k column.

I have successfully categorised each row using np.linspace and pd.cut for the whole historic dataset. This has been done by:

count_10k_bins = np.linspace(min(df_crime['count_per_10k']), max(df_crime['count_per_10k']), 8)

count_10k_bins_names = ['Minimal', 'Very Low', 'Low', 'Medium', 'High', 'Very High', 'Extremely High']

df_crime['count_category_historic'] = pd.cut(df_crime['count_per_10k'], 
                                             count_10k_bins, 
                                             labels=count_10k_bins_names, 
                                             include_lowest=True)

Please see this image for a sample of the dataset - Datset Example

I cannot work out a way to categorise each individual row in the dataset based on the subset of the year it falls in. Any help would be appreciated!

Solution

Here is how you should modify your function. You did not provide enough data for the output to show any difference between count_category_historic (6 categories) and count_category (7 categories):

import pandas as pd
import numpy as np

data = {
    'code': ['S12000005', 'S12000006', 'S12000042', 'S12000005', 'S12000006', 'S12000042'],
    'area': ['Clackmannanshire', 'Dumfries and Galloway', 'Dundee City', 'Clackmannanshire', 'Dumfries and Galloway', 'Dundee City'],
    'area_type': ['Council Area', 'Council Area', 'Council Area', 'Council Area', 'Council Area', 'Council Area'],
    'crime_category': ['Offences: Group 8: Speeding'] * 6,
    'count': [277, 5474, 1487, 209, 5478, 857],
    'count_per_10k': [54, 364, 100, 41, 365, 58],
    'date_start': [2013, 2013, 2013, 2014, 2014, 2014],
    'date_end': [2014, 2014, 2014, 2015, 2015, 2015],
    'count_category_historic': ['Minimal', 'Extremely High', 'Very Low', 'Minimal', 'Extremely High', 'Minimal']
}

df_crime = pd.DataFrame(data)

def categorize_group(group):
    count_10k_bins = np.linspace(group['count_per_10k'].min(), group['count_per_10k'].max(), 8)
    count_10k_bins_names = ['Minimal', 'Very Low', 'Low', 'Medium', 'High', 'Very High', 'Extremely High']
    group['count_category'] = pd.cut(group['count_per_10k'], 
                                     bins=count_10k_bins, 
                                     labels=count_10k_bins_names, 
                                     include_lowest=True)
    return group

df_crime = df_crime.groupby('date_start').apply(categorize_group).reset_index(drop=True)

print(df_crime)

and get:

        code                   area     area_type  \
0  S12000005       Clackmannanshire  Council Area   
1  S12000006  Dumfries and Galloway  Council Area   
2  S12000042            Dundee City  Council Area   
3  S12000005       Clackmannanshire  Council Area   
4  S12000006  Dumfries and Galloway  Council Area   
5  S12000042            Dundee City  Council Area   

                crime_category  count  count_per_10k  date_start  date_end  \
0  Offences: Group 8: Speeding    277             54        2013      2014   
1  Offences: Group 8: Speeding   5474            364        2013      2014   
2  Offences: Group 8: Speeding   1487            100        2013      2014   
3  Offences: Group 8: Speeding    209             41        2014      2015   
4  Offences: Group 8: Speeding   5478            365        2014      2015   
5  Offences: Group 8: Speeding    857             58        2014      2015   

  count_category_historic  count_category  
0                 Minimal         Minimal  
1          Extremely High  Extremely High  
2                Very Low        Very Low  
3                 Minimal         Minimal  
4          Extremely High  Extremely High  
5                 Minimal         Minimal