Search code examples
pythonpandas

Binning and categorising rows by a subset in pandas


I am relatively new to pandas and have a dataset which requires categorising the values by a subset (the year) of the dataset in pandas.

The dataset is one where there are individual rows for each reported date and I need to bin and categorise each of them into eight categories.

The current project is here: Scottish Crime Pandas Project where you can find the current notebook and parquet of where the dataset is currently.

Here is a sample of the dataset:

code area area_type crime_category count count_per_10k date_start date_end count_category_historic
S12000005 Clackmannanshire Council Area Offences: Group 8: Speeding 277 54 2013 2014 Minimal
S12000006 Dumfries and Galloway Council Area Offences: Group 8: Speeding 5474 364 2013 2014 Extremely High
S12000042 Dundee City Council Area Offences: Group 8: Speeding 1487 100 2013 2014 Very Low
S12000005 Clackmannanshire Council Area Offences: Group 8: Speeding 209 41 2014 2015 Minimal
S12000006 Dumfries and Galloway Council Area Offences: Group 8: Speeding 5478 365 2014 2015 Extremely High
S12000042 Dundee City Council Area Offences: Group 8: Speeding 857 58 2014 2015 Minimal

The date_start column is to be used to bin and categorise the data and we are looking to bin based on the count_per_10k column.

I have successfully categorised each row using np.linspace and pd.cut for the whole historic dataset. This has been done by:

count_10k_bins = np.linspace(min(df_crime['count_per_10k']), max(df_crime['count_per_10k']), 8)

count_10k_bins_names = ['Minimal', 'Very Low', 'Low', 'Medium', 'High', 'Very High', 'Extremely High']

df_crime['count_category_historic'] = pd.cut(df_crime['count_per_10k'], 
                                             count_10k_bins, 
                                             labels=count_10k_bins_names, 
                                             include_lowest=True)

Please see this image for a sample of the dataset - Datset Example

I cannot work out a way to categorise each individual row in the dataset based on the subset of the year it falls in. Any help would be appreciated!


Solution

  • Here is how you should modify your function. You did not provide enough data for the output to show any difference between count_category_historic (6 categories) and count_category (7 categories):

    import pandas as pd
    import numpy as np
    
    data = {
        'code': ['S12000005', 'S12000006', 'S12000042', 'S12000005', 'S12000006', 'S12000042'],
        'area': ['Clackmannanshire', 'Dumfries and Galloway', 'Dundee City', 'Clackmannanshire', 'Dumfries and Galloway', 'Dundee City'],
        'area_type': ['Council Area', 'Council Area', 'Council Area', 'Council Area', 'Council Area', 'Council Area'],
        'crime_category': ['Offences: Group 8: Speeding'] * 6,
        'count': [277, 5474, 1487, 209, 5478, 857],
        'count_per_10k': [54, 364, 100, 41, 365, 58],
        'date_start': [2013, 2013, 2013, 2014, 2014, 2014],
        'date_end': [2014, 2014, 2014, 2015, 2015, 2015],
        'count_category_historic': ['Minimal', 'Extremely High', 'Very Low', 'Minimal', 'Extremely High', 'Minimal']
    }
    
    df_crime = pd.DataFrame(data)
    
    def categorize_group(group):
        count_10k_bins = np.linspace(group['count_per_10k'].min(), group['count_per_10k'].max(), 8)
        count_10k_bins_names = ['Minimal', 'Very Low', 'Low', 'Medium', 'High', 'Very High', 'Extremely High']
        group['count_category'] = pd.cut(group['count_per_10k'], 
                                         bins=count_10k_bins, 
                                         labels=count_10k_bins_names, 
                                         include_lowest=True)
        return group
    
    df_crime = df_crime.groupby('date_start').apply(categorize_group).reset_index(drop=True)
    
    print(df_crime)
    

    and get:

            code                   area     area_type  \
    0  S12000005       Clackmannanshire  Council Area   
    1  S12000006  Dumfries and Galloway  Council Area   
    2  S12000042            Dundee City  Council Area   
    3  S12000005       Clackmannanshire  Council Area   
    4  S12000006  Dumfries and Galloway  Council Area   
    5  S12000042            Dundee City  Council Area   
    
                    crime_category  count  count_per_10k  date_start  date_end  \
    0  Offences: Group 8: Speeding    277             54        2013      2014   
    1  Offences: Group 8: Speeding   5474            364        2013      2014   
    2  Offences: Group 8: Speeding   1487            100        2013      2014   
    3  Offences: Group 8: Speeding    209             41        2014      2015   
    4  Offences: Group 8: Speeding   5478            365        2014      2015   
    5  Offences: Group 8: Speeding    857             58        2014      2015   
    
      count_category_historic  count_category  
    0                 Minimal         Minimal  
    1          Extremely High  Extremely High  
    2                Very Low        Very Low  
    3                 Minimal         Minimal  
    4          Extremely High  Extremely High  
    5                 Minimal         Minimal