I am relatively new to pandas and have a dataset which requires categorising the values by a subset (the year) of the dataset in pandas.
The dataset is one where there are individual rows for each reported date and I need to bin and categorise each of them into eight categories.
The current project is here: Scottish Crime Pandas Project where you can find the current notebook and parquet of where the dataset is currently.
Here is a sample of the dataset:
code | area | area_type | crime_category | count | count_per_10k | date_start | date_end | count_category_historic |
---|---|---|---|---|---|---|---|---|
S12000005 | Clackmannanshire | Council Area | Offences: Group 8: Speeding | 277 | 54 | 2013 | 2014 | Minimal |
S12000006 | Dumfries and Galloway | Council Area | Offences: Group 8: Speeding | 5474 | 364 | 2013 | 2014 | Extremely High |
S12000042 | Dundee City | Council Area | Offences: Group 8: Speeding | 1487 | 100 | 2013 | 2014 | Very Low |
S12000005 | Clackmannanshire | Council Area | Offences: Group 8: Speeding | 209 | 41 | 2014 | 2015 | Minimal |
S12000006 | Dumfries and Galloway | Council Area | Offences: Group 8: Speeding | 5478 | 365 | 2014 | 2015 | Extremely High |
S12000042 | Dundee City | Council Area | Offences: Group 8: Speeding | 857 | 58 | 2014 | 2015 | Minimal |
The date_start column is to be used to bin and categorise the data and we are looking to bin based on the count_per_10k column.
I have successfully categorised each row using np.linspace
and pd.cut
for the whole historic dataset. This has been done by:
count_10k_bins = np.linspace(min(df_crime['count_per_10k']), max(df_crime['count_per_10k']), 8)
count_10k_bins_names = ['Minimal', 'Very Low', 'Low', 'Medium', 'High', 'Very High', 'Extremely High']
df_crime['count_category_historic'] = pd.cut(df_crime['count_per_10k'],
count_10k_bins,
labels=count_10k_bins_names,
include_lowest=True)
Please see this image for a sample of the dataset - Datset Example
I cannot work out a way to categorise each individual row in the dataset based on the subset of the year it falls in. Any help would be appreciated!
Here is how you should modify your function. You did not provide enough data for the output to show any difference between count_category_historic
(6 categories) and count_category
(7 categories):
import pandas as pd
import numpy as np
data = {
'code': ['S12000005', 'S12000006', 'S12000042', 'S12000005', 'S12000006', 'S12000042'],
'area': ['Clackmannanshire', 'Dumfries and Galloway', 'Dundee City', 'Clackmannanshire', 'Dumfries and Galloway', 'Dundee City'],
'area_type': ['Council Area', 'Council Area', 'Council Area', 'Council Area', 'Council Area', 'Council Area'],
'crime_category': ['Offences: Group 8: Speeding'] * 6,
'count': [277, 5474, 1487, 209, 5478, 857],
'count_per_10k': [54, 364, 100, 41, 365, 58],
'date_start': [2013, 2013, 2013, 2014, 2014, 2014],
'date_end': [2014, 2014, 2014, 2015, 2015, 2015],
'count_category_historic': ['Minimal', 'Extremely High', 'Very Low', 'Minimal', 'Extremely High', 'Minimal']
}
df_crime = pd.DataFrame(data)
def categorize_group(group):
count_10k_bins = np.linspace(group['count_per_10k'].min(), group['count_per_10k'].max(), 8)
count_10k_bins_names = ['Minimal', 'Very Low', 'Low', 'Medium', 'High', 'Very High', 'Extremely High']
group['count_category'] = pd.cut(group['count_per_10k'],
bins=count_10k_bins,
labels=count_10k_bins_names,
include_lowest=True)
return group
df_crime = df_crime.groupby('date_start').apply(categorize_group).reset_index(drop=True)
print(df_crime)
and get:
code area area_type \
0 S12000005 Clackmannanshire Council Area
1 S12000006 Dumfries and Galloway Council Area
2 S12000042 Dundee City Council Area
3 S12000005 Clackmannanshire Council Area
4 S12000006 Dumfries and Galloway Council Area
5 S12000042 Dundee City Council Area
crime_category count count_per_10k date_start date_end \
0 Offences: Group 8: Speeding 277 54 2013 2014
1 Offences: Group 8: Speeding 5474 364 2013 2014
2 Offences: Group 8: Speeding 1487 100 2013 2014
3 Offences: Group 8: Speeding 209 41 2014 2015
4 Offences: Group 8: Speeding 5478 365 2014 2015
5 Offences: Group 8: Speeding 857 58 2014 2015
count_category_historic count_category
0 Minimal Minimal
1 Extremely High Extremely High
2 Very Low Very Low
3 Minimal Minimal
4 Extremely High Extremely High
5 Minimal Minimal