Pandas: fill missing categories with 0 when counting categorical data


I have a dataset with dates (the column could be anything, as long as it is categorical data), and I want to count the number of occurrences by day of the week (and/or by month or year, using the same logic).

I have something like this for the day of the week (in French):

import numpy as np
import pandas as pd

# this is a simulation; of course my real data is more complex ;-)
data = {'date': ["2000-01-01", "2000-05-01", "2000-11-11", "2000-11-01"], 'col_2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame.from_dict(data)

# date conversion
df['date'] = pd.to_datetime(df['date'])
df['day_of_week'] = df['date'].dt.day_name(locale = 'fr_FR')

# group by
data = df.groupby('day_of_week')['day_of_week'].agg(['count'])
data.reset_index(level=0, inplace=True)

# sort by the position of each day within the week
JOUR_SEMAINE = ["Lundi", "Mardi", "Mercredi", "Jeudi", "Vendredi", "Samedi", "Dimanche"]
MAPPING_SEMAINE = {day: i for i, day in enumerate(JOUR_SEMAINE)}

key = data['day_of_week'].map(MAPPING_SEMAINE)
data = data.iloc[key.argsort()]

# display the result
data
  day_of_week  count
0       Lundi      1
1    Mercredi      1
2      Samedi      2

This works, but how can I fill the absent days with 0 to obtain this:

  day_of_week  count
0       Lundi      1
1       Mardi      0
2    Mercredi      1
3       Jeudi      0
4    Vendredi      0
5      Samedi      2
6    Dimanche      0

More generally, I'm looking for the most automatic way to get one value for each entry of my category list (here JOUR_SEMAINE): either the computed value (count) or 0.

Does anyone have a clue?

Thanks in advance.


Solution

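  • First, I would create an empty data frame indexed by all seven weekday names. We can get the names from the standard calendar module (calendar.day_name follows the current locale, so the names are English by default).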

    import pandas as pd
    import calendar as cal
    
    weekdays = list(cal.day_name)
    weekday_count_df = pd.DataFrame(index=weekdays)
    weekday_count_df
    

    Next, we put your data into a data frame.

    data = {'date': ["2000-01-01", "2000-05-01", "2000-11-11", "2000-11-01"], 'col_2': ['a', 'b', 'c', 'd']}
    data_df = pd.DataFrame(data)
    data_df['date'] = pd.to_datetime(data_df['date'])
    data_df['day_of_week'] = data_df['date'].dt.day_name()
    data_df
    

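    Next, we count the rows per weekday and join the counts onto the full weekday index. Days with no rows come out as NaN, which fillna(0) replaces with 0.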

    data_group_df = data_df.groupby('day_of_week')['date'].count()
    weekday_count_df = weekday_count_df.join(data_group_df)
    weekday_count_df = weekday_count_df.fillna(0)
    weekday_count_df
    

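    Finally, we can fix the column name and type: the NaN values introduced by the join turned the counts into floats, so we cast them back to integers.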

    weekday_count_df = weekday_count_df.rename(columns={'date':'count'})
    weekday_count_df['count'] = weekday_count_df['count'].astype('int32')
    weekday_count_df
    

    The final result would be this:

                count
    Monday      1
    Tuesday     0
    Wednesday   1
    Thursday    0
    Friday      0
    Saturday    2
    Sunday      0
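
    As a possible shortcut for the join, fillna and astype steps above, here is a minimal sketch that counts with value_counts and then reindexes on the full category list with fill_value=0. It assumes the JOUR_SEMAINE list from the question and maps dt.dayofweek (Monday=0 ... Sunday=6) to the French names, so no fr_FR locale is needed; the names day_names, counts and result are only illustrative.

```python
import pandas as pd

# Full list of categories, in the desired order (taken from the question)
JOUR_SEMAINE = ["Lundi", "Mardi", "Mercredi", "Jeudi", "Vendredi", "Samedi", "Dimanche"]

df = pd.DataFrame(
    {"date": pd.to_datetime(["2000-01-01", "2000-05-01", "2000-11-11", "2000-11-01"])}
)

# dt.dayofweek is locale-independent: Monday=0 ... Sunday=6
day_names = df["date"].dt.dayofweek.map(dict(enumerate(JOUR_SEMAINE)))

# Count the observed days, then reindex on the full list:
# missing days are added with a count of 0 and the order follows JOUR_SEMAINE
counts = day_names.value_counts().reindex(JOUR_SEMAINE, fill_value=0)

# Back to a two-column data frame, as in the question
result = counts.rename_axis("day_of_week").reset_index(name="count")
print(result)
```

    Because fill_value=0 is an integer, the counts stay integers and no cast is needed afterwards. The same pattern should work for any category list (months, years, labels): only the list passed to reindex changes. Another option is an ordered pd.Categorical built with categories=JOUR_SEMAINE, since grouping on it with observed=False keeps the unobserved categories.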