Search code examples
pythonpandasdataframemulti-index

How to retrieve unnamed columns after a groupby and unstack?


I have a dataset of events with a date column which I need to display in a weekly plot and do some more data processing afterwards. After some googling I found pd.Grouper(freq="W") so I am using that to group the events by week and display them. My problem is that after doing the groupby and ungroup I end up with a data frame where there is an unnamed column that I am unable to refer to except using iloc. This is an issue because in later plots I am grouping by other columns so I need a way to refer to this column by name, not iloc.

Here's a reproducible example of my dataset:

from datetime import datetime
from faker import Faker

fake = Faker()

start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 2, 1)

# Generate data frame of 30 random dates in January 2023
df = pd.DataFrame(
    {"date": [fake.date_time_between(start_date=start_date, end_date=end_date) for i in range(30)],
     "dummy": [1 for i in range(30)]})  # There's probably a better way of counting than this

grouper = df.set_index("date").groupby([pd.Grouper(freq="W"), 'dummy'])
result = grouper['dummy'].count().unstack('dummy').fillna(0)

The result data frame that I get has weird indexes/columns that I am unable to navigate:

>>> print(result)
dummy       1
date         
2023-01-01  1
2023-01-08  3
2023-01-15  4
2023-01-22  9
2023-01-29  8
2023-02-05  5
>>> print(result.columns)
Int64Index([1], dtype='int64', name='dummy')

Then only column here is dummy, but even after result.dummy I get an AttributeError

I've also tried result.reset_index():

dummy       date  1
0     2023-01-01  1
1     2023-01-08  3
2     2023-01-15  4
3     2023-01-22  9
4     2023-01-29  8
5     2023-02-05  5

But for this data frame I can only get the date column - the counts column named "1" cannot be accessed using result.reset_index()["1"] as I get an AttributeError

I am completely perplexed by what is going on here, pandas is really powerful but sometimes I find it incredibly unintuitive. I've checked several pages of the docs and checked if there's another index level (there isn't). Can someone who's better at pandas help me out here?

I just want a way to convert the grouped data frame into something like this:

            date  counts
0     2023-01-01       1
1     2023-01-08       3
2     2023-01-15       4
3     2023-01-22       9
4     2023-01-29       8
5     2023-02-05       5

Where date and counts are columns and there is an unnamed index


Solution

  • You can solve this by simply doing:

    from datetime import datetime
    from faker import Faker
    
    fake = Faker()
    
    start_date = datetime(2023, 1, 1)
    end_date = datetime(2023, 2, 1)
    
    # Generate data frame of 30 random dates in January 2023
    df = pd.DataFrame(
        {"date": [fake.date_time_between(start_date=start_date, end_date=end_date) for i in range(30)],
         "dummy": [1 for i in range(30)]})  # There's probably a better way of counting than this
    
    result = df.groupby([pd.Grouper(freq="W", key='date'), 'dummy'], squeeze=True)['dummy'].count()
    result = result.reset_index(name='counts')
    result = result.drop(['dummy'], axis = 1)
    

    which gives

            date   counts
    0 2023-01-01          3
    1 2023-01-08          7
    2 2023-01-15          5
    3 2023-01-22          5
    4 2023-01-29          8
    5 2023-02-05          2