Group non-unique datetime column by date and sum values in python


I have dataframe df as below:

         start_time                 end_time        count
0   2020-02-03 08:42:21.997 2020-02-03 09:34:18.737 3116
1   2020-02-03 09:34:18.837 2020-02-03 10:16:56.583 2557
2   2020-02-03 10:17:00.480 2020-02-03 13:18:51.540 10911
3   2020-02-03 13:18:51.640 2020-02-03 14:01:23.263 2551
4   2020-02-03 14:01:23.363 2020-02-03 14:43:56.977 255

I would like to group by only the date part of the start_time column and sum all count values that fall on the same day. I found a relevant answer in this post.

Using this method:

data.groupby(data.date.dt.year)

however, I received the error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-46-7618d5285bb9> in <module>()
      1 
----> 2 df.groupby(df.date.dt.year)      # Adding ['start_time'] will return 'AttributeError: 'Series' object has no attribute 'date''.
      3 
      4 
      5 

/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in __getattr__(self, name)
   5177             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5178                 return self[name]
-> 5179             return object.__getattribute__(self, name)
   5180 
   5181     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'date'

What is the problem and how can I group these non-unique datetime values in the start_time column by date only and sum the values?


Edit:

In fact, I was able to do it with

import datetime 
df['date'] = df['start_time'].dt.date       # Group by 'date' of 'datetime' column
df.groupby('date').sum()                    # Sum

But I'd like to know if I could do it directly, probably something more straightforward like a one-liner as shown in the answer in the aforementioned post.


Solution

  • Super close. Series.dt.date is how you access just the date portion of a datetime column (https://www.geeksforgeeks.org/python-pandas-series-dt-date/). Try:

    data.groupby(data["start_time"].dt.date)["count"].sum()
    

    Here is some background information about the indexing that I think you're missing:

    When you write data["start_time"], you get the column start_time from your dataframe data. An equivalent way of getting this column is data.start_time, which works as long as the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. When you try to access data.date (which is equivalent to data["date"]), you get an AttributeError because your dataframe does not have a column called date.

    If the start_time column is of type datetime, then it has an accessor called dt, which in turn has the attribute date. That is what you want to group by. You can access it through data.start_time.dt.date or data["start_time"].dt.date.

    When you write data["date"] = data["start_time"].dt.date, you are creating a new column in your dataframe called date which holds just the date portion of your start_time column. You can now access it through data.date (or data["date"]), which is why your solution works.
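
For reference, here is a minimal sketch of the one-liner on a small frame. The frame is rebuilt by hand from the start_time and count values in the question, and the last row (2020-02-04) is an invented extra so that there is more than one group. The dt.normalize() variant is not from the question or the answer; it is an alternative that should give the same sums while keeping a datetime64 index instead of Python date objects.

```python
import pandas as pd

# Frame rebuilt from the question's sample; the last row is an
# invented extra so that a second date shows up in the result.
data = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2020-02-03 08:42:21.997",
        "2020-02-03 09:34:18.837",
        "2020-02-03 10:17:00.480",
        "2020-02-03 13:18:51.640",
        "2020-02-03 14:01:23.363",
        "2020-02-04 09:00:00.000",  # hypothetical row, not in the question
    ]),
    "count": [3116, 2557, 10911, 2551, 255, 100],
})

# Group directly by the date part of start_time, without adding a column.
daily = data.groupby(data["start_time"].dt.date)["count"].sum()

# Alternative (an assumption, not part of the original answer):
# normalize() floors each timestamp to midnight, so the group keys
# stay datetime64 values rather than datetime.date objects.
normalized = data.groupby(data["start_time"].dt.normalize())["count"].sum()

print(daily)
print(normalized)
```

Selecting ["count"] before calling sum() keeps the aggregation to the one numeric column, so the other columns in the frame are not summed along with it.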