Tags: python, datetime, pandas, nat

Python Pandas Groupby Dropping DateTime Columns


I am having some trouble using groupby.median() and groupby.mean() on a DataFrame containing intermittent NaT values. Specifically, I have several columns in a dataset that hold time differences calculated from other columns. In some rows no time difference exists, which produces a NaT value, as in the example below:

Group    Category    Start Time      End Time      Time Diff
  A         1        08:00:00.000    08:00:00.500      .500
  B         1        09:00:00.000    09:02:00.000  2:00.000
  B         1        09:00:00.000      NaT           NaT
  A         2        09:00:00.000    09:02:00.000  2:00.000
  A         2        09:00:00.000    09:01:00.000  1:00.000
  A         2        08:00:00.000    08:00:01.500     1.500

Any time I run df.groupby(['Group', 'Category']).median() or .mean(), any column that contains NaT is dropped from the result set. I've attempted a fillna, but the NaT values seemed to remain. As an added point of context, this script worked correctly in an older version of Anaconda Python (1.x). I was recently able to upgrade my work computer to 2.0.1, at which point this issue began cropping up.

EDIT: I will leave my thoughts about NaT values above in case they are a factor, but upon further review it seems that my problem actually lies in the fact that these columns have dtype timedelta64. Does anyone know of any workarounds to obtain the mean/median of timedeltas?

Thanks very much for any insight you may have!


Solution

  • After some further googling/experimentation I confirmed that the issue appeared to be related to the columns that were timedelta64. In order to run groupby aggregations on these columns I first converted them to floats like so:

    df['End Time'] = df['End Time'].astype('timedelta64[ms]') / 86400000

    Here 86400000 is the number of milliseconds in a day, so the values come out as fractional days. There may be a more elegant solution to this, but it allowed me to move forward with my analysis.

    Thanks!
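
For anyone landing here later, the same idea can be sketched without relying on how astype('timedelta64[ms]') behaves, which differs between pandas versions. This is only a sketch under assumptions: the sample frame is rebuilt by hand from the question's table, the diff_seconds column name is made up for illustration, and seconds are used instead of fractional days. It converts the timedeltas to float seconds with .dt.total_seconds(), aggregates, then converts the results back with pd.to_timedelta.

```python
import pandas as pd

# Sample data rebuilt from the question's table (time differences in seconds).
# The NaN entry becomes NaT once converted to a timedelta.
df = pd.DataFrame({
    "Group": ["A", "B", "B", "A", "A", "A"],
    "Category": [1, 1, 1, 2, 2, 2],
    "Time Diff": pd.to_timedelta(
        [0.5, 120.0, float("nan"), 120.0, 60.0, 1.5], unit="s"
    ),
})

# Convert timedeltas to plain floats (seconds); NaT becomes NaN.
# "diff_seconds" is a hypothetical helper column, not from the original post.
df["diff_seconds"] = df["Time Diff"].dt.total_seconds()

# mean/median skip NaN by default, so the missing row does not drop the column.
stats = df.groupby(["Group", "Category"])["diff_seconds"].agg(["mean", "median"])

# Convert the float results back to timedeltas for readability.
mean_td = pd.to_timedelta(stats["mean"], unit="s")
median_td = pd.to_timedelta(stats["median"], unit="s")

print(stats)
print(mean_td)
print(median_td)
```

Working in seconds rather than days is just a choice of unit; dividing diff_seconds by 86400 would give the same fractional days as the conversion above.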