
How to use IF NOT IN in pandas groupby object?


I have such a dataframe:

import pandas as pd
import numpy as np
# create a sample DataFrame
data = {'ID': [1, 1, 1, 2, 2, 2],
        'timestamp': ['2022-01-01 12:00:00', '2022-01-01 13:00:00', '2022-01-01 18:00:00',
                      '2022-01-01 12:02:00', '2022-01-01 13:02:00', '2022-01-01 18:02:00'],
        'value1': [10, 20, 30, 40, 50, 60],
        'gender': ['M', 'M', 'F', 'F', 'F', 'M'],
        'age': [20, 25, 30, 35, 40, 45]}
df = pd.DataFrame(data)

# extract the date from the timestamp column
df['date'] = pd.to_datetime(df['timestamp']).dt.date

For this dataframe, I would like to enumerate the timestamp values, then take each single timestamp and check whether it exists in the groupby object. If it does not exist, I want to append it. Here is my approach:

for indx, single_date in enumerate(df.timestamp):
    if df.timestamp[indx] not in df.groupby(['ID'], as_index=False):
        df2 = pd.DataFrame([[df.ID[indx], df.timestamp[indx], np.nan, df.gender[indx], df.age[indx]]],
                           columns=['ID', 'timestamp', 'value1', 'gender', 'age'])
        df2['timestamp'] = pd.to_datetime(df2['timestamp'])
        new_ckd = df.groupby(['ID']).apply(lambda y: pd.concat([y, df2]))

new_ckd['timestamp'] = pd.to_datetime(new_ckd['timestamp'])
new_ckd = new_ckd.sort_values(by=['timestamp'], ascending=True).reset_index(drop=True)

print(df.groupby(['ID'], as_index=False).timestamp.apply(print))
for indx, single_date in enumerate(df.timestamp):
    if df.timestamp[indx] in df.groupby(['ID'], as_index=False).timestamp:
        print('a')

I realized that an `if ... not in` condition on a groupby object does not work the way I expected. How can I make it work?
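For context on why the check misbehaves: iterating a pandas GroupBy yields `(group_key, sub-DataFrame)` pairs, so an `in` test compares a timestamp against those pairs (or against the group keys), never against the timestamp values inside the groups. A small sketch, including one way that does test what was intended (via `get_group`, shown here only as an illustration):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2],
                   'timestamp': ['2022-01-01 12:00:00',
                                 '2022-01-01 13:00:00',
                                 '2022-01-01 12:02:00']})

# Iterating a GroupBy yields (group_key, sub-DataFrame) pairs,
# so membership tests never see individual timestamp values.
keys = [k for k, _ in df.groupby('ID')]
print(keys)  # group keys are the IDs: [1, 2]

# To test whether a timestamp exists within one ID's group,
# check that group's own column instead:
grp = df.groupby('ID').get_group(1)
print('2022-01-01 12:00:00' in grp['timestamp'].values)  # True
print('2022-01-01 12:02:00' in grp['timestamp'].values)  # False: belongs to ID 2
```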

What I have:

ID  value1  timestamp            gender  age
1   50      2022-01-01 12:00:00  m       7
1   80      2022-01-01 12:30:00  m       7
1   65      2022-01-01 13:00:00  m       7
2   65      2022-01-01 12:02:00  f       8
2   83      2022-01-01 12:22:00  f       8
2   63      2022-01-01 12:42:00  f       8

What I expect:

ID  value1  timestamp            gender  age
1   50      2022-01-01 12:00:00  m       7
1   NaN     2022-01-01 12:02:00  m       7
1   NaN     2022-01-01 12:22:00  m       7
1   80      2022-01-01 12:30:00  m       7
1   NaN     2022-01-01 12:42:00  m       7
1   65      2022-01-01 13:00:00  m       7
2   NaN     2022-01-01 12:00:00  f       8
2   65      2022-01-01 12:02:00  f       8
2   83      2022-01-01 12:22:00  f       8
2   NaN     2022-01-01 12:30:00  f       8
2   63      2022-01-01 12:42:00  f       8
2   NaN     2022-01-01 13:00:00  f       8

Solution

  • You can reframe your task as: add the missing timestamps to every unique ID, based on all timestamps present in the dataframe, and fill the NaNs in the result.

    This can be achieved, for example, by reindexing on a MultiIndex built from the cross product of IDs and timestamps, and then filling the resulting NaNs:

    import numpy as np
    import pandas as pd

    data = {'ID': [1, 1, 1, 2, 2, 2],
            'timestamp': ['2022-01-01 12:00:00', '2022-01-01 13:00:00', '2022-01-01 18:00:00',
                          '2022-01-01 12:02:00', '2022-01-01 13:02:00', '2022-01-01 18:02:00'],
            'value1': [10, 20, 30, 40, 50, 60],
            'gender': ['M', 'M', 'F', 'F', 'F', 'M'],
            'age': [20, 25, 30, 35, 40, 45]}
    df = pd.DataFrame(data)
    
    # cross apply to build index 
    cross = df[['ID']].drop_duplicates().merge(df[['timestamp']].drop_duplicates(), how = 'cross')
    multiIdx = pd.MultiIndex.from_frame(cross)
    
    # "add" missing rows
    df = df.set_index(['ID', 'timestamp']) \
            .reindex(multiIdx, fill_value=np.nan) \
            .reset_index() \
            .sort_values(by=['ID', 'timestamp'], ignore_index=True)
    
    # fill NaNs
    df[['gender', 'age']] = df.groupby('ID')[['gender', 'age']].ffill().bfill()
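    As a quick sanity check on the sample data, each ID gets expanded to all six distinct timestamps, so the result has 12 rows with 6 NaN fillers in `value1` (the pipeline is repeated below so the check runs standalone):

```python
import numpy as np
import pandas as pd

data = {'ID': [1, 1, 1, 2, 2, 2],
        'timestamp': ['2022-01-01 12:00:00', '2022-01-01 13:00:00', '2022-01-01 18:00:00',
                      '2022-01-01 12:02:00', '2022-01-01 13:02:00', '2022-01-01 18:02:00'],
        'value1': [10, 20, 30, 40, 50, 60],
        'gender': ['M', 'M', 'F', 'F', 'F', 'M'],
        'age': [20, 25, 30, 35, 40, 45]}
df = pd.DataFrame(data)

# cross product of unique IDs and unique timestamps -> target index
cross = df[['ID']].drop_duplicates().merge(df[['timestamp']].drop_duplicates(), how='cross')
multiIdx = pd.MultiIndex.from_frame(cross)

# reindex adds the missing (ID, timestamp) rows as NaN
df = df.set_index(['ID', 'timestamp']) \
       .reindex(multiIdx, fill_value=np.nan) \
       .reset_index() \
       .sort_values(by=['ID', 'timestamp'], ignore_index=True)

# fill per-ID constants forward, then backward for leading gaps
df[['gender', 'age']] = df.groupby('ID')[['gender', 'age']].ffill().bfill()

# 2 IDs x 6 distinct timestamps = 12 rows, 6 of them newly added
print(len(df), df['value1'].isna().sum())  # 12 6
```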
    

    UPD

    If you have non-unique entries (based on the ID + timestamp pair), you can use a left merge instead:

    cross = ...
    df = cross.merge(df, on=['ID', 'timestamp'], how='left') \
              .sort_values(by=['ID', 'timestamp'], ignore_index=True)
    df[['gender', 'age']] = df.groupby('ID')[['gender', 'age']].ffill().bfill()
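
    To see why the merge variant tolerates duplicates: reindexing a non-unique index raises a duplicate-label error in pandas, while a left merge simply keeps every matching row. A standalone sketch with one duplicated (ID, timestamp) pair (the values here are made up for illustration):

```python
import pandas as pd

# one duplicated (ID, timestamp) pair -- reindex would fail here, merge does not
data = {'ID': [1, 1, 1, 2],
        'timestamp': ['2022-01-01 12:00:00', '2022-01-01 12:00:00',
                      '2022-01-01 13:00:00', '2022-01-01 12:02:00'],
        'value1': [10, 15, 20, 40],
        'gender': ['M', 'M', 'M', 'F'],
        'age': [20, 20, 20, 35]}
df = pd.DataFrame(data)

# same cross product as before: unique IDs x unique timestamps
cross = df[['ID']].drop_duplicates().merge(df[['timestamp']].drop_duplicates(), how='cross')
out = cross.merge(df, on=['ID', 'timestamp'], how='left') \
           .sort_values(by=['ID', 'timestamp'], ignore_index=True)
out[['gender', 'age']] = out.groupby('ID')[['gender', 'age']].ffill().bfill()

# 2 IDs x 3 distinct timestamps = 6 combinations, plus 1 extra row
# because the duplicated (1, 12:00) pair matches its cross row twice
print(len(out))  # 7
```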