Search code examples
python-3.xpandasdatasetdata-sciencefillna

Fill in the missing data using Pandas


What's the best way to fill in the missing data using Pandas . I have a list of visitors where the exit time or the entry time is missing .

visitor entry            exit
A   16/02/2016 08:46    16/02/2016 09:01
A   16/02/2016 09:20    16/02/2016 17:24
A   17/02/2016 09:12    17/02/2016 09:42
A   17/02/2016 09:55    NaT
A   17/02/2016 12:42    17/02/2016 12:56
A   17/02/2016 13:02    17/02/2016 17:32
A   17/02/2016 17:44    17/02/2016 18:24
A   18/02/2016 07:59    18/02/2016 16:40
A   18/02/2016 16:53    NaT
A   NaT                 19/02/2016 09:11
A   19/02/2016 09:27    19/02/2016 11:26
A   19/02/2016 12:28    19/02/2016 17:12
A   20/02/2016 08:44    20/02/2016 08:58
A   20/02/2016 09:16    20/02/2016 17:21

Solution

  • You can use DataFrame.ffill + DataFrame.bfill to complete with the same time of entry / exit:

    df[['entry','exit']]=df[['entry','exit']].ffill(axis=1).bfill(axis=1)
    print(df)
       visitor               entry                exit
    0        A 2016-02-16 08:46:00 2016-02-16 09:01:00
    1        A 2016-02-16 09:20:00 2016-02-16 17:24:00
    2        A 2016-02-17 09:12:00 2016-02-17 09:42:00
    3        A 2016-02-17 09:55:00 2016-02-17 09:55:00
    4        A 2016-02-17 12:42:00 2016-02-17 12:56:00
    5        A 2016-02-17 13:02:00 2016-02-17 17:32:00
    6        A 2016-02-17 17:44:00 2016-02-17 18:24:00
    7        A 2016-02-18 07:59:00 2016-02-18 16:40:00
    8        A 2016-02-18 16:53:00 2016-02-18 16:53:00
    9        A 2016-02-19 09:11:00 2016-02-19 09:11:00
    10       A 2016-02-19 09:27:00 2016-02-19 11:26:00
    11       A 2016-02-19 12:28:00 2016-02-19 17:12:00
    12       A 2016-02-20 08:44:00 2016-02-20 08:58:00
    13       A 2016-02-20 09:16:00 2016-02-20 17:21:00
    

    EDIT

    DataFrame.notna + DataFrame.all to performance a boolean indexing to filter to ros with NaT values in order to calculate the mean of the diff

    #filtering valid data
    df_valid=df[df.notna().all(axis=1)]
    #Calculating diff
    time_dif=df_valid[['entry','exit']].diff(axis=1).exit
    print(time_dif)
    
    0    00:15:00
    1    08:04:00
    2    00:30:00
    4    00:14:00
    5    04:30:00
    6    00:40:00
    7    08:41:00
    10   01:59:00
    11   04:44:00
    12   00:14:00
    13   08:05:00
    Name: exit, dtype: timedelta64[ns]
    

    #Calculatin mean        
    time_dif_mean=time_dif.mean()
    print('This is the mean of time in: ', time_dif_mean)
    
    This is the mean of time in:  0 days 03:26:54.545454
    

    Filling mising value with the mean

    #roud to seconds( optional)
    time_dif_mean_round_second=time_dif_mean.round('s')
    
    df['entry'].fillna(df['exit']-time_dif_mean_round_second,inplace=True)
    df['exit'].fillna(df['entry']+time_dif_mean_round_second,inplace=True)
    print(df)
    

    Output:

       visitor               entry                exit
    0        A 2016-02-16 08:46:00 2016-02-16 09:01:00
    1        A 2016-02-16 09:20:00 2016-02-16 17:24:00
    2        A 2016-02-17 09:12:00 2016-02-17 09:42:00
    3        A 2016-02-17 09:55:00 2016-02-17 13:21:55
    4        A 2016-02-17 12:42:00 2016-02-17 12:56:00
    5        A 2016-02-17 13:02:00 2016-02-17 17:32:00
    6        A 2016-02-17 17:44:00 2016-02-17 18:24:00
    7        A 2016-02-18 07:59:00 2016-02-18 16:40:00
    8        A 2016-02-18 16:53:00 2016-02-18 20:19:55
    9        A 2016-02-19 05:44:05 2016-02-19 09:11:00
    10       A 2016-02-19 09:27:00 2016-02-19 11:26:00
    11       A 2016-02-19 12:28:00 2016-02-19 17:12:00
    12       A 2016-02-20 08:44:00 2016-02-20 08:58:00
    13       A 2016-02-20 09:16:00 2016-02-20 17:21:00