Search code examples
pandasdatetimetime-seriesdata-cleaningtimedelta

adding time delta specific to row (Date time Cleaning)(Date time)


series data I have 2 files. Each data observation has 20 sec delay

file 1

file 1

file 2

file 2

Expected Output

expected output

What I have done so far

my output

with code

my code


Solution

  • You can generate a date_range() with freq='20s'.

    Note that I changed April 31 to a valid date. I assume the real data will have valid dates.

    dfs = []
    for file in files:
        column_name = pd.read_csv(file, index_col=False).columns[0]
        df = pd.read_csv(file, index_col=False, names=['col1', 'col2', 'col3'], skiprows=1)
        
        # generate date range with frequency of 20 seconds
        start = pd.to_datetime(column_name, dayfirst=True)
        df['col4'] = pd.date_range(start=start, periods=len(df), freq='20s')
        
        dfs.append(df)
    df = pd.concat(dfs)
    
    #    col1  col2  col3                col4
    # 0     3     5     6 2005-03-31 23:59:00
    # 1     4     6     7 2005-03-31 23:59:20
    # 2     8     9    10 2005-03-31 23:59:40
    # 0    10    20    30 2007-06-01 23:59:00
    # 1    40    50    60 2007-06-01 23:59:20
    # 2    70    80    90 2007-06-01 23:59:40
    

    To change the frequency after concatenating, you can groupby() each sub-dataframe (each one restarts at index 0) and regenerate a date_range() with the new freq:

    freq = '30s'
    df['col4'] = df.groupby((df.index == 0).cumsum())['col4'].apply(
        lambda g: pd.Series(pd.date_range(g[0], periods=len(g), freq=freq)))
    
    #    col1  col2  col3                col4
    # 0     3     5     6 2005-03-31 23:59:00
    # 1     4     6     7 2005-03-31 23:59:30
    # 2     8     9    10 2005-04-01 00:00:00
    # 0    10    20    30 2007-01-06 23:59:00
    # 1    40    50    60 2007-01-06 23:59:30
    # 2    70    80    90 2007-01-07 00:00:00