series data I have 2 files. Each data observation has 20 sec delay
file 1
file 2
Expected Output
What I have done so far
with code
You can generate a date_range()
with freq='20s'
.
Note that I changed April 31 to a valid date. I assume the real data will have valid dates.
dfs = []
for file in files:
column_name = pd.read_csv(file, index_col=False).columns[0]
df = pd.read_csv(file, index_col=False, names=['col1', 'col2', 'col3'], skiprows=1)
# generate date range with frequency of 20 seconds
start = pd.to_datetime(column_name, dayfirst=True)
df['col4'] = pd.date_range(start=start, periods=len(df), freq='20s')
dfs.append(df)
df = pd.concat(dfs)
# col1 col2 col3 col4
# 0 3 5 6 2005-03-31 23:59:00
# 1 4 6 7 2005-03-31 23:59:20
# 2 8 9 10 2005-03-31 23:59:40
# 0 10 20 30 2007-06-01 23:59:00
# 1 40 50 60 2007-06-01 23:59:20
# 2 70 80 90 2007-06-01 23:59:40
To change the frequency after concatenating, you can groupby()
each sub-dataframe (each one restarts at index 0) and regenerate a date_range()
with the new freq
:
freq = '30s'
df['col4'] = df.groupby((df.index == 0).cumsum())['col4'].apply(
lambda g: pd.Series(pd.date_range(g[0], periods=len(g), freq=freq)))
# col1 col2 col3 col4
# 0 3 5 6 2005-03-31 23:59:00
# 1 4 6 7 2005-03-31 23:59:30
# 2 8 9 10 2005-04-01 00:00:00
# 0 10 20 30 2007-01-06 23:59:00
# 1 40 50 60 2007-01-06 23:59:30
# 2 70 80 90 2007-01-07 00:00:00