I have two dummy datasets like these, which I save as CSV:
import pandas as pd
import io
old_data = """date,time,open,high,low,close,volume
2021-05-06,04:08:00,9150090.0,9150090.0,9125001.0,9130000.0,9.015642
2021-05-06,04:09:00,9140000.0,9145000.0,9125012.0,9134068.0,3.121043
2021-05-06,04:10:00,9133882.0,9133882.0,9125002.0,9132999.0,5.536345
2021-05-06,04:11:00,9132999.0,9135013.0,9131000.0,9132999.0,5.880620"""
new_data = """timestamp,open,high,low,close,volume
1620274080000,9150090.0,9150090.0,9125001.0,9130000.0,9.015641820000004
1620274140000,9140000.0,9145000.0,9125012.0,9134068.0,3.121042509999999
1620274200000,9133882.0,9133882.0,9125002.0,9132999.0,5.5363449
1620274260000,9132999.0,9135013.0,9131000.0,9132999.0,5.88062024"""
I want to check whether df_old and df_new contain duplicated rows and, if they do, drop them:
raw = pd.read_csv(io.StringIO(new_data), encoding='UTF-8')
stream = pd.DataFrame(raw, columns=['timestamp', 'open', 'high', 'low', 'close', 'volume'])
stream['timestamp'] = pd.to_datetime(stream['timestamp'], unit='ms')
stream['date'] = pd.to_datetime(stream['timestamp']).dt.date
stream['time'] = pd.to_datetime(stream['timestamp']).dt.time
stream = stream[['date', 'time', 'open', 'high', 'low', 'close', 'volume']]
for dif_date in stream.date.unique():
    grouped = stream.groupby(stream.date)
    df_new = grouped.get_group(dif_date)
    df_old = pd.read_csv(io.StringIO(old_data), encoding='UTF-8')
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job here
    df_stream = pd.concat([df_old, df_new]).reset_index(drop=True)
    df_stream = df_stream.drop_duplicates(subset=['time'])
    print(df_stream)
>          date      time       open       high        low      close    volume
> 0  2021-05-06  04:08:00  9150090.0  9150090.0  9125001.0  9130000.0  9.015642
> 1  2021-05-06  04:09:00  9140000.0  9145000.0  9125012.0  9134068.0  3.121043
> 2  2021-05-06  04:10:00  9133882.0  9133882.0  9125002.0  9132999.0  5.536345
> 3  2021-05-06  04:11:00  9132999.0  9135013.0  9131000.0  9132999.0  5.880620
> 4  2021-05-06  04:08:00  9150090.0  9150090.0  9125001.0  9130000.0  9.015642
> 5  2021-05-06  04:09:00  9140000.0  9145000.0  9125012.0  9134068.0  3.121043
> 6  2021-05-06  04:10:00  9133882.0  9133882.0  9125002.0  9132999.0  5.536345
> 7  2021-05-06  04:11:00  9132999.0  9135013.0  9131000.0  9132999.0  5.880620
But the result still contains the duplicated rows. How can I resolve this, or reorder the data so the duplicates are detected? https://colab.research.google.com/drive/1vMx9hXKcbz8SDawTnHbzpV6JiRZsEuVP?usp=sharing Thanks in advance.
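The values in the time column do not all have the same type, so pandas cannot tell that the rows are equal: the rows read from the CSV hold strings, while the rows built from the timestamp hold datetime.time objects.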
For instance if you run:
df_stream.time.loc[0] == df_stream.time.loc[4]
You'll get False because the left hand side is a string and the right hand side is a datetime.time object.
You should force a single type on the 'time' column with astype(), for example astype(str), before calling drop_duplicates().
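A minimal sketch of that fix, using a cut-down version of the question's data (only an `open` column, and the hypothetical names `old_csv` and `new_csv`). It assumes a string is an acceptable common type for `time`, and it uses `pd.concat` in place of `append`, which is not available in recent pandas versions:

```python
import io
import pandas as pd

# Cut-down stand-ins for the question's old_data / new_data (illustrative only)
old_csv = """date,time,open
2021-05-06,04:08:00,9150090.0
2021-05-06,04:09:00,9140000.0"""
new_csv = """timestamp,open
1620274080000,9150090.0
1620274140000,9140000.0
1620274200000,9133882.0"""

# Read from CSV: 'time' holds plain strings such as '04:08:00'
df_old = pd.read_csv(io.StringIO(old_csv))

# Built from the millisecond timestamp: 'time' holds datetime.time objects
df_new = pd.read_csv(io.StringIO(new_csv))
ts = pd.to_datetime(df_new['timestamp'], unit='ms')
df_new['date'] = ts.dt.date
df_new['time'] = ts.dt.time
df_new = df_new[['date', 'time', 'open']]

df_stream = pd.concat([df_old, df_new], ignore_index=True)

# Mixed types before the cast: str for the old rows, datetime.time for the new ones
types_before = df_stream['time'].map(type).unique()

# Force one type on the column so equal times actually compare equal
df_stream['time'] = df_stream['time'].astype(str)
df_stream = df_stream.drop_duplicates(subset=['time']).reset_index(drop=True)

print(df_stream)
```

Note that the `date` column has the same mixed-type problem (strings from the CSV, datetime.date objects from the timestamp), so it would need the same cast if you add it to `subset`.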