I have a dummy DataFrame like this, which I save to CSV:
import pandas as pd
import io
old_data = """date,time,open,high,low,close,volume
new_data = """timestamp,open,high,low,close,volume
I want to check whether there are duplicated rows between df_old and df_new and, if there are any, drop them:
raw = pd.read_csv(io.StringIO(new_data), encoding='UTF-8')
stream = pd.DataFrame(raw, columns=['timestamp', 'open', 'high', 'low', 'close', 'volume'])
stream['timestamp'] = pd.to_datetime(stream['timestamp'], unit='ms')
stream['date'] = pd.to_datetime(stream['timestamp'])
stream['time'] = pd.to_datetime(stream['timestamp']).dt.time
stream = stream[['date', 'time', 'open', 'high', 'low', 'close', 'volume']]
for dif_date in
grouped = stream.groupby(
df_new = grouped.get_group(dif_date)
df_old = pd.read_csv(io.StringIO(old_data), encoding='UTF-8')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df_stream = pd.concat([df_old, df_new]).reset_index(drop=True)
df_stream = df_stream.drop_duplicates(subset=['time'])
> date time open high low close volume
> 0 2021-05-06 04:08:00 9150090.0 9150090.0 9125001.0 9130000.0 9.015642
> 1 2021-05-06 04:09:00 9140000.0 9145000.0 9125012.0 9134068.0 3.121043
> 2 2021-05-06 04:10:00 9133882.0 9133882.0 9125002.0 9132999.0 5.536345
> 3 2021-05-06 04:11:00 9132999.0 9135013.0 9131000.0 9132999.0 5.880620
> 4 2021-05-06 04:08:00 9150090.0 9150090.0 9125001.0 9130000.0 9.015642
> 5 2021-05-06 04:09:00 9140000.0 9145000.0 9125012.0 9134068.0 3.121043
> 6 2021-05-06 04:10:00 9133882.0 9133882.0 9125002.0 9132999.0 5.536345
> 7 2021-05-06 04:11:00 9132999.0 9135013.0 9131000.0 9132999.0 5.880620
But the result still contains duplicated rows. How can I resolve this, or reorder it? Thanks in advance.
The values in the time column do not all have the same type, so pandas cannot tell that the rows are equal.
For instance if you run:
df_stream.time.loc[0] == df_stream.time.loc[4]
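You'll get False because the left-hand side is a string (it came from the CSV) and the right-hand side is a datetime.time object (it came from .dt.time).
You should force a single type on the 'time' column with astype(), for example astype(str), before calling drop_duplicates().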
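
Here is a minimal sketch of that fix. The two small frames are made-up stand-ins for df_old and df_new (the values are illustrative, not your actual data), and it assumes that casting with astype(str) is acceptable, since str() of a datetime.time with no microseconds gives the same 'HH:MM:SS' text that the CSV holds:

```python
import datetime as dt

import pandas as pd

# Hypothetical stand-in for df_old: read from CSV, so 'time' holds strings.
df_old = pd.DataFrame({
    "time": ["04:08:00", "04:09:00"],
    "close": [9130000.0, 9134068.0],
})

# Hypothetical stand-in for df_new: built via .dt.time, so 'time' holds
# datetime.time objects.
df_new = pd.DataFrame({
    "time": [dt.time(4, 9), dt.time(4, 10)],
    "close": [9134068.0, 9132999.0],
})

df_stream = pd.concat([df_old, df_new], ignore_index=True)

# Mixed types: the string "04:09:00" and datetime.time(4, 9) never compare
# equal, so no row is dropped here.
before = df_stream.drop_duplicates(subset=["time"])

# Force one type on the column, then deduplicate.
df_stream["time"] = df_stream["time"].astype(str)
after = df_stream.drop_duplicates(subset=["time"]).reset_index(drop=True)

print(len(before), len(after))
```

By default drop_duplicates keeps the first occurrence, so the rows from df_old win over the matching rows from df_new. If you need the column as real time objects afterwards, you could instead convert the string side, but casting everything to str is the smallest change.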