Search code examples
numpypandasdata-analysis

Merge DataFrames and discard duplicates values


I'm collecting time-indexed data coming from various files, but sometimes there is some overlapping:

df1 = pd.DataFrame([1, -1, -3], columns=['A'], index=pd.date_range('2000-01-01', periods=3))
df2 = pd.DataFrame([-3, 10, 1], columns=['A'], index=pd.date_range('2000-01-03', periods=3))
pd.concat([df1, df2])

            A
2000-01-01  1
2000-01-02 -1
2000-01-03 -3

             A
2000-01-03  -3
2000-01-04  10
2000-01-05   1

             A
2000-01-01   1
2000-01-02  -1
2000-01-03  -3
2000-01-03  -3
2000-01-04  10
2000-01-05   1

1) How to clean and remove the duplicate lines ? (here 2000-01-03)

2) More generally, is there a faster / more clever way with pandas to read and merge multiple csv files than doing manually:

L=[]
for f in glob.glob('*.csv'):
    L.append(pd.read_csv(f, ...))
fulldata = pd.concat(L)                   # this can be time consuming
fulldata.remove_duplicate_lines()         # this can be time consuming too

Solution

  • IIUC you could do pd.concat and then do drop_duplicates:

    In [104]: pd.concat([df1, df2]).drop_duplicates()
    Out[104]: 
                 A
    2000-01-01   1
    2000-01-02  -1
    2000-01-03  -3
    2000-01-04  10
    2000-01-05   7
    

    EDIT

    You are right, that method isn't working properly because it drops by value not by index. For index you could duplicated for index:

    df = pd.concat([df1, df2])
    df[~df.index.duplicated()]
    
    In [107]: df[~df.index.duplicated()]
    Out[107]: 
                 A
    2000-01-01   1
    2000-01-02  -1
    2000-01-03  -3
    2000-01-04  10
    2000-01-05   1
    

    Or you could use 1st method with modification, first you need to do reset_index, and then use drop_duplicates but for index values with subset key:

     pd.concat([df1, df2]).reset_index().drop_duplicates(subset='index').set_index('index')
    
    In [118]: pd.concat([df1, df2]).reset_index().drop_duplicates(subset='index').set_index('index')
    Out[118]: 
                 A
    index         
    2000-01-01   1
    2000-01-02  -1
    2000-01-03  -3
    2000-01-04  10
    2000-01-05   1