
Pandas Dataframe Datetime Indices Concat/Merge/Join corrupts index order


Consider two .csv files containing water level data with a DateTime index. They can be downloaded from: https://www.dropbox.com/sh/50zaz9ore00j7rp/AAA2MhNrNMRImoSBWWcUBNp4a?dl=0

Imported as follows:

pbn61 = pd.read_csv('3844_61.csv', 
                                 index_col = 0, 
                                 delimiter = ';', 
                                 dayfirst = True, 
                                 usecols = ['Datumtijd','DisplayWaarde']
                   )

The second file is imported the same way, into the global variable 'pbn65'.

Now I want to merge these two DataFrames so that I can plot both data series in one graph. The reason is that I have about 50 of these files and no two of them share the same starting date and/or time, so merging some of them greatly reduces the number of graphs I end up with.

I only want the data that is available in both series, since only then is the data relevant for research. Therefore I use the following code:

pbn65.columns = ['DisplayWaarde2']

result1 = pd.merge(pbn61,pbn65, left_index = True, right_index = True, how='inner')
result2 = pbn65.join(pbn61, how = 'inner')

pd.set_option('max_rows', 25000)
result2

I needed to rename one column to make sure the join would work, and increased max rows to show the counting error.

Both ways result in the same issue: the index ends up in the wrong order. This is probably because the index is a DateTime string of the form

DD-MM-YYYY HH:MM

and joining/merging causes pandas to sort it as text rather than as a DateTime.
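The suspicion above is easy to verify with synthetic data: when `read_csv` does not parse the dates, the index is a plain string index, and any alignment or sort orders it lexicographically rather than chronologically.

```python
import pandas as pd

# Minimal sketch (synthetic timestamps, not the real files): an unparsed
# DD-MM-YYYY index is just strings, so sorting is lexicographic.
idx = pd.Index(['02-01-2015 00:00', '01-02-2015 00:00', '01-01-2015 00:00'])

print(idx.sort_values().tolist())
# '01-02-2015 ...' (1 February) sorts before '02-01-2015 ...' (2 January),
# because the comparison is character by character, not date by date.
```
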

Concatenating both DataFrames gives the following error:

result3 = pd.concat([pbn61,pbn65], axis = 1, join = 'inner')
result3

Shape of passed values is (2, 20424), indices imply (2, 19558)

The second number is exactly the length of the DataFrame that merge/join produce.

Is there a way around this issue?

P.S. I would like to keep a DateTime index since I need to have a time indication for evaluation.

P.P.S. Most files contain duplicate indices. Trying to use index.drop_duplicates seems to do nothing.


Solution

    pbn61 = pd.read_csv('3844_61.csv', 
                                     index_col = 0, 
                                     delimiter = ';', 
                                     dayfirst = True, 
                                     usecols = ['Datumtijd','DisplayWaarde'],
                                     parse_dates = [0],
                       )
    
    pbn65 = pd.read_csv('3847_65.csv', 
                                     index_col = 0, 
                                     delimiter = ';', 
                                     dayfirst = True, 
                                     usecols = ['Datumtijd','DisplayWaarde'],
                                     parse_dates = [0],
                       )
    
    pbn61 = pbn61.groupby(level=0).first()
    pbn65 = pbn65.groupby(level=0).first()
    
    result = pd.concat([pbn61, pbn65], axis=1).dropna()
    

    Explanation

                                     parse_dates = [0],
    

    parse_dates specifies which column should be parsed as a date
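The effect can be shown on a small in-memory CSV standing in for the real files: with `parse_dates`, the DD-MM-YYYY strings become a proper DatetimeIndex, and `dayfirst=True` makes 01-02-2015 parse as 1 February rather than 2 January.

```python
import io
import pandas as pd

# Synthetic CSV in the same shape as the question's files.
csv = "Datumtijd;DisplayWaarde\n01-02-2015 00:00;1.5\n02-02-2015 00:00;1.6\n"

df = pd.read_csv(io.StringIO(csv), delimiter=';', index_col=0,
                 parse_dates=[0], dayfirst=True)

print(df.index[0])  # a real Timestamp: 1 February 2015, not 2 January
```
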

    pbn61 = pbn61.groupby(level=0).first()
    

This takes care of duplicate index values by keeping the first row for each one. drop_duplicates, by contrast, removes fully duplicated records (entire rows), which is why it seemed to do nothing here.
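A quick sketch (toy data) of what `groupby(level=0).first()` does to a duplicated index:

```python
import pandas as pd

# Two rows share the 00:00 timestamp; first() keeps the first of each group.
idx = pd.to_datetime(['2015-01-01 00:00', '2015-01-01 00:00',
                      '2015-01-01 01:00'])
df = pd.DataFrame({'DisplayWaarde': [1.0, 9.9, 2.0]}, index=idx)

deduped = df.groupby(level=0).first()
print(len(deduped))                        # 2 unique timestamps remain
print(deduped['DisplayWaarde'].tolist())   # [1.0, 2.0]
```
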

    result = pd.concat([pbn61, pbn65], axis=1).dropna()
    

This merges the two on their index. I find this more intuitive, though there are many ways to do it.
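With unique indices, `concat(axis=1)` outer-aligns both series on the index and `dropna()` then keeps only the timestamps present in both, so the result matches an inner join. A minimal sketch with toy data:

```python
import pandas as pd

# Two series that overlap on a single timestamp.
a = pd.Series([1.0, 2.0],
              index=pd.to_datetime(['2015-01-01', '2015-01-02']), name='a')
b = pd.Series([3.0, 4.0],
              index=pd.to_datetime(['2015-01-02', '2015-01-03']), name='b')

# Outer alignment introduces NaN for non-shared timestamps;
# dropna() leaves only the rows present in both series.
result = pd.concat([a, b], axis=1).dropna()
print(result.index.tolist())  # only the shared 2015-01-02 timestamp
```
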

    Demonstration

    result.plot()
    

[plot of both water level series against the shared DateTime index]