Search code examples
pythonpandasdataframemerge

Combining data with multiple dates into one row for each case


I have a bunch of data and each row has a 4 date fields. The first date field is a duplicate for the ID number and sometimes not. It looks a little like this:

ID,LName,FName,DateIn,DateOut,Days,ODateIn,ODateOut,Odays
1,Doe,Jay,7/14/2023,8/14/2023,31.00,8/15/2023,4/22/2024,251.00
1,Doe,Jay,3/4/2021,11/5/2021,246.00,11/12/2021,12/31/2021,49.00
1,Doe,Jay,7/14/2023,8/14/2023,31.00,5/30/2024,7/2/2024,33.00
1,Doe,Jay,5/8/2022,1/1/2023,238.00,2/28/2023,4/8/2023,39.00
2,Smith,Dude,4/16/2022,6/2/2022,47.00,7/23/2022,9/13/2022,52.00
2,Smith,Dude,12/5/2022,3/14/2023,99.00,8/30/2023,10/11/2023,42.00
2,Smith,Dude,1/3/2024,3/30/2024,87.00,7/18/2024,9/1/2024,45.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,11/12/2020,1/18/2021,67.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,3/27/2021,6/9/2021,74.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,10/4/2021,11/30/2021,57.00

I would like to clean it up to combine them into one row for each ID number. It would look like this:

ID,DateIn1,DateOut1,Days1,DateIn2,DateOut2,Days2,DateIn3,DateOut3,Days3,ODateIn1,ODateOut1,Days1,ODateIn2,ODateOut2,Days2,ODateIn3,ODateOut3,Days3,ODateIn4,ODateOut4,Days4
1,3/4/2021,11/5/2021,246.00,5/8/2022,1/1/2023,238.00,7/14/2023,8/14/2023,31.00,11/12/2021,12/31/2021,49.00,2/28/2023,4/8/2023,39.00,8/15/2023,4/22/2024,251.00,5/30/2024,7/2/2024,33.00
2,4/16/2022,6/2/2022,47.00,12/5/2022,3/14/2023,99.00,1/3/2024,3/30/2024,87.00,7/23/2022,9/13/2022,52.00,8/30/2023,10/11/2023,42.00,7/18/2024,9/1/2024,45.00,,,
3,4/6/2020,8/10/2020,126.00,,,,,,,11/12/2020,1/18/2021,67.00,3/27/2021,6/9/2021,74.00,10/4/2021,11/30/2021,57.00,,,

I tried the pivot method but it didn't work because of my duplicate values in the first two sets of dates. Anyone have any insight into what could make this work?


Solution

  • since you haven't shared any of your own code am presuming you've not done anything (if you have , exellent, post for feedback) :-(

    Below a stab joining your example rows to facilitate panda'ing them ... Could easily be improved.

    pseudo code

    • read input csv, appending common key records into buffer array
    • scan buffer to pad 'short' records with empty/null fields
    • write buffer to a 'flattened' csv
    • load flattened csv into a panda dataframe (to test )
    • display end
    import csv
    
    key=[]       # fields used to build a unique key
    flattened=[] # holds a joined row
    header=[]
    hdrNum=1 # used when generating new column names
    skipHeader=True
    hdrTemplate=[] # repeating field names for joined row (incremented to provide unique column name)
    
    mxFlds=0 # used to determine how many ',' need to be added to rows with < mxFlds 
    
    buffer=[] # holds processed rows, written out to csv
    
    with open('flattenMe.csv', newline='') as csvfile:
        spam = csv.reader(csvfile, delimiter=',' )
        for row in spam:
            if skipHeader == True:
                header=row[0:3]     # 
                hdrTemplate=row[3:] # we need this when adding new columns
                for h in row[3:]:
                    header.append(h+str(hdrNum))
                hdrNum=hdrNum+1
                skipHeader=False # so we only do this block once
    
                continue
    
            if key == []:
                key=row[0:3]
                flattened=row
            else:
                thiskey=row[0:3]
                if thiskey != key: #key change processing
    
                    buffer.append(flattened)
                    hdrNum=hdrNum+1
    
                    flattened=row
                    key=thiskey
                else:
                    for fld in row[3:]:
                        flattened.append(fld)
    
                    if not hdrTemplate[-1]+str(hdrNum) in header: # this key set more rows than any previous
    
                        for fld in hdrTemplate:
                            header.append(fld+str(hdrNum))
    
            if len(header) > mxFlds:
                mxFlds=len(header)
    
    buffer.append(flattened)
    
    #
    # write results back to a csv
    #
    with open('flattened.csv', 'w' ) as spam:
        slice = csv.writer(spam, dialect='unix', quoting=csv.QUOTE_NONE )
        slice.writerow( header )
        for r in buffer:
            slice.writerow( r )
    
    #
    # pandas
    #
    import pandas as pd
    acsv = pd.read_csv('flattened.csv')
    
    #
    # show some detail
    #
    print(acsv)
    
    

    running produces:

       ID  LName FName    DateIn1   DateOut1  Days1    ODateIn1   ODateOut1  Odays1  ...    ODateIn4  ODateOut4  Odays4   DateIn5   DateOut5  Days5    ODateIn5   ODateOut5  Odays5
    0   1    Doe   Jay  7/14/2023  8/14/2023   31.0   8/15/2023   4/22/2024   251.0  ...   2/28/2023   4/8/2023    39.0       NaN        NaN    NaN         NaN         NaN     NaN
    1   2  Smith  Dude  4/16/2022   6/2/2022   47.0   7/23/2022   9/13/2022    52.0  ...         NaN        NaN     NaN       NaN        NaN    NaN         NaN         NaN     NaN
    2   3    Doe  Jane   4/6/2020  8/10/2020  126.0  11/12/2020   1/18/2021    67.0  ...         NaN        NaN     NaN       NaN        NaN    NaN         NaN         NaN     NaN
    3   4    Pat   May  4/11/2020  8/10/2020  126.0   10/4/2021  11/30/2021    57.0  ...  11/12/2020  1/18/2021    67.0  3/9/2021  11/5/2021  246.0  11/12/2021  12/31/2021    49.0
    
    [4 rows x 33 columns]
    
    

    feel free use/butcher/ignore as appropriate. E&O acknowledged, all critique gratefully taken onboard but will not be posting any more on this