I have a bunch of data and each row has 4 date fields (two in/out pairs, each with a day count). For a given ID number, the first set of dates is sometimes duplicated across rows and sometimes not. It looks a little like this:
ID,LName,FName,DateIn,DateOut,Days,ODateIn,ODateOut,Odays
1,Doe,Jay,7/14/2023,8/14/2023,31.00,8/15/2023,4/22/2024,251.00
1,Doe,Jay,3/4/2021,11/5/2021,246.00,11/12/2021,12/31/2021,49.00
1,Doe,Jay,7/14/2023,8/14/2023,31.00,5/30/2024,7/2/2024,33.00
1,Doe,Jay,5/8/2022,1/1/2023,238.00,2/28/2023,4/8/2023,39.00
2,Smith,Dude,4/16/2022,6/2/2022,47.00,7/23/2022,9/13/2022,52.00
2,Smith,Dude,12/5/2022,3/14/2023,99.00,8/30/2023,10/11/2023,42.00
2,Smith,Dude,1/3/2024,3/30/2024,87.00,7/18/2024,9/1/2024,45.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,11/12/2020,1/18/2021,67.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,3/27/2021,6/9/2021,74.00
3,Doe,Jane,4/6/2020,8/10/2020,126.00,10/4/2021,11/30/2021,57.00
I would like to clean it up and combine everything into one row for each ID number. It would look like this:
ID,DateIn1,DateOut1,Days1,DateIn2,DateOut2,Days2,DateIn3,DateOut3,Days3,ODateIn1,ODateOut1,ODays1,ODateIn2,ODateOut2,ODays2,ODateIn3,ODateOut3,ODays3,ODateIn4,ODateOut4,ODays4
1,3/4/2021,11/5/2021,246.00,5/8/2022,1/1/2023,238.00,7/14/2023,8/14/2023,31.00,11/12/2021,12/31/2021,49.00,2/28/2023,4/8/2023,39.00,8/15/2023,4/22/2024,251.00,5/30/2024,7/2/2024,33.00
2,4/16/2022,6/2/2022,47.00,12/5/2022,3/14/2023,99.00,1/3/2024,3/30/2024,87.00,7/23/2022,9/13/2022,52.00,8/30/2023,10/11/2023,42.00,7/18/2024,9/1/2024,45.00,,,
3,4/6/2020,8/10/2020,126.00,,,,,,,11/12/2020,1/18/2021,67.00,3/27/2021,6/9/2021,74.00,10/4/2021,11/30/2021,57.00,,,
I tried the pivot method, but it didn't work because of the duplicate values in the first two date columns (DateIn/DateOut). Does anyone have any insight into what could make this work?
Since you haven't shared any of your own code, I'm presuming you've not done anything yet (if you have, excellent, post it for feedback) :-(
Below is a stab at joining your example rows to make them easier to feed to pandas ... it could easily be improved.
import csv

key = []          # ID/LName/FName fields used to build a unique key
flattened = []    # holds the joined row for the current key
buffer = []       # holds processed rows, written out to csv at the end
hdrTemplate = []  # repeating field names, suffixed with a number to give unique column names
fixedCols = []    # the non-repeating columns (ID, LName, FName)
setCount = 0      # how many repeating sets the current key has collected
maxSets = 0       # most sets seen for any key; drives the header width and row padding

with open('flattenMe.csv', newline='') as csvfile:
    spam = csv.reader(csvfile, delimiter=',')
    for row in spam:
        if not hdrTemplate:            # first row is the header; split it and move on
            fixedCols = row[0:3]
            hdrTemplate = row[3:]      # we need this when generating new column names
            continue
        if row[0:3] != key:            # key change processing
            if flattened:
                buffer.append(flattened)
                maxSets = max(maxSets, setCount)
            key = row[0:3]
            flattened = row
            setCount = 1
        else:                          # same key: tack this row's date fields onto the end
            flattened.extend(row[3:])
            setCount += 1
    if flattened:                      # don't forget the last key
        buffer.append(flattened)
        maxSets = max(maxSets, setCount)

# build the full header: ID,LName,FName then DateIn1..Odays1, DateIn2..Odays2, ...
header = fixedCols[:]
for n in range(1, maxSets + 1):
    header.extend(h + str(n) for h in hdrTemplate)

# pad shorter rows so every row has as many fields as the header
for r in buffer:
    r.extend([''] * (len(header) - len(r)))

#
# write results back to a csv
#
with open('flattened.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, dialect='unix', quoting=csv.QUOTE_NONE)
    writer.writerow(header)
    writer.writerows(buffer)

#
# read it back with pandas and show some detail
#
import pandas as pd

acsv = pd.read_csv('flattened.csv')
print(acsv)
Running it against your sample plus a few extra test rows (hence the fourth ID and the fifth set of columns) produces:
ID LName FName DateIn1 DateOut1 Days1 ODateIn1 ODateOut1 Odays1 ... ODateIn4 ODateOut4 Odays4 DateIn5 DateOut5 Days5 ODateIn5 ODateOut5 Odays5
0 1 Doe Jay 7/14/2023 8/14/2023 31.0 8/15/2023 4/22/2024 251.0 ... 2/28/2023 4/8/2023 39.0 NaN NaN NaN NaN NaN NaN
1 2 Smith Dude 4/16/2022 6/2/2022 47.0 7/23/2022 9/13/2022 52.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 Doe Jane 4/6/2020 8/10/2020 126.0 11/12/2020 1/18/2021 67.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4 Pat May 4/11/2020 8/10/2020 126.0 10/4/2021 11/30/2021 57.0 ... 11/12/2020 1/18/2021 67.0 3/9/2021 11/5/2021 246.0 11/12/2021 12/31/2021 49.0
[4 rows x 33 columns]
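For completeness, the pivot route you mentioned can also be made to work entirely in pandas once the duplicated DateIn rows are dropped and the remaining visits are numbered per ID with cumcount(). A rough sketch, assuming the raw file is saved as flattenMe.csv, all dates are month/day/year, and the two date blocks get pivoted separately and joined on ID (the widen helper and the output file name are just names I made up):

import pandas as pd

df = pd.read_csv('flattenMe.csv')

def widen(frame, cols, date_col):
    # drop repeated date rows, sort each ID's visits chronologically,
    # number them 1..n, then pivot so visit n becomes columns <name><n>
    sub = frame[['ID'] + cols].drop_duplicates()
    sub = sub.sort_values(date_col,
                          key=lambda s: pd.to_datetime(s, format='%m/%d/%Y'))
    sub['n'] = sub.groupby('ID').cumcount() + 1
    wide = sub.pivot(index='ID', columns='n', values=cols)
    # reorder to DateIn1,DateOut1,Days1,DateIn2,... and flatten the column names
    wide = wide.loc[:, [(c, n) for n in range(1, sub['n'].max() + 1) for c in cols]]
    wide.columns = [f'{c}{n}' for c, n in wide.columns]
    return wide

left = widen(df, ['DateIn', 'DateOut', 'Days'], 'DateIn')
right = widen(df, ['ODateIn', 'ODateOut', 'Odays'], 'ODateIn')
result = left.join(right).reset_index()
result.to_csv('flattened_pandas.csv', index=False)   # pick whatever name you like
print(result)

With your sample this gives one row per ID, the date sets in chronological order, and NaN wherever an ID has fewer visits than the widest one.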
Feel free to use/butcher/ignore as appropriate. E&O acknowledged; all critique gratefully taken on board, but I won't be posting any more on this.