Search code examples
pythonpandascsvtalend

Keeping duplicates of csv files when merging


I'm trying to create a dataset out of two csv files; The first csv file has more attributes (id, att1, att2, att3, label) the second one has in common with the first one the id, (id, att4, att5)

I'm trying to merge the of these files without grouping the values of the attributes, for example, the first csv file:

id, att1, att2, att3, label

1,5,20,30,0.2,10

2,5,20,30,0.2,13

1,5,50,30,0.2,20

2,5,20,30,0.5,80

1,4,10,30,0.2,90

1....

5....

3...

5....

the second csv:

id, att4, att5

1,5,6

2,7,8

3,0,2

4,1,9

5,.....

the result:

id, att1, att2, att3, label, att4, att5

1,5,20,30,0.2,10,5,6

2,5,20,30,0.2,13,7,8

1,5,50,30,0.2,20,5,6

2,5,20,30,0.5,80,7,8

1,4,10,30,0.2,90,5,6

1....

5....

3...

5....

Please I really tried a lot of possibilities using Talend, Is there a way to use python for this (I'm new to python)


Solution

  • Your first csv has an issue, there are only 5 columns in the header but 6 columns in the data rows. I changed it to this:

    first csv:

    id,id2,att1,att2,att3,label
    1,5,20,30,0.2,10
    2,5,20,30,0.2,13
    1,5,50,30,0.2,20
    2,5,20,30,0.5,80
    1,4,10,30,0.2,90
    

    second csv:

    id,att4,att5
    1,5,6
    2,7,8
    3,0,2
    4,1,9
    

    And then you can use merge:

    df1 = pd.read_csv('test.csv')
    df2 = pd.read_csv('test2.csv')
    
    df = pd.merge(df1, df2, on=['id'])
    print(df)
    
       id  id2  att1  att2  att3  label  att4  att5
    0   1    5    20    30   0.2     10     5     6
    1   1    5    50    30   0.2     20     5     6
    2   1    4    10    30   0.2     90     5     6
    3   2    5    20    30   0.2     13     7     8
    4   2    5    20    30   0.5     80     7     8
    

    And then do this:

    df.to_csv('merged.csv', index=False)
    

    merged.csv:

    id,id2,att1,att2,att3,label,att4,att5
    1,5,20,30,0.2,10,5,6
    1,5,50,30,0.2,20,5,6
    1,4,10,30,0.2,90,5,6
    2,5,20,30,0.2,13,7,8
    2,5,20,30,0.5,80,7,8