I'm trying to create a dataset out of two csv files; The first csv file has more attributes (id, att1, att2, att3, label) the second one has in common with the first one the id, (id, att4, att5)
I'm trying to merge the of these files without grouping the values of the attributes, for example, the first csv file:
id, att1, att2, att3, label
1,5,20,30,0.2,10
2,5,20,30,0.2,13
1,5,50,30,0.2,20
2,5,20,30,0.5,80
1,4,10,30,0.2,90
1....
5....
3...
5....
the second csv:
id, att4, att5
1,5,6
2,7,8
3,0,2
4,1,9
5,.....
the result:
id, att1, att2, att3, label, att4, att5
1,5,20,30,0.2,10,5,6
2,5,20,30,0.2,13,7,8
1,5,50,30,0.2,20,5,6
2,5,20,30,0.5,80,7,8
1,4,10,30,0.2,90,5,6
1....
5....
3...
5....
Please I really tried a lot of possibilities using Talend, Is there a way to use python for this (I'm new to python)
Your first csv has an issue, there are only 5 columns in the header but 6 columns in the data rows. I changed it to this:
first csv:
id,id2,att1,att2,att3,label
1,5,20,30,0.2,10
2,5,20,30,0.2,13
1,5,50,30,0.2,20
2,5,20,30,0.5,80
1,4,10,30,0.2,90
second csv:
id,att4,att5
1,5,6
2,7,8
3,0,2
4,1,9
And then you can use merge
:
df1 = pd.read_csv('test.csv')
df2 = pd.read_csv('test2.csv')
df = pd.merge(df1, df2, on=['id'])
print(df)
id id2 att1 att2 att3 label att4 att5
0 1 5 20 30 0.2 10 5 6
1 1 5 50 30 0.2 20 5 6
2 1 4 10 30 0.2 90 5 6
3 2 5 20 30 0.2 13 7 8
4 2 5 20 30 0.5 80 7 8
And then do this:
df.to_csv('merged.csv', index=False)
merged.csv:
id,id2,att1,att2,att3,label,att4,att5
1,5,20,30,0.2,10,5,6
1,5,50,30,0.2,20,5,6
1,4,10,30,0.2,90,5,6
2,5,20,30,0.2,13,7,8
2,5,20,30,0.5,80,7,8