I have a bunch of large csv files that were extracted out of a relational database. So for example I have customers.csv
, address.csv
and customer-address.csv
that maps the key values for the relationships. I found an answer on how to merge the files here :
Python/Panda - merge csv according to join table/csv
So right now my code looks like this:
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df3 = pd.read_csv(file3)
df = (df3.merge(df1, left_on='CID', right_on='ID')
.merge(df2, left_on='AID', right_on='ID', suffixes=('','_'))
.drop(['CID','AID','ID_'], axis=1))
print (df)
Now I noticed that I have files with a one to many relationship and with the code above pandas is probably overriding values when there are multiple matches for one key.
Is there a method to join files with a one to many (many to many) relationship? I'm thinking of creating a full (redundant) row for each foreign key. So basically denormalization.
The answer to my question is to perform an outer join. With the code below pandas creates a new row for every occurence of one of the id's in the left or right dataframe thus creating a denormalized table.
df1.merge(df2, left_on='CID', right_on='ID', how='outer')