Search code examples
data-structuresdenormalization

Python/Pandas - merging one to many csv for denormalization


I have a bunch of large csv files that were extracted out of a relational database. So for example I have customers.csv , address.csv and customer-address.csv that maps the key values for the relationships. I found an answer on how to merge the files here : Python/Panda - merge csv according to join table/csv

So right now my code looks like this:

df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df3 = pd.read_csv(file3)

df = (df3.merge(df1, left_on='CID', right_on='ID')
         .merge(df2, left_on='AID', right_on='ID', suffixes=('','_'))
         .drop(['CID','AID','ID_'], axis=1))
print (df)

Now I noticed that I have files with a one to many relationship and with the code above pandas is probably overriding values when there are multiple matches for one key.

Is there a method to join files with a one to many (many to many) relationship? I'm thinking of creating a full (redundant) row for each foreign key. So basically denormalization.


Solution

  • The answer to my question is to perform an outer join. With the code below pandas creates a new row for every occurence of one of the id's in the left or right dataframe thus creating a denormalized table.

    df1.merge(df2, left_on='CID', right_on='ID', how='outer')