Search code examples
pythondataframeconcatenation

Python - Add Missing Data to New Dataframe


I am comparing two dataframes by looping through both and determining if data (row) from one does not exist in the other. If there is no match, I want to add that specific row to a third dataframe. I was able to achieve this with the append() method, but Python warns me that it is deprecated and tells me to use concat instead. Unfortunately, I cannot get concat to achieve the same result.

import pandas as pd

df1 = pd.DataFrame(
    {
         "Animal": ["Corgi", 
                       "Labrador",
                       "Orange Cat",
                       "Black Cat"],
         "Owner": ["Bob",
                        "Joe",
                        "Sara",
                        "Glen"]         
     })

df2 = pd.DataFrame(
    {
         "Animal": ["Corgi", 
                       "Labrador",
                       "Black Cat"],
         "Owner": ["Bob",
                        "Joe",
                        "Glen"]         
     })

dfF = pd.DataFrame(columns=["Animal", "Owner"])

df1Sort = df1.sort_values(by=["Animal"]).reset_index(drop=True)
df2Sort = df2.sort_values(by=["Animal"]).reset_index(drop=True)

for ind in df1Sort.index:
    for subInd in df2Sort.index:   
        if df1Sort["Animal"][ind] == df2Sort["Animal"][subInd]:
            print("Found match:", 
                  df1Sort["Animal"][ind],
                  "-",
                  df1Sort["Owner"][ind])            
            break
        elif df2Sort.index[subInd] == df2Sort.index[-1]:
            print("No match for:",
                  df1Sort["Animal"][ind],
                  "-",
                  df1Sort["Owner"][ind])
            dfF = pd.concat([dfF, df1Sort.iloc[ind]], axis=0, ignore_index=True)
            break

print(dfF)

The output is:

  • Found match: Black Cat - Glen
  • Found match: Corgi - Bob
  • Found match: Labrador - Joe
  • No match for: Orange Cat - Sara
  • Animal Owner 0
  • 0 NaN NaN Orange Cat
  • 1 NaN NaN Sara

I would like the new dataframe to read as such:

  • Animal Owner
  • 0 Orange Cat Sara

How do I alter the concat line to achieve this?


Solution

  • This issue happens because df1Sort.iloc[ind] returns a Series, not a Dataframe, and pd.concat expects a Dataframe. You need to wrap the Series in a DataFrame to ensure proper concatenation.

    dfF = pd.concat([dfF, pd.DataFrame([df1Sort.iloc[ind]])], ignore_index=True)
    

    Here's the updated code

    import pandas as pd
    
    df1 = pd.DataFrame({
        "Animal": ["Corgi", "Labrador", "Orange Cat", "Black Cat"],
        "Owner": ["Bob", "Joe", "Sara", "Glen"]
    })
    
    df2 = pd.DataFrame({
        "Animal": ["Corgi", "Labrador", "Black Cat"],
        "Owner": ["Bob", "Joe", "Glen"]
    })
    
    dfF = pd.DataFrame(columns=["Animal", "Owner"])
    
    df1Sort = df1.sort_values(by=["Animal"]).reset_index(drop=True)
    df2Sort = df2.sort_values(by=["Animal"]).reset_index(drop=True)
    
    for ind in df1Sort.index:
        for subInd in df2Sort.index:
            if df1Sort["Animal"][ind] == df2Sort["Animal"][subInd]:
                print("Found match:", df1Sort["Animal"][ind], "-", df1Sort["Owner"][ind])
                break
            elif df2Sort.index[subInd] == df2Sort.index[-1]:
                print("No match for:", df1Sort["Animal"][ind], "-", df1Sort["Owner"][ind])
                dfF = pd.concat([dfF, pd.DataFrame([df1Sort.iloc[ind]])], ignore_index=True)
                break
    
    print(dfF)
    

    I hope this will help you a little.