python pandas dataframe dataset

Merge datasets using pandas


Below is code that was given to me to join two datasets.

import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

df= pd.read_csv("student/student-por.csv")
ds= pd.read_csv("student/student-mat.csv")

print("before merge")

print(df)
print(ds)

print("After merging:")

dq = pd.merge(df,ds,by=c("school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet"))

print(dq)

I get this error:

Traceback (most recent call last):
  File "/Users/PycharmProjects/datamining/main.py", line 15, in <module>
    dq = pd.merge(df, ds,by=c ("school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet"))
NameError: name 'c' is not defined

Any help would be great; I've been messing about with it for a while. I believe the 'by=c' is the issue.

Thanks


Solution

  • Hi 👋🏻 Hope you are doing well!

    The error occurs because c(...) is R syntax for building a vector; Python has no built-in named c, hence the NameError. The merge function also has a different signature in pandas: it has no by argument. The equivalent is on, which accepts a column label or a list of column labels 🙂 So in summary it should be something similar to this:

    import pandas as pd
    
    df = pd.read_csv("student/student-por.csv")
    ds = pd.read_csv("student/student-mat.csv")
    
    print("Before merge.")
    print(df)
    print(ds)
    
    print("After merge.")
    dq = pd.merge(
        left=df,
        right=ds,
        on=[
            "school",
            "sex",
            "age",
            "address",
            "famsize",
            "Pstatus",
            "Medu",
            "Fedu",
            "Mjob",
            "Fjob",
            "reason",
            "nursery",
            "internet",
        ],
    )
    print(dq)
    

    Docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
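
If you want to check both points without the CSV files, here is a minimal sketch on toy data. The two small frames, their values, and the G3 column are made up for illustration (they only stand in for student-por.csv and student-mat.csv), and the suffixes argument is an optional extra that is not part of the fix itself.

```python
import pandas as pd

# Hypothetical toy frames standing in for student-por.csv and student-mat.csv.
# "G3" is a made-up non-key column that exists in both frames.
por = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "F"],
    "age": [18, 17, 16],
    "G3": [11, 12, 14],
})
mat = pd.DataFrame({
    "school": ["GP", "MS", "MS"],
    "sex": ["F", "F", "M"],
    "age": [18, 16, 17],
    "G3": [6, 10, 15],
})

keys = ["school", "sex", "age"]

# 1) by= is not a parameter of pd.merge, so a plain Python list is not
#    enough on its own: the call still fails, this time with a TypeError.
try:
    pd.merge(por, mat, by=keys)
    by_error = None
except TypeError as exc:
    by_error = str(exc)
print("by= failed with:", by_error)

# 2) on= takes the list of key columns. suffixes= renames the overlapping
#    non-key column so it is clear which frame each value came from.
merged = pd.merge(por, mat, on=keys, suffixes=("_por", "_mat"))
print(merged)
```

Note that merge defaults to how="inner", so only rows whose key columns match in both frames are kept. Pass how="outer" (or "left" / "right") if you need to keep unmatched rows as well.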