Search code examples
pythonpython-3.xpandasseriesdata-analysis

Comparing values in one column to values in a column in a second dataframe


I have 2 dataframes, each of the dataframe has a RegionName column. I would like to create a new column that will has the value 'true' if the RegionName in dataframe A appears under RegionName in dataframe B.

I have written a lambda function that attempts to do this but returns false for all items in my dataframe (which is not the case). I have also attempted to use np.where() but to no avail

housing = convert_housing_data_to_quarters()
housing = housing.iloc[:,[34,35,36,37]]
university = get_list_of_university_towns()
housing = housing.reset_index()
housing['University City'] = housing.apply(lambda x: x['RegionName'] in university['RegionName'], axis=1)

This is my attempt at using np.where()

housing['University City'] = np.where(housing['RegionName'] == university['RegionName'](axis=1),'true','false')

Solution

  • Pandas series have dictionary-like properties. Consider the following:

    s = pd.Series(['a', 'b', 'c', 'd'], index=[10, 11, 12, 13])
    
    'a' in s  # False
    10 in s   # True
    

    So you are in fact checking for existence in a series index rather than series values. In addition, apply + lambda is an inefficient, Python-level loop.

    With Pandas you should look to vectorise operations, e.g. via pd.Series.isin, which implicitly uses values. In addition, it will likely be efficient to make your comparison series unique first:

    unique_vals = university['RegionName'].unique()
    housing['University City'] = housing['RegionName'].isin(unique_vals)
    

    This creates a Boolean series. If 0 / 1 is a requirement, you can then convert to int:

    housing['University City'] = housing['University City'].astype(int)