I have 2 dataframes
, each of the dataframe
has a RegionName
column. I would like to create a new column that will has the value 'true' if the RegionName
in dataframe
A appears under RegionName
in dataframe
B.
I have written a lambda function that attempts to do this but returns false for all items in my dataframe
(which is not the case). I have also attempted to use np.where()
but to no avail
housing = convert_housing_data_to_quarters()
housing = housing.iloc[:,[34,35,36,37]]
university = get_list_of_university_towns()
housing = housing.reset_index()
housing['University City'] = housing.apply(lambda x: x['RegionName'] in university['RegionName'], axis=1)
This is my attempt at using np.where()
housing['University City'] = np.where(housing['RegionName'] == university['RegionName'](axis=1),'true','false')
Pandas series have dictionary-like properties. Consider the following:
s = pd.Series(['a', 'b', 'c', 'd'], index=[10, 11, 12, 13])
'a' in s # False
10 in s # True
So you are in fact checking for existence in a series index rather than series values. In addition, apply
+ lambda
is an inefficient, Python-level loop.
With Pandas you should look to vectorise operations, e.g. via pd.Series.isin
, which implicitly uses values. In addition, it will likely be efficient to make your comparison series unique first:
unique_vals = university['RegionName'].unique()
housing['University City'] = housing['RegionName'].isin(unique_vals)
This creates a Boolean series. If 0
/ 1
is a requirement, you can then convert to int
:
housing['University City'] = housing['University City'].astype(int)