I am trying to filter a dataframe based on a certain date and country name. I have extracted the columns I want the final dataframe to contain. I have also created a list of the country names that I want to filter on.
cols_to_keep = projects[['regionname', 'countryname', 'lendinginstr', 'totalamt', 'boardapprovaldate','location', 'GeoLocID', 'GeoLocName','Latitude', 'Longitude', 'Country', 'project_name']]
countries = ['Bosnia', 'Herzegovina', 'Croatia', 'Kosovo', 'Macedonia', 'Serbia', 'Slovenia']
Then I try to filter on the date that I want which is April 27th 1992.
projects['boardapprovaldate'] < pd.Timestamp(datetime.date(1992,4,27))
Then I try applying the date filter on the country names that I want by doing:
cols_to_keep[(cols_to_keep['boardapprovaldate'] < pd.Timestamp(datetime.date(1992,4,27))) &
(cols_to_keep['countryname'].isin(countries))].sort_values('boardapprovaldate')
which only shows me 2 rows when I should have 12 rows
However, if I try
str.contains() instead of isin()
method I can retrieve the correct results I want.
projects[(projects['boardapprovaldate'] < pd.Timestamp(datetime.date(1992,4,27))) &
((projects['countryname'].str.contains('Bosnia')) |
(projects['countryname'].str.contains('Croatia')) |
(projects['countryname'].str.contains('Kosovo')) |
(projects['countryname'].str.contains('Macedonia')) |
(projects['countryname'].str.contains('Montenegro')) |
(projects['countryname'].str.contains('Serbia')) |
(projects['countryname'].str.contains('Slovenia')))][
['regionname',
'countryname',
'lendinginstr',
'totalamt',
'boardapprovaldate',
'location',
'GeoLocID',
'GeoLocName',
'Latitude',
'Longitude',
'Country',
'project_name']].sort_values('boardapprovaldate')
Can someone please explain the difference and why this happens?
Using isin
checks for equality. So values like "Bosnia and Herzegovina" which are not in your countries
list will not match.
Using contains
checks for sub-strings. "Bosnia" is a sub-string of "Bosnia and Herzegovina".
srs = pd.Series(["Bosnia and Herzegovina", "Bosnia"])
>>> srs.isin(["Bosnia"])
0 False
1 True
dtype: bool
>>> srs.str.contains("Bosnia")
0 True
1 True
dtype: bool