Search code examples
pythonpandaspython-datetimeisin

Pandas filtering on multiple columns and rows


I am trying to filter a dataframe based on a certain date and country name. I have extracted the columns I want the final dataframe to contain. I have also created a list of the country names that I want to filter on.

cols_to_keep = projects[['regionname', 'countryname', 'lendinginstr', 'totalamt', 'boardapprovaldate','location', 'GeoLocID', 'GeoLocName','Latitude', 'Longitude', 'Country', 'project_name']]
countries = ['Bosnia', 'Herzegovina', 'Croatia', 'Kosovo', 'Macedonia', 'Serbia', 'Slovenia']

Then I try to filter on the date that I want which is April 27th 1992.

projects['boardapprovaldate'] < pd.Timestamp(datetime.date(1992,4,27))

Then I try applying the date filter on the country names that I want by doing:

cols_to_keep[(cols_to_keep['boardapprovaldate'] < pd.Timestamp(datetime.date(1992,4,27))) & 
             (cols_to_keep['countryname'].isin(countries))].sort_values('boardapprovaldate')

which only shows me 2 rows when I should have 12 rows

enter image description here

However, if I try

str.contains() instead of isin()

method I can retrieve the correct results I want.

projects[(projects['boardapprovaldate'] < pd.Timestamp(datetime.date(1992,4,27))) & 
         ((projects['countryname'].str.contains('Bosnia')) | 
         (projects['countryname'].str.contains('Croatia')) | 
         (projects['countryname'].str.contains('Kosovo')) | 
         (projects['countryname'].str.contains('Macedonia')) | 
         (projects['countryname'].str.contains('Montenegro')) | 
         (projects['countryname'].str.contains('Serbia')) | 
         (projects['countryname'].str.contains('Slovenia')))][
    ['regionname', 
     'countryname', 
     'lendinginstr', 
     'totalamt', 
     'boardapprovaldate',
     'location', 
     'GeoLocID', 
     'GeoLocName',
     'Latitude', 
     'Longitude', 
     'Country', 
     'project_name']].sort_values('boardapprovaldate')

enter image description here

Can someone please explain the difference and why this happens?


Solution

  • Using isin checks for equality. So values like "Bosnia and Herzegovina" which are not in your countries list will not match.

    Using contains checks for sub-strings. "Bosnia" is a sub-string of "Bosnia and Herzegovina".

    Example:
    srs = pd.Series(["Bosnia and Herzegovina", "Bosnia"])
    >>> srs.isin(["Bosnia"])
    0    False
    1    True
    dtype: bool
    
    >>> srs.str.contains("Bosnia")
    0     True
    1     True
    dtype: bool