python pandas string-matching

Select rows (with multiple strings) in pandas dataframe that contain only a given string


I'm manipulating the 2017 developer survey results. I want to isolate those rows which contain only the string Python in the HaveWorkedLanguage column.

This is what the df['HaveWorkedLanguage'] column looks like:

0                                                 Swift
1                         JavaScript; Python; Ruby; SQL
2                                     Java; PHP; Python
3                                        Python; R; SQL
4                                                   NaN
5                                 JavaScript; PHP; Rust
6                                        Matlab; Python
7        CoffeeScript; Clojure; Elixir; Erlang; Haskell
8                                        C#; JavaScript
9                                    Objective-C; Swift
10                                               R; SQL
11                                                  NaN
12                                         C; C++; Java
13                          Java; JavaScript; Ruby; SQL
14                                     Assembly; C; C++
15                                   JavaScript; VB.NET
16                                           JavaScript
17                     Python; Matlab; Rust; SQL; Swift
18                                               Python
19                                         Perl; Python
20                                                  NaN
21                                  C#; JavaScript; SQL
22                                                 Java
23                                          Python; SQL
24                                                  NaN
25                                          Java; Scala
26         Java; JavaScript; Objective-C; Python; Swift
27                                                  NaN
28                                               Python
29                                                  NaN
...

I tried using pandas.Series.str.match, which should:

Determine if each string matches a regular expression.

as shown here

import pandas as pd
df = pd.read_csv("survey_results_public.csv")
rows_w_Python = df[df['HaveWorkedLanguage'].str.match("Python", na=False)]['HaveWorkedLanguage']

The problem is that this selects the rows that have Python as the first entry, not the rows that contain only Python, which results in:

3                                        Python; R; SQL
17                     Python; Matlab; Rust; SQL; Swift
18                                               Python
23                                          Python; SQL
28                                               Python
...

How can I keep the rows that contain only Python?


Solution

  • For exact matching, the == operator should suffice; it doesn't require regex.

    df['HaveWorkedLanguage'] == 'Python' returns a boolean filter that is True only where the value is exactly 'Python'. NaN entries compare as False, so they are dropped as well.

    Passing this filter to the DataFrame keeps only the matching rows (only the HaveWorkedLanguage column is shown here):

    df[df['HaveWorkedLanguage'] == 'Python']
    Out: 
       HaveWorkedLanguage
    18             Python
    28             Python
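
A minimal sketch comparing the approaches on a small stand-in for the survey column. The sample frame and the variable names are illustrative and do not come from the survey file. It also shows two regex alternatives for cases where a pattern is genuinely needed: an anchored pattern passed to str.match, and str.fullmatch, which is assumed to be available (pandas 1.1 or later).

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the survey column; values mimic the sample above.
df = pd.DataFrame({
    "HaveWorkedLanguage": [
        "Swift",
        "Python; R; SQL",
        np.nan,
        "Matlab; Python",
        "Python",
        "Python; SQL",
        "Python",
    ]
})
col = df["HaveWorkedLanguage"]

# str.match only anchors at the start of the string,
# so "Python; R; SQL" and "Python; SQL" match as well.
starts_with_python = col.str.match("Python", na=False)

# Equality compares the whole value; NaN compares as False.
only_python_eq = col == "Python"

# Regex alternatives: anchor the end with $, or use str.fullmatch.
only_python_regex = col.str.match(r"Python$", na=False)
only_python_full = col.str.fullmatch("Python", na=False)

print(df[starts_with_python])  # rows that merely start with Python
print(df[only_python_eq])      # rows that contain only Python
```

For a fixed literal such as 'Python', the == comparison is the simplest choice; the regex variants are mainly useful when the target is a pattern rather than a single known string.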