I'm manipulating the 2017 developer survey results. I want to isolate those rows which contain only the string Python in the HaveWorkedLanguage column.
This is what the df['HaveWorkedLanguage'] column looks like:
0 Swift
1 JavaScript; Python; Ruby; SQL
2 Java; PHP; Python
3 Python; R; SQL
4 NaN
5 JavaScript; PHP; Rust
6 Matlab; Python
7 CoffeeScript; Clojure; Elixir; Erlang; Haskell
8 C#; JavaScript
9 Objective-C; Swift
10 R; SQL
11 NaN
12 C; C++; Java
13 Java; JavaScript; Ruby; SQL
14 Assembly; C; C++
15 JavaScript; VB.NET
16 JavaScript
17 Python; Matlab; Rust; SQL; Swift
18 Python
19 Perl; Python
20 NaN
21 C#; JavaScript; SQL
22 Java
23 Python; SQL
24 NaN
25 Java; Scala
26 Java; JavaScript; Objective-C; Python; Swift
27 NaN
28 Python
29 NaN
...
I tried using pandas.Series.str.match, which according to its documentation should "Determine if each string matches a regular expression", as shown here:
import pandas as pd
df = pd.read_csv("survey_results_public.csv")
rows_w_Python = df[df['HaveWorkedLanguage'].str.match("Python", na=False)]['HaveWorkedLanguage']
The problem is that this selects the rows containing Python as the first entry, not those containing only Python, which results in:
3 Python; R; SQL
17 Python; Matlab; Rust; SQL; Swift
18 Python
23 Python; SQL
28 Python
...
How can I keep the rows that contain only Python?
For exact matching, the == operator should suffice; it doesn't require a regex.
df['HaveWorkedLanguage'] == 'Python'
returns a boolean Series that is True only where the value is exactly 'Python' (NaN entries compare as False).
Passing this filter to the DataFrame yields:
df[df['HaveWorkedLanguage'] == 'Python']
Out:
HaveWorkedLanguage
18 Python
28 Python
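To see why str.match behaved the way the question describes, here is a minimal sketch on a small stand-in DataFrame (the values are copied from the sample above; the frame itself and the variable names are assumptions for illustration, not the real survey file). str.match anchors the pattern only at the start of each string, so anything beginning with Python passes; anchoring the end with $ should give the same rows as ==.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey column, using values from the question.
df = pd.DataFrame({
    "HaveWorkedLanguage": [
        "Swift",
        "JavaScript; Python; Ruby; SQL",
        "Python; R; SQL",
        np.nan,
        "Python",
        "Perl; Python",
        "Python",
    ]
})
col = df["HaveWorkedLanguage"]

# str.match anchors only at the start, so "Python; R; SQL" also matches.
starts_with_python = col.str.match("Python", na=False)

# Exact equality: NaN compares as False, so no na= handling is needed.
only_python = col == "Python"

# Regex alternative: anchor the end of the pattern as well.
only_python_regex = col.str.match(r"Python$", na=False)

print(df[starts_with_python])
print(df[only_python])
```

If the real data carried stray whitespace around the value, comparing col.str.strip() instead of col might be needed; that is an assumption about the data, not something the sample shows.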