python r pandas data-science text-analysis

Pandas: Error while searching for an asterisk in a dataframe, e.g. busiest_hosts['host'].str.contains('***.botol.dk')

Below is what my dataframe looks like. As you can see, one of its columns is a URL and the other is a timestamp count. When I run this code: busiest_hosts[busiest_hosts['host'].str.contains('***.novo.dk')==True] I get an error: error: nothing to repeat at position 0. I think this is because the first character of my URL is *. It seems like a Python bug (my Python version is 3.x). I would really appreciate it if anyone could help me resolve this.

Solution

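str.contains treats the pattern as a regular expression by default, so it interprets * as an instruction to repeat the preceding character or group. With * as the very first character there is nothing before it to repeat, hence the error. You need to escape the *. And while you're at it, escape the . as well, since an unescaped . matches any character.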

busiest_hosts[busiest_hosts['host'].str.contains(r'\*{3}\.novo\.dk')==True]
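To see that this comes from the regex engine rather than from pandas, here is a minimal sketch using only the standard library re module. The variable names (literal, escaped, matches) are just illustrative. It also shows re.escape, which builds the escaped pattern for you if you would rather not write the backslashes by hand.

```python
import re

literal = '***.novo.dk'

# Compiling the raw text as a pattern fails: the leading '*' is a
# quantifier with nothing in front of it to repeat.
try:
    re.compile(literal)
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)

# re.escape backslash-escapes the regex metacharacters ('*' and '.').
escaped = re.escape(literal)
print(escaped)  # \*\*\*\.novo\.dk

# The escaped pattern now matches the literal text only.
hosts = ['***.novo.dk', '007.thegap.com']
matches = [h for h in hosts if re.search(escaped, h)]
print(matches)  # ['***.novo.dk']
```

The string returned by re.escape can be passed to str.contains in place of a hand-escaped pattern.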

Demo

import pandas as pd

busiest_hosts = pd.DataFrame(dict(host=['***.novo.dk', '007.thegap.com'], timestamp=[16, 45]))

print(busiest_hosts)

             host  timestamp
0     ***.novo.dk         16
1  007.thegap.com         45

busiest_hosts[busiest_hosts['host'].str.contains(r'\*{3}\.novo\.dk')==True]

          host  timestamp
0  ***.novo.dk         16

Or, as the OP pointed out to me ;-), just turn regex matching off with regex=False so the pattern is treated as a literal string:

busiest_hosts[busiest_hosts['host'].str.contains('***.novo.dk', regex=False)==True]
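One more note on the ==True part. If the host column can contain missing values, str.contains typically yields a missing value for those rows instead of True/False, and comparing with ==True is one way to turn those into False. Passing na=False does the same thing more explicitly. A small sketch, assuming a made-up third row with a missing host (that row is not in the original data):

```python
import pandas as pd

# Hypothetical data: the two hosts from the demo plus one missing value.
busiest_hosts = pd.DataFrame(
    dict(host=['***.novo.dk', '007.thegap.com', None], timestamp=[16, 45, 3])
)

# Literal substring match; na=False counts missing hosts as non-matches.
literal_mask = busiest_hosts['host'].str.contains(
    '***.novo.dk', regex=False, na=False
)

# The ==True form from the question gives the same mask.
eq_true_mask = busiest_hosts['host'].str.contains(
    '***.novo.dk', regex=False
) == True

print(busiest_hosts[literal_mask])
```

Either mask can be used to index the dataframe; na=False just makes the intent visible at the call site.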