Below is what my dataframe looks like. As you can see, one of the columns is a URL (host) and the other is a timestamp count. When I run this code:
busiest_hosts[busiest_hosts['host'].str.contains('***.novo.dk')==True]
I get an error: error: nothing to repeat at position 0. I think this is because the first character of my URL is *. It looks like a Python bug (my Python version is 3.x). I would really appreciate it if anyone could help me resolve this.
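str.contains assumes the string is a regular expression by default and interprets the * as a command to repeat the preceding character or expression. Because nothing precedes the first *, the pattern fails to compile with "nothing to repeat". You want to escape the *. And while you're at it, escape the . as well, since an unescaped . matches any character.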
busiest_hosts[busiest_hosts['host'].str.contains(r'\*{3}\.novo\.dk')==True]
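To see where the error comes from, and to avoid escaping by hand, here is a small sketch that uses only the standard re module, which raises the same message as in the question. The variable names are illustrative and not from the original post; re.escape builds the escaped pattern from the literal string.

```python
import re

literal = '***.novo.dk'

# Compiling the unescaped literal fails: the leading * has nothing to repeat.
try:
    re.compile(literal)
    compile_error = None
except re.error as exc:
    compile_error = str(exc)

# re.escape backslash-escapes regex metacharacters such as * and .
escaped = re.escape(literal)

# The escaped pattern matches the literal host but not an unrelated one.
matches_literal = re.search(escaped, '***.novo.dk') is not None
matches_other = re.search(escaped, '007.thegap.com') is not None

print(compile_error)
print(escaped)
print(matches_literal, matches_other)
```

The escaped string can then be passed to str.contains in place of a hand-written pattern.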
demo
import pandas as pd

busiest_hosts = pd.DataFrame(dict(host=['***.novo.dk', '007.thegap.com'], timestamp=[16, 45]))
print(busiest_hosts)
             host  timestamp
0     ***.novo.dk         16
1  007.thegap.com         45
busiest_hosts[busiest_hosts['host'].str.contains(r'\*{3}\.novo\.dk')==True]
          host  timestamp
0  ***.novo.dk         16
Or, as OP pointed out to me ;-), just turn regex matching off with regex=False:
busiest_hosts[busiest_hosts['host'].str.contains('***.novo.dk', regex=False)==True]
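A minimal sketch comparing the two approaches on the demo data. The extra row with a missing host and the na=False argument are additions for illustration, not part of the question. na=False makes missing values count as non-matches, which is the job the ==True comparison does above.

```python
import pandas as pd

# Demo data plus one hypothetical row with a missing host.
hosts = pd.DataFrame(dict(host=['***.novo.dk', '007.thegap.com', None],
                          timestamp=[16, 45, 3]))

# Literal substring match: no regex compilation, so * and . need no escaping.
by_literal = hosts[hosts['host'].str.contains('***.novo.dk', regex=False, na=False)]

# Regex match with the metacharacters escaped in a raw string.
by_regex = hosts[hosts['host'].str.contains(r'\*{3}\.novo\.dk', na=False)]

print(by_literal)
print(by_regex.equals(by_literal))
```

For a fixed string like this one, regex=False is the simpler choice; the escaped pattern is only needed when the rest of the pattern really is a regex.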