python, r, pandas, data-science, text-analysis

Pandas: Error while searching asterisk in dataframe. Eg: busiest_hosts['host'].str.contains('***.botol.dk')


My dataframe has two columns: one holds the URL (host) and the other a timestamp count. When I run this code: busiest_hosts[busiest_hosts['host'].str.contains('***.novo.dk')==True] I get an error: error: nothing to repeat at position 0. I think this is because the first character of my URL is *. It seems like a Python bug (my Python version is 3.x). I would really appreciate it if anyone could help me resolve this.



Solution

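  • By default, str.contains treats the pattern as a regular expression, so it interprets * as a quantifier that repeats the preceding character or group. At position 0 there is nothing before it to repeat, hence the error. You want to escape the *. And while you're at it, escape the . as well, since an unescaped . matches any character.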

    busiest_hosts[busiest_hosts['host'].str.contains(r'\*{3}\.novo\.dk')==True]
    

    demo

    import pandas as pd

    busiest_hosts = pd.DataFrame(dict(host=['***.novo.dk', '007.thegap.com'], timestamp=[16, 45]))
    
    print(busiest_hosts)
    
                 host  timestamp
    0     ***.novo.dk         16
    1  007.thegap.com         45
    

    busiest_hosts[busiest_hosts['host'].str.contains(r'\*{3}\.novo\.dk')==True]
    
              host  timestamp
    0  ***.novo.dk         16
    

    Or, as the OP pointed out to me ;-), just turn regex matching off with regex=False:

    busiest_hosts[busiest_hosts['host'].str.contains('***.novo.dk', regex=False)==True]
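
If the search string comes from a variable and you would rather not escape it by hand, one possible alternative (not part of the original answer) is to let the standard library's re.escape build the pattern. The sketch below reuses the toy data from the demo. It also assumes that passing na=False is an acceptable substitute for the ==True comparison, since both treat missing hosts as non-matches. The names needle, pattern, raised and matches are illustrative only.

```python
import re

import pandas as pd

# Same toy data as in the demo.
busiest_hosts = pd.DataFrame(
    dict(host=['***.novo.dk', '007.thegap.com'], timestamp=[16, 45])
)

needle = '***.novo.dk'  # the literal text we want to find

# Compiling the literal as a regex fails: the pattern starts with the
# quantifier '*', which has nothing before it to repeat.
try:
    re.compile(needle)
    raised = False
except re.error:
    raised = True

# re.escape backslash-escapes the regex metacharacters ('*' and '.' here),
# so the resulting pattern matches the text literally.
pattern = re.escape(needle)

# na=False makes missing values count as non-matches, so the boolean mask
# can be used directly without comparing to True.
matches = busiest_hosts[busiest_hosts['host'].str.contains(pattern, na=False)]
print(matches)
```

For a fixed literal like this one, regex=False is still the simpler choice. re.escape is mainly useful when the literal has to be embedded in a larger regular expression.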