
Regex negative lookahead string with special character python


This is about content dimensions on a website. The link checker tool I'm using supports Python regex, and I want it to report on just one content dimension.

I'd like to match all URLs except the one containing the string de_de (for the --no-follow-url option).

https://www.example.com/int_en
https://www.example.com/int_de
https://www.example.com/de_de  ## should not match; all others should match
https://www.example.com/be_de
https://www.example.com/fr_fr
https://www.example.com/gb_en
https://www.example.com/us_en
https://www.example.com/ch_de
https://www.example.com/ch_it
https://www.example.com/shop

I'm stuck somewhere in between these approaches:

https:\/\/www.example.com\/\bde\_de
https:\/\/www.example.com\/[^de]{2,3}[^de]
https:\/\/www.example.com\/[a-z]{2,3}\_[^d][^e]
https:\/\/www.example.com\/([a-z]{2,3}\_)(?!^de$)
https:\/\/www.example.com\/[a-z]{2,3}\_
https:\/\/www.example.com\/(?!^de\_de$)

How can I use a negative lookahead to match a string with a special character (underscore)? Can I go with something like

(?!^de_de$)

I'm new to regex, any help or input is appreciated.


Solution

  • You could try:

    https:\/\/www.example.com\/.+?(?<!de_de)\b
    

    This matches:

    https://www.example.com/shop
    

    but not:

    https://www.example.com/de_de
    

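    Explanation: here we use a negative lookbehind (?<!de_de) applied to a word boundary (\b). The match succeeds only if, after the host prefix, the regex can reach a word boundary that is not immediately preceded by "de_de". Because the underscore counts as a word character, the first word boundary the pattern can reach in https://www.example.com/de_de is at the end of the string, and the lookbehind rejects it there.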
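
To check this against the URLs from the question, here is a minimal sketch using Python's re module. It makes a few assumptions that are not in the original answer: the dots in the host name are escaped, the patterns are applied with re.match, and the names URLS, LOOKBEHIND, LOOKAHEAD, BROKEN and matching are purely illustrative. It also includes a negative lookahead variant, since that is what the question asked for, and shows why (?!^de_de$) does not exclude anything: ^ can only match at the start of the string, so inside the URL the lookahead's inner pattern never matches and the negative lookahead always succeeds. The underscore itself is an ordinary character in a regex and needs no escaping.

```python
import re

# URLs taken from the question.
URLS = [
    "https://www.example.com/int_en",
    "https://www.example.com/int_de",
    "https://www.example.com/de_de",
    "https://www.example.com/be_de",
    "https://www.example.com/fr_fr",
    "https://www.example.com/gb_en",
    "https://www.example.com/us_en",
    "https://www.example.com/ch_de",
    "https://www.example.com/ch_it",
    "https://www.example.com/shop",
]

# The answer's pattern; escaping the dots is an addition made here.
LOOKBEHIND = re.compile(r"https://www\.example\.com/.+?(?<!de_de)\b")

# Assumed alternative (not from the answer): a negative lookahead that
# rejects URLs whose first path segment is exactly de_de.
LOOKAHEAD = re.compile(r"https://www\.example\.com/(?!de_de(?:/|$))")

# The attempt from the question: ^ cannot match in the middle of the
# string, so this lookahead never rejects anything.
BROKEN = re.compile(r"https://www\.example\.com/(?!^de_de$)")


def matching(pattern, urls):
    """Return the URLs that the compiled pattern matches from the start."""
    return [url for url in urls if pattern.match(url)]


for name, pattern in [("lookbehind", LOOKBEHIND),
                      ("lookahead", LOOKAHEAD),
                      ("broken", BROKEN)]:
    excluded = [url for url in URLS if not pattern.match(url)]
    print(name, "excludes:", excluded)
```

The two working patterns differ on subpaths: the lookbehind version still matches https://www.example.com/de_de/page, because a later word boundary is not preceded by de_de, while the lookahead version rejects it. Which behaviour is wanted depends on how the link checker applies the pattern, which the question does not specify.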