This is about content dimensions on a website. The link checker tool I'm using supports Python regex, and I want to get information about just one content dimension.
I'd like to match all URLs except the one containing the string de_de (for the --no-follow-url option).
https://www.example.com/int_en
https://www.example.com/int_de
https://www.example.com/de_de ##should not match or all others should match
https://www.example.com/be_de
https://www.example.com/fr_fr
https://www.example.com/gb_en
https://www.example.com/us_en
https://www.example.com/ch_de
https://www.example.com/ch_it
https://www.example.com/shop
I'm stuck somewhere in between these approaches:
https:\/\/www.example.com\/\bde\_de
https:\/\/www.example.com\/[^de]{2,3}[^de]
https:\/\/www.example.com\/[a-z]{2,3}\_[^d][^e]
https:\/\/www.example.com\/([a-z]{2,3}\_)(?!^de$)
https:\/\/www.example.com\/[a-z]{2,3}\_
https:\/\/www.example.com\/(?!^de\_de$)
How can I use a negative lookahead to match a string with a special character (underscore)? Can I go with something like
(?!^de_de$)
I'm new to regex, so any help or input is appreciated.
You could try:
https:\/\/www.example.com\/.+?(?<!de_de)\b
This matches:
https://www.example.com/shop
but not:
https://www.example.com/de_de
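To check this against the full list from the question, here is a small Python sketch. It assumes the link checker applies the pattern with something like re.search (I haven't verified how the tool does this internally), and the variable names are only for illustration.

```python
import re

# The pattern from this answer, copied unchanged.
pattern = re.compile(r"https:\/\/www.example.com\/.+?(?<!de_de)\b")

# The URLs listed in the question.
urls = [
    "https://www.example.com/int_en",
    "https://www.example.com/int_de",
    "https://www.example.com/de_de",
    "https://www.example.com/be_de",
    "https://www.example.com/fr_fr",
    "https://www.example.com/gb_en",
    "https://www.example.com/us_en",
    "https://www.example.com/ch_de",
    "https://www.example.com/ch_it",
    "https://www.example.com/shop",
]

# Assumption: the tool searches for the pattern anywhere in the URL.
matched = [u for u in urls if pattern.search(u)]
skipped = [u for u in urls if not pattern.search(u)]

print("matched:", matched)
print("skipped:", skipped)
```

With this list, only the de_de URL ends up in skipped. Note that int_de, be_de and ch_de still match, because the lookbehind compares the full five characters de_de, not just the trailing de.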
Explanation: here we use a negative lookbehind (?<!de_de) applied to a word boundary (\b). This means that the match has to end at a word boundary that is not preceded by "de_de".
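Since the question asked specifically about a negative lookahead, here is a possible variant as a rough sketch. It is an alternative I'm suggesting, not the pattern above. The underscore is not a special character in Python regex, so it needs no escaping. The attempt (?!^de_de$) does not help in the middle of a pattern, because ^ inside the lookahead still means "start of the string" and can never match after the host name, so the lookahead always succeeds. The /de_de/page URL below is a hypothetical example, and as before I'm assuming the tool uses something like re.search.

```python
import re

# Hypothetical alternative: refuse to match when the path starts with
# the whole word "de_de". The \b keeps longer words that merely start
# with "de_de" from being excluded.
lookahead = re.compile(r"https://www\.example\.com/(?!de_de\b)")

urls = [
    "https://www.example.com/int_de",
    "https://www.example.com/de_de",
    "https://www.example.com/de_de/page",  # hypothetical deeper page
    "https://www.example.com/shop",
]

# Assumption: the tool searches for the pattern anywhere in the URL.
kept = [u for u in urls if lookahead.search(u)]
print(kept)
```

Here kept contains only the int_de and shop URLs. One difference worth knowing: the lookbehind pattern above still matches a deeper page such as /de_de/page (it finds a word boundary right after the second slash), while this lookahead rejects everything under /de_de. Which behaviour you want depends on how the site is structured.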