I am trying to create a Discord bot that reads users' messages and detects when one or more Amazon links are present in a message.
If the message is a multi-line string, I capture different results than when the same message is on a single line.
Here is the code I am using:
import re

AMAZON_REGEX = re.compile("(http[s]?://[a-zA-Z0-9.-]*(?:amazon|amzn).["
                          "a-zA-Z]+(?:.+?(?:ref=[^?]+)|.+(?= )|[^?]+))")

def extract_url(message):
    foo = AMAZON_REGEX.findall(message)
    return foo

user_message = """https://www.amazon.co.uk/dp/B07RLWTXKG blah blah
hello
https://www.amazon.co.uk/dp/B07RLWToop foobar"""
print(extract_url(user_message))
The result of the above code is: ['https://www.amazon.co.uk/dp/B07RLWTXKG blah blah', 'https://www.amazon.co.uk/dp/B07RLWToop']
However, if I change user_message from a multi-line string to a single-line one, then I get the following result: ['https://www.amazon.co.uk/dp/B07RLWTXKG blah blah hello https://www.amazon.co.uk/dp/B07RLWToop']
Why is this the case? Also, how do I capture just the URL without the rest of the users' messages?
It seems like you're having an issue with the exact regex you're using.
After matching the link, your regex keeps capturing the words that follow it, separated by spaces, because the greedy .+(?= ) alternative runs up to the last space it can reach. The newline character stops it from continuing: by default, . in a regex matches any character except a newline. The newline between "blah" and "hello" in the first case is what keeps "hello" from being captured in the multi-line case. As you might know, a line break is itself a character (\n), just like other special characters.
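A quick way to see the newline behaviour is a minimal sketch like the one below. The sample text and variable names are made up for illustration; re.DOTALL is the standard flag that lets . match a newline as well.

```python
import re

# Illustrative two-line text (not taken from the question).
text = "first line\nsecond line"

# By default "." matches any character except a newline,
# so ".+" stops at the end of each line.
default_matches = re.findall(r".+", text)
print(default_matches)  # ['first line', 'second line']

# With re.DOTALL, "." also matches "\n", so ".+" spans both lines.
dotall_matches = re.findall(r".+", text, re.DOTALL)
print(dotall_matches)   # ['first line\nsecond line']
```

This is the same effect you see in your output: on a single line nothing stops the greedy .+, while in the multi-line message each line is matched separately.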
I'm not quite sure what formats an Amazon link can come in, so it's difficult to say exactly how the pattern should look. However, you know that the link will not contain a space, so stopping the match at the first space character is a sensible approach.
(http[s]?:\/\/[a-zA-Z0-9.-]*(?:amazon|amzn).[a-zA-Z]+(?:.+?(?:ref=[^?]+)|.+(?= )|[^?]+))
(http[s]?:\/\/[a-zA-Z0-9.-]*(?:amazon|amzn).[a-zA-Z]+(?:.+?(?:ref=[^?]+)|[^ ]+(?= )|[^?]+))
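The first pattern above is your original and the second is my modified version. In the second one, I turned the .+ in front of your (?= ) lookahead (. basically means "match any character") into [^ ]+ ([^ ] basically means "match any character except a space"). This means the match no longer runs on into the words that follow the URL.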
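To check the change, here is a small sketch that runs the modified pattern on your message in both forms. It also shows one edge case: [^ ] still matches a newline, so a link followed directly by a line break can pull in the next word. STRICT_REGEX is a hypothetical alternative of my own (an assumption, not something from your code) that uses \S* to stop at any whitespace; it is not tuned for every Amazon URL format.

```python
import re

# The modified pattern from above: "[^ ]+" replaces the greedy ".+".
ANSWER_REGEX = re.compile(
    r"(http[s]?://[a-zA-Z0-9.-]*(?:amazon|amzn).[a-zA-Z]+"
    r"(?:.+?(?:ref=[^?]+)|[^ ]+(?= )|[^?]+))"
)

# Hypothetical stricter variant (an assumption, not part of the original code):
# "\S*" stops at any whitespace, including newlines, or at the end of the text.
STRICT_REGEX = re.compile(r"https?://[a-zA-Z0-9.-]*(?:amazon|amzn)\.[a-zA-Z]+\S*")

user_message = """https://www.amazon.co.uk/dp/B07RLWTXKG blah blah
hello
https://www.amazon.co.uk/dp/B07RLWToop foobar"""

multi_line = ANSWER_REGEX.findall(user_message)
single_line = ANSWER_REGEX.findall(user_message.replace("\n", " "))
print(multi_line)   # ['https://www.amazon.co.uk/dp/B07RLWTXKG', 'https://www.amazon.co.uk/dp/B07RLWToop']
print(single_line)  # same two URLs

# Edge case: the link is followed by a newline instead of a space.
edge_case = "look https://www.amazon.co.uk/dp/B07RLWTXKG\nnice"
print(ANSWER_REGEX.findall(edge_case))  # ['https://www.amazon.co.uk/dp/B07RLWTXKG\nnice']
print(STRICT_REGEX.findall(edge_case))  # ['https://www.amazon.co.uk/dp/B07RLWTXKG']
```

If links in your bot's messages are often the last thing on a line, the whitespace-based variant may be the safer starting point.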
Good luck with the Discord bot!