python, regex, spam-prevention

Python - Detect (spammy) URLs in a string


So, I've been researching this for a while now and I couldn't find anything about detecting a URL inside a string. The problem is that most results are about checking whether a string IS a URL, not whether it contains one. The two results that look best to me are

"Regex to find urls in string in Python" and "Detecting a (naughty or nice) URL or link in a text string"

but the first requires http://, which is not something spammers would use (:P), and the second one isn't a regex, and with my limited knowledge I don't know how to translate either of them. One thing I have considered is something dull like

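spamlist = [".com", ".co.uk"]  # etc.
for word in string.split():  # split into words; iterating the string itself yields single characters
    if any(suffix in word for suffix in spamlist):  # substring check, since a whole word never equals ".com"
        Do().stuff()  # placeholder for whatever should happen on a hit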

But that would honestly do more harm than good, and I am 100% sure there is a better way using regex or something else!

So if anyone knows anything that could help me, I'd be very grateful! I've only been doing Python for 1-2 months, and not very intensively, but I feel like I'm making great progress and this one thing is all that's in the way, really.

EDIT: Sorry for not specifying earlier: I am looking to use this locally, not on a website (Apache) or anything similar. I'm mostly trying to clean out links from files I've got lying around.


Solution

  • As I said in the comments,

    • The solution in "Detecting a (naughty or nice) URL or link in a text string" is a regex. When using it in Python, you should probably make it a raw string or escape the backslashes in it.

    • You really shouldn't reinvent the square wheel here, especially since spam filtering is an arms race (I couldn't remember the exact English phrase for this).
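
To illustrate the raw-string point, here is a minimal sketch. The pattern is a deliberately simple one made up for this example, not the regex from the linked answer, and the names URL_PATTERN, find_urls, and strip_urls are hypothetical. It assumes a "URL" is an optional http(s):// or www. prefix, dot-separated labels ending in an alphabetic TLD, and an optional path, so it will also flag things like file.txt or the domain part of an email address.

```python
import re

# Illustrative pattern only (an assumption for this sketch, not the linked
# answer's regex). The r"..." prefix keeps "\b", "\." and "\S" intact; the
# non-raw equivalent would need doubled backslashes ("\\b", "\\.", "\\S").
URL_PATTERN = re.compile(
    r"\b(?:https?://|www\.)?(?:[a-z0-9-]+\.)+[a-z]{2,}\b(?:/\S*)?",
    re.IGNORECASE,
)


def find_urls(text):
    """Return every substring of text that looks like a URL."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return URL_PATTERN.findall(text)


def strip_urls(text, replacement="[link removed]"):
    """Return text with every URL-like substring replaced."""
    return URL_PATTERN.sub(replacement, text)


sample = "Buy now at cheap-pills.com or visit http://example.co.uk/deal today."
print(find_urls(sample))   # ['cheap-pills.com', 'http://example.co.uk/deal']
print(strip_urls(sample))  # Buy now at [link removed] or visit [link removed] today.
```

For cleaning local files, something like strip_urls could be applied to each file's contents after reading it. The false positives mentioned above are part of why a hand-rolled pattern tends to lose the arms race against real spam.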