I have a dataset of URLs that contain headlines, such as:
http://www.stackoverflow.com/lifestyle/tech/this-is-a-very-nice-headline-my-friend/2013/04/26/acjhrjk-2e1-1krjke4-9el8c-2eheje_story.html?tid=sm_fb
http://www.stackoverflow.com/2015/07/15/sports/baseball/another-very-nice.html?smid=tw-somedia&seid=auto
http://worldnews.stack.com/news/2013/07/22/54216-hello-another-one-here?lite
http://www.stack.com/article_email/hello-one-here-that-is-cool-1545545554-lMyQjAxMTAHFJELMDgxWj
http://www.stack.com/2013/11/13/tech/tricky-one/the-real-one/index.html
http://www.stack.com/2013/11/13/tech/the-good-one.html
http://www.stack.com/news/science-and-technology/54512-hello-world-here-is-a-weird-character#b02g07f20b14
I need to extract the proper headline from these kinds of links.
So the rule seems to be: find the longest string of the form word1-word2-word3 that has a / at its right or left border, without considering acjhrjk-2e1-1krjke4-9el8c-2eheje in the first link, 54216 in the third one, or suffixes such as .html.
How can I do that using regex in Python? Unfortunately, I believe regex is the only viable solution here. Packages such as yurl or urlparse can capture the path of the URL, but then I am back to using regex to get the headline.
Many thanks!
After all, regular expressions might not be your best bet.
However, with the specifications you came up with, you could do the following:
import re

urls = ['http://www.stackoverflow.com/lifestyle/tech/this-is-a-very-nice-headline-my-friend/2013/04/26/acjhrjk-2e1-1krjke4-9el8c-2eheje_story.html?tid=sm_fb',
        'http://www.stackoverflow.com/2015/07/15/sports/baseball/another-very-nice.html?smid=tw-somedia&seid=auto',
        'http://worldnews.stack.com/news/2013/07/22/54216-hello-another-one-here?lite',
        'http://www.stack.com/article_email/hello-one-here-that-is-cool-1545545554-lMyQjAxMTAHFJELMDgxWj',
        'http://www.stack.com/2013/11/13/tech/tricky-one/the-real-one/index.html',
        'http://www.stack.com/2013/11/13/tech/the-good-one.html',
        'http://www.stack.com/news/science-and-technology/54512-hello-world-here-is-a-weird-character#b02g07f20b14']

# Capture runs of word characters and dashes that follow a "/" and are
# followed by ".", "?", "/", "#", or the end of the string.
regex = re.compile(r'(?<=/)([-\w]+)(?=[.?/#]|$)')
# Runs of three or more digits, plus an optional dash on either side.
digits = re.compile(r'-?\d{3,}-?')

for url in urls:
    substrings = regex.findall(url)
    longest = max(substrings, key=len)  # the longest candidate is taken as the headline
    headline = digits.sub('', longest)  # strip long digit runs such as article IDs
    print(headline)
Output:
this-is-a-very-nice-headline-my-friend
another-very-nice
hello-another-one-here
hello-one-here-that-is-coollMyQjAxMTAHFJELMDgxWj
the-real-one
the-good-one
hello-world-here-is-a-weird-character
See a demo on ideone.com.
Here, the regex uses lookarounds to require a / behind and one of .?/# (or the end of the string) ahead. Any run of word characters and dashes in between is captured.
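To make the lookarounds more concrete, here is a small sketch that prints every candidate the pattern captures for one of the URLs from the question. The variable name candidates is only illustrative. Note that the host prefix www is also captured, because it follows a / and precedes a dot.

```python
import re

# Same pattern as in the answer: a "/" behind, one of ".?/#" or end of string ahead.
regex = re.compile(r'(?<=/)([-\w]+)(?=[.?/#]|$)')

url = 'http://www.stack.com/2013/11/13/tech/tricky-one/the-real-one/index.html'

# findall returns the contents of the single capture group for each match.
candidates = regex.findall(url)
print(candidates)
# ['www', '2013', '11', '13', 'tech', 'tricky-one', 'the-real-one', 'index']

# The longest candidate is the one the answer treats as the headline.
print(max(candidates, key=len))
# the-real-one
```

Since max returns the first of several equally long candidates, ties are resolved in favour of the segment that appears earlier in the URL.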
This is not very specific, but if you are looking for the longest substring and strip runs of three or more consecutive digits afterwards, it might be a good starting point.
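One side effect of the digit pattern is visible in the fourth result: removing the digit run together with both surrounding dashes glues cool to the trailing ID. A possible variant, shown here only as a sketch, is to split the slug on dashes and drop tokens that consist solely of three or more digits. The helper name strip_numeric_tokens is hypothetical, and unlike the original pattern it leaves digit runs inside mixed tokens untouched.

```python
import re

def strip_numeric_tokens(slug):
    """Drop dash-separated tokens made up only of three or more digits."""
    tokens = slug.split('-')
    kept = [t for t in tokens if not re.fullmatch(r'\d{3,}', t)]
    return '-'.join(kept)

print(strip_numeric_tokens('54216-hello-another-one-here'))
# hello-another-one-here

print(strip_numeric_tokens('hello-one-here-that-is-cool-1545545554-lMyQjAxMTAHFJELMDgxWj'))
# hello-one-here-that-is-cool-lMyQjAxMTAHFJELMDgxWj
```

The trailing ID in the second example still survives, so deciding which tokens are real words would need an extra rule or a word list.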
As already said in the comments, you might perhaps be better off using linguistic tools.