I have a very large defaultdict that has a dict within a dict, the inner dict containing html from an email body. I only want to return an http string from within the inner dict. What's the best way to go about extracting that?
Do I need to convert the dict to another data structure before using regex? Is there a better way? I'm still fairly new to Python and appreciate any pointers.
For example, what I'm working with:
defaultdict(<type 'dict'>, {16: {u'SEQ': 16, u'RFC822': u'Delivered-To: somebody@email.com LOTS MORE HTML until http://the_url_I_want_to_extract.com'}})
One thing I've tried is calling re.findall directly on the defaultdict, which didn't work:
confirmation_link = re.findall('Click this link to confirm your registration:<br />"(.*?)"', body)
for conf in confirmation_link:
    print conf
Error:
line 177, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
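The TypeError is raised because re.findall expects a string, but it was given a dictionary. You can only apply the regular expression once you've iterated over your dictionary and pulled out the corresponding string value: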
import re
from collections import defaultdict

d = defaultdict(dict, {16: {u'SEQ': 16, u'RFC822': u'Delivered-To: somebody@email.com LOTS MORE HTML until http://the_url_I_want_to_extract.com'}})

for k, v in d.iteritems():
    # v is the inner dictionary that contains your html string:
    str_with_html = v['RFC822']
    # this regular expression matches "http" and then continues
    # until a whitespace character is hit.
    match = re.search(r"http[^\s]+", str_with_html)
    if match:
        print match.group(0)
Output:
http://the_url_I_want_to_extract.com
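If you have many messages and want every URL rather than just the first, the same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: the names extract_urls, field and URL_PATTERN are made up for illustration, message 17 is invented sample data, and the code uses .items() and print() so that it runs on both Python 2 and Python 3.

```python
import re
from collections import defaultdict

# Same pattern as above: "http" followed by a run of non-whitespace characters.
URL_PATTERN = re.compile(r"http[^\s]+")


def extract_urls(messages, field="RFC822"):
    """Map each message key to the list of URLs found in its body.

    extract_urls and field are hypothetical names used for illustration.
    """
    urls_by_key = {}
    for key, message in messages.items():
        # Fall back to an empty string when a message has no body,
        # so the regex is always handed a string and never a dict.
        body = message.get(field, "")
        urls_by_key[key] = URL_PATTERN.findall(body)
    return urls_by_key


# Sample data modelled on the question; message 17 is invented for this sketch.
d = defaultdict(dict, {
    16: {"SEQ": 16,
         "RFC822": "Delivered-To: somebody@email.com LOTS MORE HTML "
                   "until http://the_url_I_want_to_extract.com"},
    17: {"SEQ": 17},
})

result = extract_urls(d)
print(result)
```

Keep in mind that in real HTML the link usually sits inside an attribute such as href="...", so a pattern that runs until the next whitespace will also pick up the closing quote and any markup that follows it. If that happens, you may need a stricter pattern or an HTML parser.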