python regex dictionary defaultdict

Accessing a value in a defaultdict and extracting the URL portion of it


I have a very large defaultdict of dicts, where each inner dict contains the HTML from an email body. I only want to return the http URL from within the inner dict. What's the best way to go about extracting that?

Do I need to convert the dict to another data structure before using regex? Is there a better way? I'm still fairly new to Python and appreciate any pointers.

For example, what I'm working with:

defaultdict(<type 'dict'>, {16: {u'SEQ': 16, u'RFC822': u'Delivered-To: somebody@email.com      LOTS MORE HTML until http://the_url_I_want_to_extract.com' }})

One thing I've tried is calling re.findall on the defaultdict itself, which didn't work:

confirmation_link = re.findall('Click this link to confirm your registration:<br />"(.*?)"', body)

for conf in confirmation_link:
    print conf

Error:

line 177, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

Solution

  • The regular expression functions only accept a string, so you first have to iterate over your dictionary and pull out the corresponding value:

    import re
    from collections import defaultdict
    
    # defaultdict takes the factory itself (dict), not its repr <type 'dict'>
    d = defaultdict(dict, {16: {u'SEQ': 16, u'RFC822': u'Delivered-To: somebody@email.com      LOTS MORE HTML until http://the_url_I_want_to_extract.com'}})
    
    for k, v in d.iteritems():
        # v is the inner dictionary that contains your html string
        str_with_html = v['RFC822']
    
        # match "http", then continue until a whitespace character is hit
        match = re.search(r"http[^\s]+", str_with_html)
        if match:
            print match.group(0)
    

    Output:

    http://the_url_I_want_to_extract.com
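
The loop in this answer uses Python 2 idioms (iteritems and the print statement). As a rough sketch of the same idea on Python 3, the version below collects every URL per message rather than only the first, and skips entries that have no body. The function name extract_urls, the URL_PATTERN constant, and the second sample entry (17) are made up for illustration; the pattern is also a simplification that stops at whitespace, quotes, and angle brackets, which may or may not suit real email HTML.

```python
import re
from collections import defaultdict

# Hypothetical sample data shaped like the structure in the question.
messages = defaultdict(dict)
messages[16] = {
    "SEQ": 16,
    "RFC822": "Delivered-To: somebody@email.com      LOTS MORE HTML until "
              "http://the_url_I_want_to_extract.com",
}
messages[17] = {"SEQ": 17}  # illustrative entry with no body; it is skipped

# Assumed pattern: http or https, then everything up to whitespace,
# a quote, or an angle bracket (so an href="..." value ends cleanly).
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(messages, key="RFC822"):
    """Return {message id: [urls]} for every message whose body has a URL."""
    found = {}
    for msg_id, fields in messages.items():
        body = fields.get(key)  # .get avoids a KeyError on missing bodies
        if not isinstance(body, str):
            continue  # the re functions need a string, not a dict or None
        urls = URL_PATTERN.findall(body)
        if urls:
            found[msg_id] = urls
    return found


urls_by_message = extract_urls(messages)
print(urls_by_message)
```

The pattern from the question ('Click this link to confirm your registration:<br />"(.*?)"') should work the same way once it is applied to the body string instead of the defaultdict, since the TypeError comes from passing a non-string to re.findall.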