Replace emotion HTML tags with emotion names using Python


HTML files often contain emotion marks (emoticons) inserted as img tags. A typical one looks like this:

<img alt="" border="0" class="inlineimg" src="images/smilies/smile.png" title="Smile"/>

If there is only one such emotion img tag, it is easy to replace it with its emotion title. For example:

import re

def remove_single_img_tags(data):
    p = re.compile(r'<img.*?/>')
    img = re.findall(p, data)
    emotion = img[0].split('title=')[1].split('/')[0]
    return p.sub(emotion, data) 

test1 = u'I love you.<img alt="" border="0" class="inlineimg" src="images/smilies/smile.png" title="Smile"/>.I hate bad men.'

remove_single_img_tags(test1)

However, when there are multiple emotion marks in the form of img HTML tags, it is not so easy:

def remove_img_tags(data):
    p = re.compile(r'<img.*?/>')
    img = re.findall(p, data)
    emotions = ()
    for i in img:
        emotion = i.split('title=')[1].split('/')[0]
        emotions[i] = emotion
    return p.sub(emotions, data)

test2 = u'I love you<img alt="" border="0" class="inlineimg" src="images/smilies/smile.png" title="Smile"/>I hate bad men <img alt="" border="0" class="inlineimg" src="images/smilies/mad.png" title="Mad"/>'

remove_img_tags(test2)

The Python script above does not work. It raises TypeError: 'tuple' object does not support item assignment


Solution

  • From >>> help(re.sub):

    Help on function sub in module re:
    
    sub(pattern, repl, string, count=0, flags=0)
        Return the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in string by the
        replacement repl.  repl can be either a string or a callable;
        if a string, backslash escapes in it are processed.  If it is
        a callable, it's passed the match object and must return
        a replacement string to be used.
    

    You can supply a callable for the replacement text that takes the match as an argument and returns the replacement text.

    >>> p = re.compile(r'<img.*?/>')
    # repeat test string 5 times as input data
    >>> data = '<img alt="" border="0" class="inlineimg" src="images/smilies/smile.png" title="Smile"/>' * 5
    >>> p.sub(lambda match: match.group().split('title=')[1].split('/')[0], data)
    '"Smile""Smile""Smile""Smile""Smile"'
    

    EDIT: here are the other examples:

    >>> test1 = u'I love you.<img alt="" border="0" class="inlineimg" src="images/smilies/smile.png" title="Smile"/>.I hate bad men.'
    >>> p.sub(lambda match: match.group().split('title=')[1].split('/')[0], test1)
    u'I love you."Smile".I hate bad men.'
    >>> test2 = u'I love you<img alt="" border="0" class="inlineimg" src="images/smilies/smile.png" title="Smile"/>I hate bad men <img alt="" border="0" class="inlineimg" src="images/smilies/mad.png" title="Mad"/>'
    >>> p.sub(lambda match: match.group().split('title=')[1].split('/')[0], test2)
    u'I love you"Smile"I hate bad men "Mad"'
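
For use in a script rather than at the prompt, the same callable approach can be wrapped up as a replacement for remove_img_tags from the question. This is only a sketch: the helper name _title_of is made up for illustration, and it is written for Python 3, so the u'' prefix disappears from the result.

```python
import re

# Same pattern as in the question: a lazy match of one self-closing <img .../> tag.
IMG_TAG = re.compile(r'<img.*?/>')


def _title_of(match):
    """Return the quoted title value of one matched <img> tag (hypothetical helper)."""
    # Text after 'title=' up to the '/' that closes the tag, e.g. '"Smile"'.
    return match.group().split('title=')[1].split('/')[0]


def remove_img_tags(data):
    # re calls _title_of once per match, so every tag gets its own replacement.
    return IMG_TAG.sub(_title_of, data)


test2 = ('I love you<img alt="" border="0" class="inlineimg" '
         'src="images/smilies/smile.png" title="Smile"/>'
         'I hate bad men <img alt="" border="0" class="inlineimg" '
         'src="images/smilies/mad.png" title="Mad"/>')

print(remove_img_tags(test2))  # I love you"Smile"I hate bad men "Mad"
```

There is no need to collect the tags with findall first, because sub already visits every match and asks the callable for its replacement.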
    

    I would also suggest adding the title match to your regex, so that you can extract it by group index:

    >>> p = re.compile(r'<img.*?title=(".*?")/>')
    >>> p.sub(lambda match: match.group(1), test2)
    u'I love you"Smile"I hate bad men "Mad"'
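
If the surrounding quotes are not wanted in the result, the group can capture only the text between them. The sketch below is one possible variation, not the only way to do it. It assumes that title values are double-quoted and that no attribute value contains a > character, and the name img_tags_to_titles is made up for illustration. It is written for Python 3.

```python
import re

# Assumptions: the title value is double-quoted, and no attribute value
# contains '>'. [^>]*? keeps the match inside a single tag, and the group
# captures only the text between the quotes, so the quotes are dropped.
IMG_WITH_TITLE = re.compile(r'<img\b[^>]*?\stitle="([^"]*)"[^>]*>')


def img_tags_to_titles(html):
    """Replace each <img ... title="X" ...> tag with X (hypothetical helper)."""
    return IMG_WITH_TITLE.sub(lambda m: m.group(1), html)


test1 = ('I love you.<img alt="" border="0" class="inlineimg" '
         'src="images/smilies/smile.png" title="Smile"/>.I hate bad men.')

print(img_tags_to_titles(test1))  # I love you.Smile.I hate bad men.
```

An img tag without a title attribute does not match this pattern, so it is left in the text unchanged instead of raising an IndexError as the split-based version would.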