HTML Stripper causing error


I am currently stripping HTML from text such as the following:

<p><b>Masala</b> films of <a href="/wiki/Cinema_of_India" title="Cinema of India">Indian cinema</a> are those that mix genres in one work. Typically these films freely mix <a href="/wiki/Action_film" title="Action film">action</a>, <a href="/wiki/Comedy_film" title="Comedy film">comedy</a>, <a href="/wiki/Romance_film" title="Romance film">romance</a>, and <a href="/wiki/Drama_film" title="Drama film">drama</a> or <a href="/wiki/Melodrama" title="Melodrama">melodrama</a>.<sup class="reference" id="cite_ref-Ganti2004_1-0"><a href="#cite_note-Ganti2004-1"><span>[</span>1<span>]</span></a></sup> They tend to be <a href="/wiki/Musical_film" title="Musical film">musicals</a> that include songs filmed in picturesque locations. The genre is named after the <a href="/wiki/Spice_mix" title="Spice mix">masala</a>, a mixture of <a href="/wiki/Spice" title="Spice">spices</a> in <a href="/wiki/Indian_cuisine" title="Indian cuisine">Indian cuisine</a>.<sup class="reference" id="cite_ref-2"><a href="#cite_note-2"><span>[</span>2<span>]</span></a></sup> According to <i><a href="/wiki/The_Hindu" title="The Hindu">The Hindu</a></i>, masala is the most popular genre of Indian cinema.<sup class="reference" id="cite_ref-3"><a href="#cite_note-3"><span>[</span>3<span>]</span></a></sup></p>

The stripper code I am using is as follows:

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    print html
    s.feed(html)
    return s.get_data()

When I try to strip the paragraph above, I get the following error:

para = strip_tags(paragraph)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-97-0f8917286c8e> in <module>()
      2 for key, val in film_links.items():
      3     paragraph = get_description_from_url( val, key)
----> 4     para = strip_tags(paragraph)
      5     film_genre_with_des.append([key, val, para])

<ipython-input-91-0c0e68f587c6> in strip_tags(html)
     13     s = MLStripper()
     14     print html
---> 15     s.feed(html)
     16     return s.get_data()

/Users/ruby/anaconda/lib/python2.7/HTMLParser.pyc in feed(self, data)
    114         as you want (may include '\n').
    115         """
--> 116         self.rawdata = self.rawdata + data
    117         self.goahead(0)
    118 

TypeError: cannot concatenate 'str' and 'Tag' objects

I am not sure why this is not working. The stripper is meant to work on Python 2.7, which is the version I am using.


Solution

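  • The error indicates that `paragraph` is a `Tag` object (most likely a BeautifulSoup element), not a string, so `feed()` cannot concatenate it to its internal `str` buffer. Instead of a custom stripper, you can use the BeautifulSoup HTML parser and simply get the text: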

    from bs4 import BeautifulSoup
    
    data = '<p><b>Masala</b> films of <a href="/wiki/Cinema_of_India" title="Cinema of India">Indian cinema</a> are those that mix genres in one work. Typically these films freely mix <a href="/wiki/Action_film" title="Action film">action</a>, <a href="/wiki/Comedy_film" title="Comedy film">comedy</a>, <a href="/wiki/Romance_film" title="Romance film">romance</a>, and <a href="/wiki/Drama_film" title="Drama film">drama</a> or <a href="/wiki/Melodrama" title="Melodrama">melodrama</a>.<sup class="reference" id="cite_ref-Ganti2004_1-0"><a href="#cite_note-Ganti2004-1"><span>[</span>1<span>]</span></a></sup> They tend to be <a href="/wiki/Musical_film" title="Musical film">musicals</a> that include songs filmed in picturesque locations. The genre is named after the <a href="/wiki/Spice_mix" title="Spice mix">masala</a>, a mixture of <a href="/wiki/Spice" title="Spice">spices</a> in <a href="/wiki/Indian_cuisine" title="Indian cuisine">Indian cuisine</a>.<sup class="reference" id="cite_ref-2"><a href="#cite_note-2"><span>[</span>2<span>]</span></a></sup> According to <i><a href="/wiki/The_Hindu" title="The Hindu">The Hindu</a></i>, masala is the most popular genre of Indian cinema.<sup class="reference" id="cite_ref-3"><a href="#cite_note-3"><span>[</span>3<span>]</span></a></sup></p>'
    
    soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly
    print soup.get_text()
    

    Prints:

    Masala films of Indian cinema are those that mix genres in one work. Typically these films freely mix action, comedy, romance, and drama or melodrama.[1] They tend to be musicals that include songs filmed in picturesque locations. The genre is named after the masala, a mixture of spices in Indian cuisine.[2] According to The Hindu, masala is the most popular genre of Indian cinema.[3]
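
If you would rather keep the original `MLStripper`, one possible fix is to make sure `feed()` always receives a string. The following is a minimal sketch under that assumption, not a definitive implementation: `FakeTag` is a hypothetical stand-in for whatever non-string object `get_description_from_url` returns, and the import fallback exists only so the same code runs on both Python 2.7 and Python 3.

```python
try:
    from HTMLParser import HTMLParser    # Python 2.7
    string_types = basestring
    text_type = unicode
except ImportError:
    from html.parser import HTMLParser   # Python 3
    string_types = str
    text_type = str


class MLStripper(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)  # sets up parser state (calls reset())
        self.fed = []

    def handle_data(self, d):
        # Called for every run of text between tags.
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    # Assumption: anything that is not already a string (for example a
    # BeautifulSoup Tag) renders itself as markup when converted to text.
    if not isinstance(html, string_types):
        html = text_type(html)
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()


class FakeTag(object):
    """Hypothetical stand-in for a non-string object such as a Tag."""
    def __str__(self):
        return '<a href="/wiki/Cinema_of_India">Indian cinema</a>'


print(strip_tags('<p><b>Masala</b> films</p>'))  # Masala films
print(strip_tags(FakeTag()))                     # Indian cinema
```

If `paragraph` really is a BeautifulSoup `Tag`, calling `paragraph.get_text()` directly is simpler and avoids parsing the markup a second time.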