I am currently trying to strip the HTML tags from text such as the following:
<p><b>Masala</b> films of <a href="/wiki/Cinema_of_India" title="Cinema of India">Indian cinema</a> are those that mix genres in one work. Typically these films freely mix <a href="/wiki/Action_film" title="Action film">action</a>, <a href="/wiki/Comedy_film" title="Comedy film">comedy</a>, <a href="/wiki/Romance_film" title="Romance film">romance</a>, and <a href="/wiki/Drama_film" title="Drama film">drama</a> or <a href="/wiki/Melodrama" title="Melodrama">melodrama</a>.<sup class="reference" id="cite_ref-Ganti2004_1-0"><a href="#cite_note-Ganti2004-1"><span>[</span>1<span>]</span></a></sup> They tend to be <a href="/wiki/Musical_film" title="Musical film">musicals</a> that include songs filmed in picturesque locations. The genre is named after the <a href="/wiki/Spice_mix" title="Spice mix">masala</a>, a mixture of <a href="/wiki/Spice" title="Spice">spices</a> in <a href="/wiki/Indian_cuisine" title="Indian cuisine">Indian cuisine</a>.<sup class="reference" id="cite_ref-2"><a href="#cite_note-2"><span>[</span>2<span>]</span></a></sup> According to <i><a href="/wiki/The_Hindu" title="The Hindu">The Hindu</a></i>, masala is the most popular genre of Indian cinema.<sup class="reference" id="cite_ref-3"><a href="#cite_note-3"><span>[</span>3<span>]</span></a></sup></p>
The stripper code I am using is as follows:
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    print html
    s.feed(html)
    return s.get_data()
When I try to strip the paragraph above, I get the following error:
para = strip_tags(paragraph)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-97-0f8917286c8e> in <module>()
2 for key, val in film_links.items():
3 paragraph = get_description_from_url( val, key)
----> 4 para = strip_tags(paragraph)
5 film_genre_with_des.append([key, val, para])
<ipython-input-91-0c0e68f587c6> in strip_tags(html)
13 s = MLStripper()
14 print html
---> 15 s.feed(html)
16 return s.get_data()
/Users/ruby/anaconda/lib/python2.7/HTMLParser.pyc in feed(self, data)
114 as you want (may include '\n').
115 """
--> 116 self.rawdata = self.rawdata + data
117 self.goahead(0)
118
TypeError: cannot concatenate 'str' and 'Tag' objects
I am not sure why this is not working. The stripper code is meant for Python 2.7, which is the version I am using.
Alternatively, you can use the BeautifulSoup HTML parser and simply get the text:
from bs4 import BeautifulSoup
data = '<p><b>Masala</b> films of <a href="/wiki/Cinema_of_India" title="Cinema of India">Indian cinema</a> are those that mix genres in one work. Typically these films freely mix <a href="/wiki/Action_film" title="Action film">action</a>, <a href="/wiki/Comedy_film" title="Comedy film">comedy</a>, <a href="/wiki/Romance_film" title="Romance film">romance</a>, and <a href="/wiki/Drama_film" title="Drama film">drama</a> or <a href="/wiki/Melodrama" title="Melodrama">melodrama</a>.<sup class="reference" id="cite_ref-Ganti2004_1-0"><a href="#cite_note-Ganti2004-1"><span>[</span>1<span>]</span></a></sup> They tend to be <a href="/wiki/Musical_film" title="Musical film">musicals</a> that include songs filmed in picturesque locations. The genre is named after the <a href="/wiki/Spice_mix" title="Spice mix">masala</a>, a mixture of <a href="/wiki/Spice" title="Spice">spices</a> in <a href="/wiki/Indian_cuisine" title="Indian cuisine">Indian cuisine</a>.<sup class="reference" id="cite_ref-2"><a href="#cite_note-2"><span>[</span>2<span>]</span></a></sup> According to <i><a href="/wiki/The_Hindu" title="The Hindu">The Hindu</a></i>, masala is the most popular genre of Indian cinema.<sup class="reference" id="cite_ref-3"><a href="#cite_note-3"><span>[</span>3<span>]</span></a></sup></p>'
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly; bs4 warns when it has to guess
print soup.get_text()
Prints:
Masala films of Indian cinema are those that mix genres in one work. Typically these films freely mix action, comedy, romance, and drama or melodrama.[1] They tend to be musicals that include songs filmed in picturesque locations. The genre is named after the masala, a mixture of spices in Indian cuisine.[2] According to The Hindu, masala is the most popular genre of Indian cinema.[3]
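As for the error itself: the message "cannot concatenate 'str' and 'Tag' objects" suggests that `paragraph` is not a string but a BeautifulSoup `Tag`, presumably returned by `get_description_from_url` (I cannot see that function, so this is an assumption). If so, calling `paragraph.get_text()` directly is probably the simplest fix. If you would rather keep an HTMLParser-based stripper, a minimal sketch is to convert the input to a string before feeding it. The names `TextCollector`, `strip_tags_safe` and `FakeTag` below are made up for illustration, and `FakeTag` only stands in for a real `Tag` so the example runs without bs4:

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3

try:
    string_types = basestring              # Python 2: covers str and unicode
except NameError:
    string_types = str                     # Python 3


class TextCollector(HTMLParser):
    """Collects only the text nodes seen while parsing."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_safe(markup):
    # feed() only accepts strings, so anything else (for example a
    # bs4 Tag) is converted to its string form first.
    if not isinstance(markup, string_types):
        markup = str(markup)
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return ''.join(parser.chunks)


class FakeTag(object):
    """Hypothetical stand-in for a bs4 Tag: not a string, but str() yields markup."""

    def __str__(self):
        return ('<p><b>Masala</b> films of '
                '<a href="/wiki/Cinema_of_India">Indian cinema</a></p>')


print(strip_tags_safe(FakeTag()))
# prints: Masala films of Indian cinema
```

In the loop from the question this would amount to `para = strip_tags_safe(paragraph)`, or simply `para = paragraph.get_text()` if `paragraph` really is a `Tag`.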