I'm experimenting with RoboBrowser (http://robobrowser.readthedocs.org/en/latest/readme.html), a new Python library based on the Beautiful Soup library. With some help, I have returned an HTML page within a Django app, but I can't figure out how to strip the tags to give me just the text. My Django app contains:
from django.http import HttpResponse
from django.utils.html import strip_tags
from robobrowser import RoboBrowser

def index(request):
    p = str(request.POST.get('p', False))  # p='https://www.yahoo.com/'
    browser = RoboBrowser(history=True)
    browser.open(p)
    html = browser.response
    stripped = strip_tags(html)
    return HttpResponse(stripped)
When I look at the output, I see that it is the same as the original HTML. Also, I don't think RoboBrowser has the text() method of Beautiful Soup.
I also tried this (from "Python code to remove HTML tags from a string"):
def remove_html_markup(s):
    tag = False
    quote = False
    out = ""
    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c
    return out
Same result! How can I remove the HTML tags and return just the text?
BeautifulSoup provides the soup.get_text() method for extracting text from a parsed HTML document (somewhat confusingly, this is equivalent to the getText method and the text property). You can access the parsed HTML of the current page using browser.parsed. So, to get the plain text of the current page, try:
text = browser.parsed.get_text()
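To see what get_text() returns without fetching a live page, here is a minimal sketch that runs BeautifulSoup directly on a static string. SAMPLE_HTML and the page_text helper are hypothetical names invented for this illustration and are not part of RoboBrowser or BeautifulSoup. In the Django view, browser.parsed should already be a BeautifulSoup object, so the parsing step would not be needed there.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the page RoboBrowser fetched.
SAMPLE_HTML = """
<html>
  <head><title>Example</title><style>p { color: red; }</style></head>
  <body>
    <h1>Hello</h1>
    <p>Some <b>bold</b> text.</p>
    <script>var x = 1;</script>
  </body>
</html>
"""


def page_text(html):
    """Return the text of an HTML string with tags, scripts and styles removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Depending on the bs4 version, get_text() may include the contents of
    # <script> and <style> elements, so remove them explicitly first.
    for element in soup.find_all(["script", "style"]):
        element.decompose()
    # separator and strip are get_text() parameters: join the text chunks
    # with newlines and trim the whitespace around each chunk.
    return soup.get_text(separator="\n", strip=True)


text = page_text(SAMPLE_HTML)
print(text)
```

Because each text node becomes its own chunk, inline elements such as <b> end up on separate lines with separator="\n". Passing separator=" " instead may read better for running prose.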