
Get a URL's plain-text data in Python


I would like to get the plain text (i.e. no HTML tags or entities) from a given URL. What library should I use to do that as quickly as possible?

I've tried (maybe there is something faster or better than this):

import re
import mechanize
br = mechanize.Browser()
br.open("myurl.com")
vh = br.viewing_html
# vh is the method object itself, not the page text:
# <bound method Browser.viewing_html of <mechanize._mechanize.Browser instance at 0x01E015A8>>

Thanks


Solution

  • You can use HTML2Text. If the site isn't working for you, you can go to the HTML2Text GitHub repo and get it for Python.

    Or maybe try this:

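    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # Python 3; urlopen needs an explicit scheme such as http://
    html = urlopen('http://myurl.com').read()
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    print(text)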
    

    I don't know if it gets rid of all the JS and such, but it gets rid of the HTML.

    Do some Google searches; there are multiple other questions similar to this one.

    Also, maybe take a look at Read2Text.
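
If leftover JavaScript or CSS in the output is a concern, or if installing a third-party package is not an option, roughly the same idea can be sketched with only the standard library. This is a minimal sketch and not a replacement for the libraries above: the names `TextExtractor`, `html_to_text` and `fetch_text` are made up for illustration, the set of skipped tags is an assumption, and `fetch_text` assumes the server reports a usable charset (falling back to UTF-8).

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}  # assumed list; extend as needed

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string, one line per text node."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


def fetch_text(url):
    """Download a page and return its text (hypothetical helper)."""
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    return html_to_text(html)


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello &amp; welcome</p><script>alert(1)</script></body></html>"
)
print(html_to_text(sample))  # Hello & welcome
```

Calling `fetch_text('http://myurl.com')` would apply the same extraction to a live page. A hand-rolled parser like this is likely less forgiving of badly broken markup than Beautiful Soup, so it is best treated as a fallback.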