Tags: python, beautifulsoup, html, extract, persian

Text Extracting: Used All Methods, Yet Stuck


I want to extract some text from a webpage. I searched Stack Overflow (as well as other sites) for a suitable method. I tried html2text, BeautifulSoup, NLTK and some manual approaches, and all of them failed. For example:

  • html2text works on offline (saved) pages, and I need to process pages online.
  • BS4 doesn't work properly with Unicode (my page is Persian text encoded as UTF-8) and doesn't extract the text. It also returns HTML tags and code. I only need the rendered text.
  • NLTK doesn't work on my Persian text. Even opening my page with urllib.request.urlopen raises errors. So, as you can see, I'm stuck after trying several methods.

Here's my target URL: http://vynylyn.yolasite.com/page2.php. I want to extract only the Persian paragraphs, without tags or code.

(Note: I use Eclipse Kepler with Python 3.4. After extracting the text, I want to do POS tagging, word and sentence tokenizing, etc. on it.)

What are my options to get this working?


Solution

  • I'd start with your second option. BeautifulSoup 4 definitely supports Unicode (note that the page is UTF-8, a universal character encoding, so there's nothing Persian-specific about it).

    And yes, you will get tags, since it's an HTML page. Try searching for a unique ID, or look at the HTML structure of the page(s). For your example, look for the element main and then the content elements below it, or use div#I1_sys_txt on that specific page. Once you have your element, you just need to call get_text() on it.

    Try this (now in Python 3):

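    #!/usr/bin/env python3
    import requests
    from bs4 import BeautifulSoup

    # Fetch the page; fail early on HTTP errors.
    response = requests.get('http://vynylyn.yolasite.com/page2.php')
    response.raise_for_status()
    # Pass the raw bytes so BeautifulSoup detects the encoding itself,
    # and name the parser explicitly.
    soup = BeautifulSoup(response.content, 'html.parser')

    # Locate the content container by its id and print only its text.
    tag = soup.find('div', id='I1_sys_txt')
    print(tag.get_text() if tag else "<none found>")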
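
For experimenting without network access or extra packages, the same idea can be sketched with Python's standard html.parser module instead of BeautifulSoup. This is only an illustration under stated assumptions: the sample markup and its Persian sentences are made up, the TextById class and extract_text helper are hypothetical names, and the depth counting assumes well-formed HTML in which every non-void tag is closed. The id I1_sys_txt is the one mentioned in the answer.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collect the text inside the first element with a given id (hypothetical helper)."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0       # greater than 0 while inside the target element
        self.done = False    # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.done and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        # Keep only non-blank text found inside the target element.
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(raw_bytes, target_id):
    """Decode UTF-8 bytes and return the target element's text, one chunk per line."""
    parser = TextById(target_id)
    parser.feed(raw_bytes.decode("utf-8"))
    parser.close()
    return "\n".join(parser.chunks)


# Made-up sample page, encoded to bytes the way a server would send it.
SAMPLE = """
<html><body>
  <div id="menu"><a href="#">Home</a></div>
  <div id="I1_sys_txt">
    <p>سلام دنیا</p>
    <p>این یک متن نمونه است.</p>
  </div>
</body></html>
""".encode("utf-8")

text = extract_text(SAMPLE, "I1_sys_txt")
print(text)
```

On real pages with unclosed tags the depth counter can drift, which is one reason a lenient parser such as BeautifulSoup is usually the more practical choice.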