python beautifulsoup html extract persian

Text Extracting: Used All Methods, Yet Stuck

I want to extract some text from a webpage. I searched Stack Overflow (as well as other sites) for a proper method. I tried HTML2TEXT, BeautifulSoup, NLTK and some manual methods to do the extraction, and all of them failed. For example:

HTML2TEXT works on offline (saved) pages, and I need to do it online.
BS4 doesn't work properly with Unicode (my page is Persian text encoded in UTF-8) and won't extract the text. It also returns HTML tags/code. I only need the rendered text.
NLTK doesn't work on my Persian text. Even opening the page with urllib.request.urlopen gives me errors. So, as you can see, I'm stuck after trying several methods.

Here's my target URL: http://vynylyn.yolasite.com/page2.php I want to extract only the Persian paragraphs, without tags/code.

(Note: I use Eclipse Kepler with Python 3.4. After extracting the text, I want to do POS tagging, word/sentence tokenizing, etc. on it.)

What are my options to get this working?

Solution

I'd start with your second option. BeautifulSoup 4 should (and does) definitely support Unicode (note that the page is UTF-8, a universal character encoding, so there's nothing Persian-specific about it).
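To make the encoding point concrete, here is a small sketch using only the standard library. The sample string is an assumption for illustration and is not taken from the target page. It shows that Persian text survives a UTF-8 round trip unchanged, and that garbled output usually comes from decoding the bytes with the wrong codec rather than from the text being Persian.

```python
# Illustrative sample text (an assumption, not from the target page).
persian = "سلام دنیا"

# UTF-8 can encode any Unicode text as bytes.
raw = persian.encode("utf-8")

# Decoding with the matching codec restores the text exactly.
round_tripped = raw.decode("utf-8")

# Decoding the same bytes with the wrong codec raises no error here,
# it just silently produces unreadable characters.
garbled = raw.decode("latin-1")

print(round_tripped == persian)
print(garbled == persian)
```

If extracted text looks like the garbled variant, the likely culprit is the decoding step, not the parser.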

And yes, you will get tags, as it's an HTML page. Try searching for a unique ID, or look at the HTML structure on the page(s). For your example, look for element main and then content elements below that, or maybe use div#I1_sys_txt in that specific page. Once you have your element, you just need to call get_text().

Try this (now in Python 3):

#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup

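# Fetch the page and fail early on an HTTP error status.
response = requests.get('http://vynylyn.yolasite.com/page2.php')
response.raise_for_status()

# Pass the raw bytes rather than response.text: if the server sends no charset
# header, requests may decode as ISO-8859-1 and garble the Persian text.
# Naming the parser explicitly also avoids bs4's "no parser specified" warning.
soup = BeautifulSoup(response.content, 'html.parser')

# Look up the content container by its id and print only its text.
tag = soup.find('div', id='I1_sys_txt')
print(tag.get_text() if tag else "<none found>")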
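
If you want to try the extraction step without hitting the network, the sketch below runs the same idea on an inline HTML snippet. The markup and its Persian sentences are hypothetical stand-ins and not the real page; only the id I1_sys_txt is borrowed from above. It uses select_one with the div#I1_sys_txt selector and collects one clean string per paragraph.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, as UTF-8 bytes.
html = """
<html><head><meta charset="utf-8"><title>t</title></head>
<body>
  <div id="main">
    <div id="I1_sys_txt"><p>سلام دنیا</p><p>این یک آزمایش است</p></div>
  </div>
  <script>var x = 1;</script>
</body></html>
""".encode("utf-8")

# Passing bytes lets BeautifulSoup work out the encoding itself.
soup = BeautifulSoup(html, "html.parser")

# CSS-selector form of the same lookup as find('div', id='I1_sys_txt').
tag = soup.select_one("div#I1_sys_txt")

# One string per <p>, with surrounding whitespace stripped and no tags.
paragraphs = [p.get_text(strip=True) for p in tag.find_all("p")] if tag else []

for paragraph in paragraphs:
    print(paragraph)
```

The resulting list of plain strings is a reasonable starting point for whatever tokenizer or tagger you use afterwards.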