I want to extract a few text out of a webpage. I searched StackOverFlow (as well as other sites) to find a proper method. I used HTML2TEXT, BEAUTIFULSOUP, NLTK and some other manual methods to do extraction and I failed for example:
Here's my target URL: http://vynylyn.yolasite.com/page2.php I want to extract only Persian paragraphs without tags\codes.
(Note: I use Eclipse Kepler w\ Python 34 also I want to extract text then I want to do POS Tagging, Word\Sentence Tokenizing, etc on the text.)
What are my options to get this working?
I'd go for your second option at first. BeautifulSoup 4 should (and does) definitely support unicode (note it's UTF-8, a global character encoding, so there's nothing Persian about it).
And yes, you will get tags, as it's an HTML page. Try searching for a unique ID, or look at the HTML structure on the page(s). For your example, look for element main
and then content elements below that, or maybe use div#I1_sys_txt
in that specific page. Once you have your element, you just need to call get_text().
Try this (now in Python 3):
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup
content = requests.get('http://vynylyn.yolasite.com/page2.php')
soup = BeautifulSoup(content.text)
tag = soup.find('div', id='I1_sys_txt')
print(tag.get_text() if tag else "<none found>")