I have a problem using BeautifulSoup4. (I'm quite new to Python and BeautifulSoup, so forgive me if I'm missing something obvious.)
Why does the following code:
from bs4 import BeautifulSoup
soup_ko = BeautifulSoup('<select><option>foo</option><option>bar & baz</option><option>qux</option></select>')
soup_ok = BeautifulSoup('<select><option>foo</option><option>bar and baz</option><option>qux</option></select>')
print soup_ko.find_all('option')
print soup_ok.find_all('option')
produce the following output:
[<option>foo</option>, <option>bar & baz</option>]
[<option>foo</option>, <option>bar and baz</option>, <option>qux</option>]
I was expecting the same result in both cases: a list of my 3 options. Instead, the first output is missing the last option (`qux`), so BeautifulSoup seems to dislike the bare ampersand in the text. How can I get the correct list without editing my HTML (or, if necessary, by transforming/converting it)?
Thanks,
Edit: This seems to be a 4.2.0 bug. I downloaded both the 4.2.0 and 4.2.1 versions (from http://www.crummy.com/software/BeautifulSoup/bs4/download/4.2/beautifulsoup4-4.2.0.tar.gz and http://www.crummy.com/software/BeautifulSoup/bs4/download/4.2/beautifulsoup4-4.2.1.tar.gz), unpacked them in my script folder, and changed my code to:
import sys
sys.path.insert(0, "beautifulsoup4-" + sys.argv[1])
from bs4 import BeautifulSoup, __version__
print "Beautiful Soup %s" % __version__
soup_ko = BeautifulSoup('<select><option>foo</option><option>bar & baz</option><option>qux</option></select>')
print soup_ko.find_all('option')
and got the results:
15:24:38 pataluc ~ % python stack.py 4.2.0
Beautiful Soup 4.2.0
[<option>foo</option>, <option>bar & baz</option>]
15:24:41 pataluc ~ % python stack.py 4.2.1
Beautiful Soup 4.2.1
[<option>foo</option>, <option>bar & baz</option>, <option>qux</option>]
So I guess my question is answered. Thanks for your comments, which made me realize it was a version issue.
As I said in the edit to my question, this was a bug in BeautifulSoup 4.2.0. I downloaded 4.2.1 and the bug is gone.
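If upgrading is not possible, one option that matches the "transforming/converting" idea from the question is to escape bare ampersands as `&amp;` before handing the markup to BeautifulSoup. The following is only a minimal sketch of that idea, not something verified against 4.2.0: the helper name `escape_bare_ampersands` and the regular expression are my own assumptions, and the pattern treats anything that looks like `&name;`, `&#123;` or `&#x1F;` as an existing entity and leaves it alone.

```python
import re

# Hypothetical helper, not part of BeautifulSoup.
# Matches an "&" that does NOT start something shaped like an entity:
#   &name;   &#123;   &#x1F;
_BARE_AMP = re.compile(
    r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)'
)

def escape_bare_ampersands(html):
    """Return html with every bare '&' replaced by '&amp;'."""
    return _BARE_AMP.sub('&amp;', html)

html = ('<select><option>foo</option>'
        '<option>bar & baz</option>'
        '<option>qux</option></select>')
fixed = escape_bare_ampersands(html)
print(fixed)
# The cleaned string can then be parsed as usual, e.g. BeautifulSoup(fixed)
```

Note that this rewrites every bare ampersand in the string, including any inside `<script>` blocks, so it is only reasonable for simple markup like the example above. Upgrading to 4.2.1 remains the actual fix.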