Tags: python, html, beautifulsoup, byte-order-mark

Python: Youtube HTML full of BOMs


I'm trying to parse YouTube comments using BeautifulSoup 4 in Python 2.7. For any YouTube video I try, the text I get back is full of BOMs, not just one at the start of the file:

<p> thank you kind sir :)</p>

One appears in almost every comment. This is not the case for other websites (guardian.co.uk). The code I'm using:

import urllib2
from bs4 import BeautifulSoup

# Source (should be taken from file to allow updating but not during wip):
source_url = 'https://www.youtube.com/watch?v=aiYzrCjS02k&feature=related'

# Get html from source:
response = urllib2.urlopen(source_url)
html = response.read()

# html comes with BOM everywhere, which is real ***, get rid of it!
html = html.decode("utf-8-sig")

soup = BeautifulSoup(html)

strings = soup.findAll("div", {"class" : "comment-body"})
print strings

As you can see, I've tried decoding, but as soon as I soup it the BOM character is back. Any ideas?


Solution

  • This seems to be invalid on YouTube's part, but you can't just tell them to fix it; you need a workaround.

    So, here's a simple workaround:

    # html comes with BOM everywhere, which is real ***, get rid of it!
    html = html.replace(b'\xEF\xBB\xBF', b'')
    html = html.decode("utf-8")
    

    (The b prefixes are unnecessary but harmless in Python 2.7, and they make the code work in Python 3. On the other hand, they are a syntax error in Python 2.5, so if that matters more to you, get rid of them.)

    Alternatively, you can decode first and then call replace(u'\uFEFF', u''). This should have exactly the same effect, since each extra BOM decodes harmlessly to a U+FEFF character. But I think it makes more sense to fix the UTF-8 and then decode it, rather than decoding and then fixing the result.
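
To see why the original decode step appeared to do nothing, note that the utf-8-sig codec only strips a single BOM at the very start of the input; any BOM further in survives as a U+FEFF character. Here is a minimal sketch comparing that with the two workarounds above. The `raw` bytes are a made-up stand-in for the downloaded page, not real YouTube output, and the variable names are only for illustration.

```python
import codecs

# Hypothetical sample standing in for the downloaded page: one BOM at the
# very start and another one inside the markup.
raw = codecs.BOM_UTF8 + b'<p>' + codecs.BOM_UTF8 + b' thank you kind sir :)</p>'

# utf-8-sig removes only the leading BOM; the interior one is decoded
# to U+FEFF and stays in the text.
sig_decoded = raw.decode('utf-8-sig')

# Workaround 1: strip the BOM byte sequence everywhere, then decode.
bytes_first = raw.replace(codecs.BOM_UTF8, b'').decode('utf-8')

# Workaround 2: decode as plain UTF-8, then strip every U+FEFF character.
text_first = raw.decode('utf-8').replace(u'\ufeff', u'')

print(u'\ufeff' in sig_decoded)    # True: the interior BOM is still there
print(bytes_first == text_first)   # True: both workarounds agree
print(bytes_first)
```

The prefixed literals keep the sketch valid on both Python 2.7 and Python 3.3+. Whichever workaround you pick, run it before handing the text to BeautifulSoup.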