Tags: html, beautifulsoup, pycurl, stringio, type-conversion

Convert io.BytesIO to io.StringIO to parse HTML page


I'm trying to parse an HTML page I retrieved through pyCurl, but the pyCurl WRITEFUNCTION returns the page as bytes rather than a string, so I'm unable to parse it using BeautifulSoup.

Is there any way to convert io.BytesIO to io.StringIO?

Or is there any other way to parse the HTML page?

I'm using Python 3.3.2.


Solution

  • A naive approach:

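    import io

    # Assume bytes_io is an `io.BytesIO` object that pyCurl has written to.
    # getvalue() returns everything written, regardless of the stream position
    # (read() right after writing returns b'' unless you seek(0) first).
    byte_str = bytes_io.getvalue()
    
    # Decode to a str (Unicode text in Python 3)
    text_obj = byte_str.decode('UTF-8')  # Or use the encoding you expect
    
    # Use text_obj how you see fit!
    # io.StringIO(text_obj) will get you a StringIO object if that's what you need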
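
For reference, here is a minimal, self-contained sketch of the conversion using only the standard library. The buffer contents are hypothetical sample data standing in for what pyCurl would write, and UTF-8 is an assumption; substitute the encoding the server actually reports. It shows two options: decoding the whole buffer, or wrapping it in `io.TextIOWrapper` so it can be read as a text stream.

```python
import io

# Stand-in for the buffer pyCurl would fill via WRITEFUNCTION
# (hypothetical sample data; UTF-8 is assumed here).
bytes_io = io.BytesIO()
bytes_io.write("<html><body><p>caf\u00e9</p></body></html>".encode("utf-8"))

# Option 1: decode everything written so far, then build a StringIO if needed.
html_text = bytes_io.getvalue().decode("utf-8")
string_io = io.StringIO(html_text)

# Option 2: wrap the byte stream in a text layer.
# Rewind first, because writing left the position at the end of the buffer.
bytes_io.seek(0)
text_stream = io.TextIOWrapper(bytes_io, encoding="utf-8")
wrapped_text = text_stream.read()
```

Either `html_text` or `string_io` can then be handed to the parser. Option 2 avoids holding a second full copy of the data only if you read the stream incrementally instead of calling `read()` on all of it at once.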