python xml unicode rss gb2312

How to parse RSS with GB2312 encoding in Python


I have an RSS feed which is encoded in GB2312.

When I try to parse it using the following code:

for item in XML.ElementFromURL(feed).xpath('//item'):
    title = item.find('title').text

it fails to parse the feed.

Any idea how to parse a GB2312-encoded RSS feed?

After passing the encoding explicitly, as below:

for item in XML.ElementFromURL(feed, encoding='gb2312').xpath('//item'):
    title = item.find('title').text

the Plex Media Server log shows the following error:

***Error Log:***
>  File "C:\Documents and Settings\subhendu.swain\Local Settings\Application Data\Plex Media Server\Plug-ins\Zaobao.bundle\Contents\Code\__init__.py", line 24, in GetDetails
    for item in XML.ElementFromURL(feed, encoding='gb2312').xpath('//item'):
  File "C:\Documents and Settings\subhendu.swain\Local Settings\Application Data\Plex Media Server\Plug-ins\Framework.bundle\Contents\Resources\Versions\2\Python\Framework\api\parsekit.py", line 81, in ElementFromURL
    return self.ElementFromString(self._core.networking.http_request(url, values, headers, cacheTime, autoUpdate, encoding, errors, immediate=True, sleep=sleep, opener=self._opener, txn_id=self._txn_id).content, isHTML=isHTML)
  File "C:\Documents and Settings\subhendu.swain\Local Settings\Application Data\Plex Media Server\Plug-ins\Framework.bundle\Contents\Resources\Versions\2\Python\Framework\api\parsekit.py", line 76, in ElementFromString
    return self._core.data.xml.from_string(string, isHTML)
  File "C:\Documents and Settings\subhendu.swain\Local Settings\Application Data\Plex Media Server\Plug-ins\Framework.bundle\Contents\Resources\Versions\2\Python\Framework\components\data.py", line 134, in from_string
    return etree.fromstring(markup)
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)
  File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)
  File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)
XMLSyntaxError: switching encoding: encoder error, line 1, column 36

2011-09-28 09:34:33,453 (9d0) :  DEBUG (core) - Response: 404

Solution

  • I assume you are using the Plex XML API. The documentation states that you can call XML.ElementFromURL(feed, encoding='gb2312') if you know that this is really the encoding being used.

    If the XML really is encoded in GB2312, its declaration must say so: <?xml version="1.0" encoding="gb2312"?>. Without an encoding in the declaration and without a byte order mark (which signals UTF-16), parsers must assume UTF-8, so using any other character encoding without declaring it makes the XML invalid. Since parsing fails for you when no encoding is specified, it is possible that the RSS feed is not valid XML.
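If you can get at the raw bytes of the feed, one possible workaround is to decode them yourself and hand the parser plain UTF-8. The sketch below is only an illustration under several assumptions: it uses the standard library's xml.etree.ElementTree rather than the Plex XML API, the helper name parse_gb2312_xml is hypothetical, and the sample feed is invented for the example. It decodes the bytes as GB2312, drops the encoding from the XML declaration (so the UTF-8 default described above applies), and parses the re-encoded result.

```python
import re
import xml.etree.ElementTree as ET

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_ENCODING = re.compile(
    r'^(<\?xml[^>]*?)\s+encoding\s*=\s*["\'][^"\']*["\']',
    re.IGNORECASE,
)


def parse_gb2312_xml(raw_bytes, source_encoding="gb2312"):
    """Hypothetical helper: parse XML bytes known to be GB2312-encoded."""
    # Decode with Python's codec instead of relying on the XML parser.
    text = raw_bytes.decode(source_encoding)
    # Remove the declared encoding; the bytes we pass on will be UTF-8,
    # which is what a parser assumes when no encoding is declared.
    text = _DECL_ENCODING.sub(r"\1", text, count=1)
    return ET.fromstring(text.encode("utf-8"))


# Invented sample feed, encoded as GB2312 to stand in for the downloaded bytes.
sample = (
    '<?xml version="1.0" encoding="gb2312"?>'
    "<rss><channel><item><title>新闻</title></item></channel></rss>"
).encode("gb2312")

root = parse_gb2312_xml(sample)
titles = [item.findtext("title") for item in root.iter("item")]
```

If decoding raises UnicodeDecodeError, the feed is probably not GB2312 after all (or uses characters from a superset such as GBK or GB18030), which would be consistent with the encoder error in the log.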