
Python: UTF-16 decoding adds a new blank line on Windows boxes


I'm running into an issue with extra newlines in my output file on Windows that don't show up on *nix platforms.

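import re

# Read the raw bytes and decode them from UTF-16
file = open('UTF16file.xml', 'rb')
html = file.read().decode('utf-16')
file.close()

# Replace every occurrence of the original URL with the new one
regexp = re.compile(self.originalurl, re.S)
(html, changes) = regexp.subn(self.newurl, html)

# Re-encode as UTF-16 and write the result (opened in mode 'w+')
file = open('UTF16file-regexed.xml', 'w+')
file.write(html.encode('utf-16'))
file.close()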

Running this code on my Mac works: I get my file back without the extra line breaks. So far I've tried:

  1. Encoding the regular expression as UTF-16 instead of decoding the file - breaks on both Windows and OS X.

  2. Writing in mode 'wb' instead of 'w+' - breaks on Windows.

Any ideas?


Solution

  • C:\Documents and Settings\Nick>python
    ActivePython 2.6.4.10 (ActiveState Software Inc.) based on
    Python 2.6.4 (r264:75706, Jan 22 2010, 16:41:54) [MSC v.1500 32 bit (Intel)]...
    Type "help", "copyright", "credits" or "license" for more information.
    >>> txt = """here
    ... is all
    ... my text n stuff."""
    >>> f = open('u16.txt','wb')
    >>> f.write(txt.encode('utf-16'))
    >>> f.close()
    >>> exit()
    
    C:\Documents and Settings\Nick>notepad u16.txt
    

    Looks like:

    here is allmy text n stuff.
    

    (although when I copied and pasted it from Notepad into Firefox, the line breaks did show up), but this:

    C:\Documents and Settings\Nick>
        "C:\Program Files\Windows NT\Accessories\wordpad.exe" u16.txt
    

    Looks like:

    here 
    is all
    my text n stuff.
    

    (on Windows XP SP3 32-bit)
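
A likely explanation for what the session above shows, sketched in Python under a few assumptions: on Windows, a file opened in text mode ('w' or 'w+') expands every 0x0A byte into 0x0D 0x0A as it is written. Applied to bytes that are already UTF-16 encoded, that inserts single bytes into a stream of two-byte code units. Opening the file with 'wb' avoids the translation, but the file then contains bare \n line endings, which would explain why Notepad shows everything on one line while WordPad breaks the lines. The helper names simulate_windows_text_mode, write_utf16 and read_utf16 are hypothetical, and the byte replacement only imitates the text-mode translation so the effect can be seen on any platform.

```python
def simulate_windows_text_mode(data):
    """Imitate Windows text-mode output: every LF byte becomes CR LF."""
    return data.replace(b"\n", b"\r\n")


def write_utf16(path, text, newline=u"\r\n"):
    """Write text as UTF-16 in binary mode with explicit line endings."""
    # Normalise to LF first so existing CR LF pairs are not doubled.
    normalised = text.replace(u"\r\n", u"\n").replace(u"\n", newline)
    with open(path, "wb") as handle:
        handle.write(normalised.encode("utf-16"))


def read_utf16(path):
    """Read the raw bytes and decode them, leaving line endings untouched."""
    with open(path, "rb") as handle:
        return handle.read().decode("utf-16")


text = u"here\nis all\nmy text n stuff."

# utf-16-le is used here only to get predictable bytes without a BOM.
encoded = text.encode("utf-16-le")
damaged = simulate_windows_text_mode(encoded)

# One stray 0x0D byte is inserted per newline, shifting the bytes after it.
print(len(encoded), len(damaged))
print(damaged == encoded)
```

If the missing line breaks in Notepad really are down to bare \n endings, calling write_utf16('UTF16file-regexed.xml', html) in place of the 'w+' write should give a file that displays the same way in both Notepad and WordPad.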