Travis CI encodes ü as Ã¼


I'm writing some Unicode strings to HTML in Python. I use Unicode internally and only encode on output, something like:

with open(filename, 'w') as f:
    f.write(s.encode("utf-8"))

This works just as expected on my local machine. But when it runs on Travis CI, the generated files have Ã¼ in place of ü. Any idea why?

Here is my .travis.yml:

language: python
python: 2.7.10
install: pip install -r requirements.txt
script: python main.py -d
deploy:
  provider: s3
  access_key_id: XXX
  secret_access_key:
    secure: XXX
  bucket: www.my.org
  region: us-east-1
  skip_cleanup: true
  default_text_charset: 'utf-8'
  local-dir: output

Update

The minimal Python code that reproduces the problem is as follows:

from pyquery import PyQuery as pq

argurl = 'http://hackingdistributed.com/tag/bitcoin/'

d = pq(url=argurl)

authors = []
for elem in d.find("h2.post-title a"):
    pubinfo = pq(elem).parent().parent().find(".post-metadata .post-published")
    author = pq(pubinfo).find(".post-authors").html().strip()
    authors.append(author)

with open('output/test.html', 'w') as f:
    f.write(': '.join(authors).encode('utf-8'))

Check output/test.html to see the Ã¼.


Solution

  • This seems to be your browser decoding the file with the wrong charset rather than Python writing the wrong bytes: the two UTF-8 bytes for ü (C3 BC), read as Latin-1, display as Ã¼. The easiest fix is to write a UTF-8 byte order mark (BOM) at the start of the file so the browser detects the encoding.

    Here's the fixed code for writing to the file:

    with open('output/test.html', 'w') as f:
        f.write(u'\ufeff'.encode('utf-8'))  # UTF-8 byte order mark (BOM)
        f.write(': '.join(authors).encode('utf-8'))
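
To illustrate the answer above, here is a minimal sketch of both halves: how a Latin-1 reader turns correct UTF-8 bytes into Ã¼, and how the standard `utf-8-sig` codec can write the BOM for you. The sample name `Müller` and the temporary output path are assumptions for illustration, not taken from the question, and the Latin-1 misreading is the presumed cause rather than a confirmed one.

```python
# -*- coding: utf-8 -*-
import codecs
import io
import os
import tempfile

text = u'M\u00fcller'  # hypothetical author name containing ü

# UTF-8 stores ü as the two bytes C3 BC.
utf8_bytes = text.encode('utf-8')

# A reader that assumes Latin-1 maps each byte to its own character,
# so the text is displayed as "MÃ¼ller".
misread = utf8_bytes.decode('latin-1')

# The 'utf-8-sig' codec emits the BOM once, before the first text written.
# io.open behaves the same on Python 2.7 and Python 3.
path = os.path.join(tempfile.mkdtemp(), 'test.html')  # hypothetical path
with io.open(path, 'w', encoding='utf-8-sig') as f:
    f.write(text)

# Read the raw bytes back to confirm the BOM is there.
with open(path, 'rb') as f:
    raw = f.read()

print(repr(misread))
print(raw.startswith(codecs.BOM_UTF8))
```

With `io.open(..., encoding='utf-8-sig')` you write Unicode strings directly and drop the manual `.encode('utf-8')` calls, which keeps the "encode only on output" rule in one place.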