Tags: python, python-2.7, cx-freeze

Python 2.7 cx_freeze: What are the risks of removing the encodings that I don't need?


When I run cx_freeze on my application, I get a list of encodings as follows:

m encodings.aliases
m encodings.ascii
m encodings.base64_codec
m encodings.big5
m encodings.big5hkscs
m encodings.bz2_codec
m encodings.charmap
m encodings.cp037
m encodings.cp1006
m encodings.cp1026
m encodings.cp1140
m encodings.cp1250
m encodings.cp1251
m encodings.cp1252
m encodings.cp1253
m encodings.cp1254
m encodings.cp1255
m encodings.cp1256
m encodings.cp1257
m encodings.cp1258
m encodings.cp424
m encodings.cp437
m encodings.cp500
m encodings.cp720
m encodings.cp737
m encodings.cp775
m encodings.cp850
m encodings.cp852
m encodings.cp855
m encodings.cp856
m encodings.cp857
m encodings.cp858
m encodings.cp860
m encodings.cp861
m encodings.cp862
m encodings.cp863
m encodings.cp864
m encodings.cp865
m encodings.cp866
m encodings.cp869
m encodings.cp874
m encodings.cp875
m encodings.cp932
m encodings.cp949
m encodings.cp950
m encodings.euc_jis_2004
m encodings.euc_jisx0213
m encodings.euc_jp
m encodings.euc_kr
m encodings.gb18030
m encodings.gb2312
m encodings.gbk
m encodings.hex_codec
m encodings.hp_roman8
m encodings.hz
m encodings.idna
m encodings.iso2022_jp
m encodings.iso2022_jp_1
m encodings.iso2022_jp_2
m encodings.iso2022_jp_2004
m encodings.iso2022_jp_3
m encodings.iso2022_jp_ext
m encodings.iso2022_kr
m encodings.iso8859_1
m encodings.iso8859_10
m encodings.iso8859_11
m encodings.iso8859_13
m encodings.iso8859_14
m encodings.iso8859_15
m encodings.iso8859_16
m encodings.iso8859_2
m encodings.iso8859_3
m encodings.iso8859_4
m encodings.iso8859_5
m encodings.iso8859_6
m encodings.iso8859_7
m encodings.iso8859_8
m encodings.iso8859_9
m encodings.johab
m encodings.koi8_r
m encodings.koi8_u
m encodings.latin_1
m encodings.mac_arabic
m encodings.mac_centeuro
m encodings.mac_croatian
m encodings.mac_cyrillic
m encodings.mac_farsi
m encodings.mac_greek
m encodings.mac_iceland
m encodings.mac_latin2
m encodings.mac_roman
m encodings.mac_romanian
m encodings.mac_turkish
m encodings.mbcs
m encodings.palmos
m encodings.ptcp154
m encodings.punycode
m encodings.quopri_codec
m encodings.raw_unicode_escape
m encodings.rot_13
m encodings.shift_jis
m encodings.shift_jis_2004
m encodings.shift_jisx0213
m encodings.string_escape
m encodings.tis_620
m encodings.undefined
m encodings.unicode_escape
m encodings.unicode_internal
m encodings.utf_16
m encodings.utf_16_be
m encodings.utf_16_le
m encodings.utf_32
m encodings.utf_32_be
m encodings.utf_32_le
m encodings.utf_7
m encodings.utf_8
m encodings.utf_8_sig
m encodings.uu_codec
m encodings.zlib_codec

However, at the top of every file (even the __init__.py) I have the following:

# encoding: utf-8

Would this be enough information to safely remove the rest of the encodings, and is there any risk in manually excluding them through the excludes list?

buildOptions = dict(packages=[],
                    excludes=["encodings.cp1006", "encodings.cp037"],
                    includes=[], path=[], include_files=[])
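
For context, this dict gets passed to cx_Freeze's setup() roughly as sketched below (the app name and entry script here are placeholders, not my real ones):

from cx_Freeze import setup, Executable

# Minimal sketch of a setup.py using the options above;
# "myapp" and "main.py" are placeholder names.
setup(name="myapp",
      version="0.1",
      options=dict(build_exe=buildOptions),
      executables=[Executable("main.py")])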

Solution

  • In most cases, most of the codecs can probably be safely excluded, but it's hard to be sure which ones are needed. They're not just used for your source files: if any code, including modules you're importing, does something like b.decode('punycode') or u.encode('cp860'), it will need the corresponding codec (see the sketch after these notes for what a missing codec looks like at runtime).

    At a minimum you should leave ascii, utf_8, latin_1, cp1252 and mbcs in place, since those are common ones to use. Oh, and charmap might be a base class for some of the others, so it's probably safest to leave that in too.

    Notes on the other ones:

    • The codecs starting with cp are Windows/DOS code pages, which you may encounter when running on Windows under different locales.
    • The iso8859 family fill the same role on older Unix systems (modern Linux and Mac systems tend to use UTF-8).
    • The mac family are the equivalents for old Macs (pre OS X? I'm not sure).
    • UTF-16 might be used on Windows for handling Unicode.
    • Code manipulating Python source code might use string_escape and raw_unicode_escape.
    • base64, hex and uu are different ways of transforming binary data into text (to display it, or to put it in text-only formats like JSON).
    • bz2 and zlib are compression algorithms.
    • idna and punycode are used for handling internationalised domain names.
    • UTF-32 and UTF-7 are alternative ways of storing Unicode, not explicitly used very often (Python can actually store strings as UTF-32 in memory, but I don't think it uses codecs.utf_32 for that).
    • Most of the others are encodings for far-eastern (Chinese, Japanese, Korean) text. It should be possible to handle that with Unicode now, but from what I've heard, some of those encodings are still in common use.
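
    To make the failure mode concrete, here's a minimal sketch of what a frozen app does when a codec it needs was excluded from the build. cp860 is just an arbitrary example; in a full Python install this encode would simply succeed:

    # -*- coding: utf-8 -*-
    # If encodings.cp860 was excluded from the frozen build, the codec
    # lookup fails at runtime with a LookupError; in a normal install
    # the encode succeeds and the except branch never runs.
    text = u"caf\xe9"
    try:
        data = text.encode('cp860')
    except LookupError as exc:
        print "codec missing: %s" % exc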

    That should give you a rough idea of what your application might need, but don't forget that some library you're using might use a codec in an unexpected way. Code doesn't normally expect the standard codecs to be missing, so a missing one can fail in surprising places.
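
    As a rough sanity check (not a guarantee), you could run a small probe inside the frozen executable to confirm the codecs you believe you need are still present. The list below is hypothetical; adjust it to your own application. It only uses the standard codecs module:

    import codecs

    # Hypothetical list of codecs this particular app is expected to
    # need. Note that mbcs only exists on Windows builds of Python.
    expected = ['ascii', 'utf_8', 'latin_1', 'cp1252', 'mbcs', 'utf_16']

    for name in expected:
        try:
            codecs.lookup(name)   # raises LookupError if the codec is gone
        except LookupError:
            print "missing codec: %s" % name
        else:
            print "ok: %s" % name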