git utf-8 byte-order-mark

Checking in changes to a UTF-8 BOM using git


I accidentally checked in a UTF-8 encoded text file from Windows without first removing the BOM. I have now removed it in a later version and tried to check in that change, but git seems to ignore the change to the BOM bytes. Is there a setting that makes git check in the file exactly as it is? (I know there is a similar issue with line endings, and that there is a setting for it.)


Solution

  • If you can reproduce this, by all means report a bug.

    Here's my two cents, a test using UTF-32 files without and with a BOM:

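    # Build a 9-byte UTF-8 sample file from a hex dump
    xxd -r > raw <<< "0000000: 4865 c582 c397 c3b8 0a                   He......."
    cat raw # shows "Heł×ø" in UTF8 terminals
    
    git init .
    # UTF32BE names the byte order explicitly, so iconv writes no BOM
    iconv -f UTF-8 -t UTF32BE raw > test
    git add test
    git commit -m nobom test
    # Plain UTF32 made iconv prepend a BOM here
    iconv -f UTF-8 -t UTF32 raw > test
    git diff # reports: "Binary files a/test and b/test differ"
    git commit -m bom test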
    

    Verify that two different blob objects are present for test:

    git rev-list --objects --all
    1d0cf0c1871a8743f947bd4582198db4fc1e72b1
    c52c2a8c211a0031e01eef5d5121d5d0b4aabc40
    4740254f8f52094afc131040afc80bb68265e78c 
    fd3c513224525b3ab94a2512cbbfa918793640eb test
    2d9da153c5febf0425437395227381d3a4784154 
    2e54d36463fee81e89423d7d80ccc5d7003aba21 test
    

    Or, slightly more directly:

    for h in $(git rev-list --all -- test); do git ls-tree $h; done
    100644 blob 2e54d36463fee81e89423d7d80ccc5d7003aba21    test
    100644 blob fd3c513224525b3ab94a2512cbbfa918793640eb    test
    

    This is with git 1.7.4.1 on 64-bit Ubuntu.


    xxd test # no bom:
    0000000: 0000 0048 0000 0065 0000 0142 0000 00d7  ...H...e...B....
    0000010: 0000 00f8 0000 000a                      ........
    
    xxd test # with bom
    0000000: fffe 0000 4800 0000 6500 0000 4201 0000  ....H...e...B...
    0000010: d700 0000 f800 0000 0a00 0000            ............
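
The test above uses UTF-32, which `git diff` reports as binary. For the UTF-8 case from the question, here is a minimal sketch, assuming git is on the PATH. The file names and the sample text `hello` are made up for illustration. It compares the blob IDs that git computes for the same text with and without the three BOM bytes EF BB BF:

```shell
# Hypothetical scratch directory and file names, for illustration only.
workdir=$(mktemp -d)

# Octal escapes \357\273\277 are the UTF-8 BOM bytes EF BB BF.
printf '\357\273\277hello\n' > "$workdir/with_bom.txt"
printf 'hello\n' > "$workdir/without_bom.txt"

# git hash-object --stdin prints the blob ID for the raw bytes it reads,
# without applying any conversion filters.
with_bom=$(git hash-object --stdin < "$workdir/with_bom.txt")
without_bom=$(git hash-object --stdin < "$workdir/without_bom.txt")

echo "with BOM:    $with_bom"
echo "without BOM: $without_bom"

if [ "$with_bom" != "$without_bom" ]; then
    echo "different blobs: the BOM bytes are part of the content git hashes"
else
    echo "identical blobs"
fi
```

If the two IDs differ, the BOM removal is a content change as far as git's object store is concerned, so `git status` should normally list the file as modified. If it does not, that points towards a local filter or configuration, or the bug mentioned at the top.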