
How can I remove the BOM from a UTF-8 file?

I have a UTF-8 encoded file that starts with a BOM, and I want to remove the BOM. Are there any Linux command-line tools that can do this?

$ file test.xml
test.xml:  XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines


  • The BOM is the Unicode code point U+FEFF; its UTF-8 encoding is the three bytes 0xEF, 0xBB, 0xBF.

    With bash (4.2 or later, in a UTF-8 locale), you can create a UTF-8 BOM with the $'' special quoting form, which implements Unicode escapes: $'\uFEFF'. So with bash and GNU sed, a reliable way of removing a UTF-8 BOM from the beginning of a text file would be:

    sed -i $'1s/^\uFEFF//' file.txt

    This removes the BOM if the file starts with one, and leaves the content unchanged otherwise.

    If you are using some other shell, you might find that "$(printf '\ufeff')" produces the BOM character (this works in zsh, and in any shell without a printf builtin provided that /usr/bin/printf is the GNU version). If you want a POSIX-compatible version, spell out the three bytes as octal escapes instead:

    sed "$(printf '1s/^\357\273\277//')" file.txt

    (The -i in-place edit flag is also a GNU extension; this version writes the possibly modified file to stdout.)
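
Since the POSIX-compatible command writes to stdout, it may help to wrap it so the file is checked first and rewritten only when it really begins with EF BB BF. The following is a minimal sketch, not a definitive implementation: the function names has_bom and strip_bom are made up for illustration, and it assumes mktemp is available (it is common, but not required by POSIX).

```shell
# has_bom FILE: succeed if FILE starts with the UTF-8 BOM bytes EF BB BF.
# (Hypothetical helper name; od -N3 reads only the first three bytes.)
has_bom() {
    [ "$(od -An -tx1 -N3 "$1" | tr -d ' \n')" = "efbbbf" ]
}

# strip_bom FILE: remove a leading UTF-8 BOM from FILE, emulating sed -i
# with a temporary file. (Hypothetical helper name; assumes mktemp exists.)
strip_bom() {
    has_bom "$1" || return 0          # nothing to do, leave the file alone
    tmp=$(mktemp) || return 1
    # LC_ALL=C makes sed match raw bytes regardless of the current locale.
    LC_ALL=C sed "$(printf '1s/^\357\273\277//')" "$1" > "$tmp" &&
        cat "$tmp" > "$1"             # overwrite in place, keeping permissions
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Usage would look like `strip_bom test.xml`, after which `file test.xml` should no longer report "(with BOM)". Writing back with cat rather than mv is a deliberate choice here so that the original file's permissions and ownership are kept.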