
OSX perl to batch write filename as first line in txt file in UTF-16LE


I found a really useful bit of Perl here that writes the filename of a text file to the first line of the file. I am running it from Terminal on OS X Yosemite:

perl -i -pe 'BEGIN{undef $/;} s/^/\nFilename:$ARGV\n/' `find . -name '*.TXT'`

With some modification I thought it had solved my specific problem. However, the files I'm picking up are UTF-16LE, and I've since discovered that this command writes the new line as UTF-8, which makes a real mess of the output (the text looks correct but is not recognised in calculations in Excel, FileMaker, etc.).

After several attempts, I need help getting this script to write the filename in UTF-16LE at the start of the file. (Note: I do have a workaround now, which is to batch-convert the files to UTF-8 and then run this command, but I'd prefer to do the whole workflow in one step.)


Solution

  • reinierpost was correct: it was more about removing the original Unicode byte order mark (BOM). What worked in the end was:

    perl -i -pe 'BEGIN{undef $/;} s/\xFF\xFE/Filename:$ARGV\n/' `find . -name '*.TXT'`
    

    where the UTF-16LE BOM \xFF\xFE is replaced by my new string. For reference, some other BOMs are:

    - iso-10646-1 > \xFE\xFF
    - UTF-16BE > \xFE\xFF
    - UTF-8 > \xEF\xBB\xBF

    I was also able to write the new text into UTF-16LE with

    perl -i -pe 'BEGIN{binmode STDIN,":encoding(utf8)";binmode STDOUT,":encoding(utf16)"; undef $/;} s/\xFF\xFE/\xFF\xFE\nFilename:$ARGV\n/' `find . -name '*.TXT'`
    

    However, I now believe that my source data is a mixed bag of UTF-8 and UTF-16, as this last version produces a mixed set of characters between the new header and the data. Thanks, reinierpost, for steering me in the right direction. I remain interested if others can improve this.
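One possible improvement, offered as a sketch rather than a tested fix for this data set: with `-i`, Perl reads and writes the files through its own in-place handles, so `binmode` on STDIN and STDOUT most likely has no effect on them, and the header still goes in as single-byte text. The version below works on raw bytes instead, encodes only the new line as UTF-16LE, and inserts it directly after the BOM. It also checks each file's BOM first and skips anything that is not UTF-16LE, which should help if the source really is a mix of encodings. The helper names `bom_of` and `add_filename_header` are invented for this example, and it assumes ASCII filenames.

```shell
# Hypothetical helper: report which BOM (if any) a file starts with.
# Note: a UTF-32LE BOM also begins with FF FE and would be reported as UTF-16LE.
bom_of() {
  perl -e '
    open my $fh, "<:raw", $ARGV[0] or die "$ARGV[0]: $!\n";
    read $fh, my $head, 3;
    my $kind = $head =~ /^\xEF\xBB\xBF/ ? "UTF-8"
             : $head =~ /^\xFF\xFE/     ? "UTF-16LE"
             : $head =~ /^\xFE\xFF/     ? "UTF-16BE"
             :                            "none";
    print "$kind\n";
  ' "$1"
}

# Hypothetical helper: insert "Filename:<name>" as the first line of each
# UTF-16LE file given, keeping the BOM and encoding the new line as UTF-16LE.
# Assumes ASCII filenames; files without a UTF-16LE BOM are left untouched.
add_filename_header() {
  for f in "$@"; do
    if [ "$(bom_of "$f")" = "UTF-16LE" ]; then
      perl -MEncode=encode -0777 -i -pe '
        my $header = encode("UTF-16LE", "Filename:$ARGV\n");
        s/\A\xFF\xFE/\xFF\xFE$header/;
      ' "$f"
    else
      echo "skipped (no UTF-16LE BOM): $f" >&2
    fi
  done
}
```

It can be called the same way as the commands above, for example add_filename_header `find . -name '*.TXT'`, with the same caveat that backtick expansion breaks on filenames containing spaces. The skipped files are listed on standard error, which gives a quick inventory of whatever is not UTF-16LE.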