
Batch lowercase of text files content


After half an hour of searching for an answer to this, I can't think of a way to do it without opening each text file individually, selecting all, and lowercasing it in gedit. I would like to be able to run a script, either from the command line or (preferably) from nautilus-scripts, so that I can select the files in the GUI, right-click → Scripts → lowercase, and have it done. I know that tr can do the conversion for a single file:

    tr '[:upper:]' '[:lower:]' < input.txt > output.txt

but I can't figure out how to apply it to a batch of files. Normally I would change input.txt to *.txt and output.txt to *.txt, but that doesn't work. Any ideas?

Extra: once that is solved, how can I adapt it for nautilus-scripts? :]

Thanks!


Solution

  • Edit: This turned out to be an encoding issue - the OP's input files are UTF16.

    After a discussion in the comments, the OP copy/pasted the data from viewing with less into a pastebin: http://pastebin.com/uHmYmhpT

    It looked like this:

    <FF><FE>1^@^M^@
    ^@0^@0^@:^@0^@0^@:^@0^@9^@,^@4^@4^@2^@ ^@-^@-^@>^@ ^@0^@0^@:^@0^@0^@:^@1^@1^@,^@4^@4^@4^@^M^@
    ^@j& ^@W^@O^@K^@E^@ ^@U^@P^@^M^@
    ^@T^@H^@I^@S^@ ^@M^@O^@R^@N^@I^@N^@G^@ ^@j&^M^@
    ^@^M^@
    ^@2^@^M^@
    

    ... and so on.

    This is clearly not an ASCII (or UTF-8) text file, so most standard tools (sed, grep, awk, etc.) will not work on it.

    The <FF><FE> at the start is a Byte Order Mark (BOM) indicating that this file is UTF-16-encoded text. There is a standard tool for converting between UTF-16 and UTF-8, and UTF-8 is backward-compatible with ASCII for alphanumeric characters, so if we convert the files to UTF-8, then sed/grep/awk/etc. will be able to edit them.

    The tool we need is iconv. Unfortunately, iconv has no in-place editing feature, so we'll have to write a loop that uses a temporary file to do the conversion:

    find . -type f -name '*.srt' -print0 | while IFS= read -r -d '' filename; do
        if file "$filename" | grep -q 'UTF-16 Unicode'; then
            iconv -f UTF16 -t UTF8 -o "$filename".utf8 "$filename" && mv "$filename".utf8 "$filename"
        fi
    done
    

    Then you can run the find/sed command to lowercase them. Most programs won't care that your files are now UTF8 rather than UTF16, but if you have issues then you can write a similar loop that uses iconv to put them back into UTF16 after you've lowercased them.
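    For completeness, here is a sketch of that reverse loop, mirroring the conversion loop above with the source and target encodings swapped. The `to_utf16` helper name is my own invention for illustration, not a standard tool:

```shell
# Sketch: convert plain ASCII/UTF-8 .srt files under a directory back to UTF-16.
# Same structure as the forward loop above, with -f/-t swapped.
to_utf16() {
    find "$1" -type f -name '*.srt' -print0 | while IFS= read -r -d '' filename; do
        if file "$filename" | grep -q 'ASCII\|UTF-8'; then
            iconv -f UTF8 -t UTF16 -o "$filename".utf16 "$filename" \
                && mv "$filename".utf16 "$filename"
        fi
    done
}
```

    As in the forward direction, iconv writes to a temporary file that then replaces the original, since iconv cannot edit in place.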


    If you just want to lowercase all files matching '*.txt':

    sed -i 's/.*/\L&/' *.txt
    

    But note that this will run into issues with the command line length if there are a lot of .txt files.

    If you want to do lowercasing on all files recursively, I'd use Diego's approach - but there are a couple of errors to fix:

    find . -type f -exec sed -i 's/.*/\L&/' {} +
    

    should do the trick.

    If you don't want it to be recursive, you want it to only affect '.txt' files, and you've got too many files for the sed ... *.txt to work, then use:

    find . -maxdepth 1 -type f -name '*.txt' -exec sed -i 's/.*/\L&/' {} +
    

    (-maxdepth 1 stops the recursion)

    Older versions of find won't support the -exec ... + syntax, so if you run into trouble with that then replace the + with \;. The + is preferable because it makes find invoke sed with multiple files per invocation, rather than once per file, so it's slightly more efficient.
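    As for the nautilus-scripts "extra": Nautilus passes the selected files to a script via the NAUTILUS_SCRIPT_SELECTED_FILE_PATHS environment variable, one path per line. A hedged sketch combining that with the iconv and sed steps above (the script name and exact install path may vary by GNOME version):

```shell
#!/bin/bash
# Sketch: save as ~/.local/share/nautilus/scripts/lowercase (older GNOME used
# ~/.gnome2/nautilus-scripts) and mark it executable. Nautilus exports the
# selected files, one path per line, in NAUTILUS_SCRIPT_SELECTED_FILE_PATHS.
lowercase_selected() {
    while IFS= read -r filename; do
        [ -f "$filename" ] || continue
        # Transcode UTF-16 files to UTF-8 first so sed can process them
        if file "$filename" | grep -q 'UTF-16 Unicode'; then
            iconv -f UTF16 -t UTF8 -o "$filename".utf8 "$filename" \
                && mv "$filename".utf8 "$filename"
        fi
        sed -i 's/.*/\L&/' "$filename"
    done
}
printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS" | lowercase_selected
```

    After that, right-clicking the selected files and choosing Scripts → lowercase should do the whole job in one step.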