git, encoding, githooks

How to use git diff --name-only with non-ASCII file names


I have a pre-commit hook that runs

files=`git diff --cached --name-only --diff-filter=ACMR | grep -E "$extension_regex"`

and performs some formatting on those files before committing.

However, some of my file names contain non-ASCII letters, and I realized those files weren't being formatted.

After some debugging, I found that this was because git diff prints those file names with escaped characters and surrounded by double quotes, for example:

"\341\203\236\341\203\220\341\203\240\341\203\220\341\203\233\341\203\224\341\203\242\341\203\240\341\203\224\341\203\221\341\203\230.ext"

I tried modifying the regex pattern to accept names surrounded by quotes, and even tried removing the quotes, but wherever I try to access the file it can't be found, for example:

$ cat $file
cat: '"\341\203\236\341\203\220\341\203\240\341\203\220\341\203\233\341\203\224\341\203\242\341\203\240\341\203\224\341\203\221\341\203\230.ext"': No such file or directory

$ file="${file:1:${#file}-2}"

$ cat $file
cat: '\341\203\236\341\203\220\341\203\240\341\203\220\341\203\233\341\203\224\341\203\242\341\203\240\341\203\224\341\203\221\341\203\230.ext': No such file or directory

How do I handle files with non-ASCII characters in their names?


Solution

  • You can use the -z option to get NUL-terminated output instead of C string literal quoting. The paths are then printed as raw bytes, so non-ASCII characters come through unescaped.

    files=$(
        git diff -z --cached --name-only --diff-filter=ACMR \
        | grep -Ez "$extension_regex" \
        | tr \\0 \\n
    )
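
Because the -z output is NUL-delimited, a hook can also consume it directly instead of converting it back to newlines. The following is a minimal sketch, assuming bash (for read -d '') and GNU grep (for -z). The function name format_staged_files and the sample value of extension_regex are illustrative and not taken from the original hook:

```shell
#!/usr/bin/env bash
# Sketch: process staged files one at a time from NUL-delimited git output.
# extension_regex is the variable from the question; this value is only an example.
extension_regex='\.ext$'

# format_staged_files is a hypothetical helper name.
format_staged_files() {
    local file
    git diff -z --cached --name-only --diff-filter=ACMR |
        grep -Ez "$extension_regex" |
        while IFS= read -r -d '' file; do
            # "$file" holds the raw path bytes: no surrounding quotes, no octal escapes.
            # Replace this printf with the real formatting command.
            printf 'would format: %s\n' "$file"
        done
}
```

Keeping "$file" in double quotes matters here: an unquoted $file, as in the cat calls from the question, would still break on names that contain spaces.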
    

    UTF-8 is still not completely universal and may never be; filesystems are so disparate that anything beyond ASCII is not entirely portable. Git plays it annoyingly safe by defaulting to C string literal conventions for any path that won't round-trip in ASCII, but that choice does guarantee a safe round trip, which basically nothing else does (at least not yet).

    If you're not worried about completely unconstrained file names, in particular if you don't need to handle file names that contain newlines, you can move the tr one step earlier in the pipeline and drop the -z option from grep. Alternatively, drop -z entirely and turn core.quotepath off. From the command line:

    git -c core.quotepath=false diff --name-only | grep etc
    

    or set core.quotepath to false in your Git config.
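
To see the difference between these options side by side, the sketch below builds a throwaway repository with one non-ASCII file name and prints the path three ways. It assumes a POSIX-like system with git, tr, and grep available; the variable names (demo_repo, name, quoted, raw, via_tr) and the single-letter file name are made up for the demonstration:

```shell
#!/usr/bin/env bash
# Sketch: compare quoted and unquoted path output in a scratch repository.
demo_repo=$(mktemp -d)               # hypothetical scratch location
cd "$demo_repo"
git init -q .

name=$(printf '\341\203\236.ext')    # one Georgian letter, written as UTF-8 bytes
printf 'sample\n' > "$name"
git add -- "$name"

# quotepath forced on (the default value), so user config cannot change the demo
quoted=$(git -c core.quotepath=true diff --cached --name-only)

# quotepath off for a single command: the raw path, newline-terminated
raw=$(git -c core.quotepath=false diff --cached --name-only)

# -z with tr moved ahead of grep, so grep no longer needs its own -z
via_tr=$(git diff -z --cached --name-only | tr '\0' '\n' | grep -E '\.ext$')

printf 'quoted: %s\n' "$quoted"
printf 'raw:    %s\n' "$raw"
printf 'via tr: %s\n' "$via_tr"

# Persist the setting for this repository only (add --global for all repositories)
git config core.quotepath false
```

Both unquoted variants assume that no file name contains a newline; if that cannot be guaranteed, keep the output NUL-delimited all the way to the consumer.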