Tags: regex, bash, find, quoting

BASH find files with ñ in name


I have already tried several solutions, but none of them seems to work.

For example, if I try the following command, it works as expected:

find . -type f -name *x*

it returns:

./alphabet/output/b/box.jpg

./alphabet/output/t/taxi.jpg

but if I try any special character from the Spanish alphabet, the command doesn't work:

find . -type f -name *ñ*

The results are empty.

If I try

find . -type f -name *n*

then it shows also the filenames with the special character ñ

Also it doesn't work if I try to set the LANG variable for the command

LANG=C find . -type f -name *ñ*

or with regex

LANG=C find . -type f -name *.jpg -regex '.*[ñ].*'

Solution

  • (Part of this is stolen from a previous answer of mine.)

    Unicode allows some accented characters to be represented in several different ways: as a "code point" representing the accented character, or as a series of code points representing the unaccented version of the character, followed by the accent(s). For example, "ñ" could be represented either precomposed as U+00F1 (UTF-8 0xc3b1, Latin small letter n with tilde) or decomposed as U+006E U+0303 (UTF-8 0x6ecc83, Latin small letter n + combining tilde).
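
    The two encodings can be inspected directly from the shell. This is only a minimal sketch: the variable names are illustrative, the bytes are written as octal escapes so that printf handles them in any POSIX shell, and od is used instead of xxd simply because it is more widely installed.

    ```shell
    # Two byte sequences that both render as "ñ" (variable names are illustrative).
    composed=$(printf '\303\261')      # U+00F1, bytes c3 b1 (precomposed)
    decomposed=$(printf 'n\314\203')   # U+006E U+0303, bytes 6e cc 83 (n + combining tilde)

    # Dump the raw bytes of each form.
    printf '%s' "$composed"   | od -An -tx1
    printf '%s' "$decomposed" | od -An -tx1

    # A plain string comparison works on the bytes, so the two forms differ.
    if [ "$composed" = "$decomposed" ]; then
        echo "same"
    else
        echo "different"
    fi
    ```

    Anything that compares names byte by byte will therefore see two unrelated strings, even though both look identical on screen.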

    OS X's HFS+ filesystem requires that all filenames be stored in the UTF-8 representation of their fully decomposed form (with a few exceptions that aren't relevant here). In an HFS+ filename, "ñ" MUST be encoded as 0x6ecc83.

    When you type "ñ" on the keyboard, it uses the composed form U+00F1 (0xc3b1). You can see this with a hex dump:

    $ echo ñ | xxd
    00000000: c3b1 0a                                  ...
    

    (Note: the "0a" is the newline that echo appends to its output.) But when you use it in a filename on a Mac OS Extended (HFS+) volume, it gets converted to the decomposed form U+006E U+0303 (0x6ecc83):

    $ touch ñ
    $ ls | xxd
    00000000: 6ecc 830a                                n...

    In a UTF-8 locale these two different representations should be considered the same character, but apparently the find in macOS doesn't do this right:

    $ LC_ALL=en_US.UTF-8 find . -name '*ñ*'
    $ LC_ALL=en_US.UTF-8 find . -name '*n*'
    ./ñ
    $ LC_ALL=en_US.UTF-8 find . -name 'n?'
    ./ñ
    

    In the second and third commands, find is matching against the "n" code point, and treating the combining tilde as a completely separate character that follows it. BTW, note that I put quotes around the match patterns. This is important because without them, the shell will try to expand the pattern into a list of matching filenames in the current directory before passing it to the find command.
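
    The quoting point can be checked in a scratch directory. This is a small sketch in which echo stands in for find, purely to show what the command would receive; the file names are borrowed from the question and the variable names are illustrative.

    ```shell
    # Scratch directory holding the two file names from the question.
    tmp=$(mktemp -d)
    cd "$tmp" || exit 1
    touch box.jpg taxi.jpg

    # Unquoted: the shell replaces *x* with the matching names before echo runs.
    unquoted=$(echo *x*)
    # Quoted: the pattern reaches the command unchanged.
    quoted=$(echo '*x*')

    echo "unquoted: $unquoted"
    echo "quoted:   $quoted"
    ```

    With these two files present, the unquoted form turns into the file names themselves, so find would be handed -name box.jpg taxi.jpg rather than a pattern. An unquoted pattern only reaches find intact when nothing in the current directory happens to match it.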

    The solution? Well, there's an icky option of explicitly using the decomposed form in the pattern. You can do this with bash's $' ... ' quoting form, which allows hex bytes to be specified with \x:

    $ find . -name $'*n\xcc\x83*'
    ./ñ
    

    But it's actually even worse than that, because starting in macOS High Sierra, Apple uses the new Apple File System (APFS), which allows both representations. And since find doesn't recognize them as single characters, you can't even use a bracket expression like -name '*[ññ]*' to match both of them; you have to use an extended regular expression with -E and -regex, like this (done on a Mac with APFS):

    $ touch composed-ñ decomposed-n$'\xcc\x83' unaccented-n
    $ ls
    composed-ñ  decomposed-ñ    unaccented-n
    $ ls | xxd
    00000000: 636f 6d70 6f73 6564 2dc3 b10a 6465 636f  composed-...deco
    00000010: 6d70 6f73 6564 2d6e cc83 0a75 6e61 6363  mposed-n...unacc
    00000020: 656e 7465 642d 6e0a                      ented-n.
    $ find -E . -regex $'.*(\xc3\xb1|n\xcc\x83).*'
    ./composed-ñ
    ./decomposed-ñ
    

    (note that in a regular expression, .* is the way you match any sequence of characters, equivalent to * in a plain "glob" wildcard pattern.)
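
    The same "match either encoding" idea can also be sketched in plain shell, for cases where the names are already in hand and find's regex support is not available. This is an illustration under stated assumptions, not a replacement for the find command above: has_n_tilde is a hypothetical helper name, and the bytes are written as octal escapes for printf.

    ```shell
    composed=$(printf '\303\261')      # bytes c3 b1 (precomposed ñ)
    decomposed=$(printf 'n\314\203')   # bytes 6e cc 83 (n + combining tilde)

    # has_n_tilde (hypothetical helper): succeeds if the name contains either form.
    has_n_tilde() {
        case $1 in
            *"$composed"*|*"$decomposed"*) return 0 ;;
            *) return 1 ;;
        esac
    }

    # Try it on the three names used in the APFS example.
    for name in "composed-$composed" "decomposed-$decomposed" "unaccented-n"; do
        if has_n_tilde "$name"; then
            echo "match:    $name"
        else
            echo "no match: $name"
        fi
    done
    ```

    The case pattern lists both byte sequences as alternatives, which mirrors the (\xc3\xb1|n\xcc\x83) group in the regular expression.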

    Isn't do-it-yourself Unicode support fun?