When should I be using NSURLCanonicalPathKey?


In macOS 10.12, NSURLCanonicalPathKey was added to NSURL. The documentation states:

The URL's path as a canonical absolute file system path.

Outside of that, the only other information I've seen about it is a Swift Forums post that states:

You might want to take a look at .canonicalPathKey (NSURLCanonicalPathKey). On Apple platforms a lot of the standard UNIXy paths exist within /private/, with corresponding symlinks from the root. So /etc/ is actually /private/etc/. If you don’t canonicalise the paths you can get tripped up by this.

This seems like a pretty big deal to me, so I'm surprised it was only introduced in 10.12. I've only ever relied on NSURLPathKey, .path or bookmark data for resolving URLs and never had a problem.

  • Should I now be using the canonical path everywhere I previously used the standard path value?

  • If I'm storing path information in a database as a string, should I store the value of .path or NSURLCanonicalPathKey?

  • If I'm converting an NSURL to a string representation for use in a C/C++ library that requires a file path, should I use canonical path representation?

  • If you're displaying the path of a file to the user, should you show the canonical path?

  • How does NSURLCanonicalPathKey compare to URLByStandardizingPath and URLByResolvingSymlinksInPath, which seem to sort of do the same thing or the opposite thing...(?)

This is on macOS 10.14 and I'm only considering URLs that point to files or folders. I'm aware that bookmark data should probably be stored in a database rather than paths.


Solution

  • It depends on how you plan to use the path:

    • If you just want to display the path to the user, or store it so that you can later re-create a URL with [NSURL fileURLWithPath:], keep using the regular path as you received it. Usually you have the path because the user gave it to you in some way, and then it is best not to alter it.
    • Re-creating a URL works with either path representation, of course. But if you create one URL from "/etc" and another from "/private/etc", -[NSURL isEqual:] returns NO. If you need them to compare equal, you have to canonicalize them.
    • So, if you want to record a path in order to recognize the same item when its path is given to you again later, you should canonicalize it.
    • Keep in mind that getting the canonical path adds significant processing time (it could easily double the processing time), so avoid it where it is not necessary.
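The "canonicalize before comparing" advice above could be sketched in Swift roughly as follows. `canonicalPath(of:)` is a hypothetical helper name, not a Foundation API, and falling back to `url.path` when the resource value cannot be read is my own assumption about what is sensible.

```swift
import Foundation

/// Hypothetical helper (not part of Foundation): returns the canonical path
/// of a file URL. Assumption: if the value cannot be read (for example because
/// the item does not exist), fall back to the plain path.
func canonicalPath(of url: URL) -> String {
    let values = try? url.resourceValues(forKeys: [.canonicalPathKey])
    return values?.canonicalPath ?? url.path
}

let etc = URL(fileURLWithPath: "/etc")
let privateEtc = URL(fileURLWithPath: "/private/etc")

// Comparing the URLs directly compares their textual form.
print("URLs equal:            ", etc == privateEtc)

// Comparing canonical paths compares the on-disk location instead.
print("canonical paths equal: ", canonicalPath(of: etc) == canonicalPath(of: privateEtc))
```

Because of the extra cost mentioned above, it is probably worth calling such a helper once when a path is recorded rather than on every comparison.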

    Unicode normalization may also matter. For example, if a file or folder name uses precomposed (NFC) characters, the NSURL methods will turn them into NFD strings, whereas the BSD/POSIX functions won't. So if you get paths from a shell command, say, and compare them to paths you got from NSURLs, they may not compare as equal because one uses NFC and the other NFD characters. Ideally, if NSURL or NSFileManager gets involved with the paths at all, you should first pass your BSD paths through NSURL as well, so that both kinds of paths end up in the same composition format.
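To make the NFC/NFD mismatch concrete, here is a small Swift sketch. It only illustrates why a byte-wise comparison fails; it normalizes the strings explicitly instead of routing them through NSURL as recommended above, and the file name is just the one used later in this answer.

```swift
import Foundation

let nfc = "precomposed_\u{00FC}"     // "ü" as a single precomposed scalar (NFC)
let nfd = "precomposed_u\u{0308}"    // "u" followed by a combining diaeresis (NFD)

// Swift's String == uses canonical equivalence, so it hides the difference.
print("String ==       :", nfc == nfd)

// A byte-wise view (what strcmp() or a binary database key sees) differs.
print("UTF-8 bytes ==  :", Array(nfc.utf8) == Array(nfd.utf8))

// Bringing both sides into the same normalization form fixes the byte-wise comparison.
let lhs = nfc.decomposedStringWithCanonicalMapping
let rhs = nfd.decomposedStringWithCanonicalMapping
print("after NFD, bytes:", Array(lhs.utf8) == Array(rhs.utf8))
```

Note that Objective-C's -isEqualToString: and C string functions compare literally, so the problem is easier to hit there than with Swift's `==`.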

    Examples

    Simple example where normalization is not involved:

    Input          URLByStandardizingPath   NSURLCanonicalPathKey
    /private/var   /var                     /private/var
    /var           /var                     /private/var
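A short Swift sketch that should let you reproduce this table on your own machine. `variants(for:)` is a hypothetical helper, and the fallbacks to the input string are my own choice for the case where a value is unavailable.

```swift
import Foundation

/// Hypothetical helper: returns the standardized and the canonical form of `path`.
/// Assumption: fall back to the input when a value cannot be determined.
func variants(for path: String) -> (standardized: String, canonical: String) {
    let url = URL(fileURLWithPath: path)
    // NSURL's standardizingPath is the Swift name of URLByStandardizingPath.
    let standardized = (url as NSURL).standardizingPath?.path ?? path
    let canonical = (try? url.resourceValues(forKeys: [.canonicalPathKey]))?.canonicalPath ?? path
    return (standardized, canonical)
}

for input in ["/private/var", "/var"] {
    let result = variants(for: input)
    print(input, result.standardized, result.canonical, separator: "\t")
}
```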

    Unicode normalization

    The following example uses a prepared APFS volume that contains file names with both a precomposed and a decomposed representation of the letter "ü", along with symlinks. You can download the disk image file here.

    The directory layout is as follows:

    $ cd /Volumes/Canonical_Normalize_Test/
    $ ls -lR
    total 24
    -rw-r--r--  1 user  staff   19 Dec 29 19:27 decomposed_ü
    -rw-r--r--  1 user  staff   19 Dec 29 19:27 precomposed_ü
    drwxr-xr-x  4 user  staff  128 Dec 29 19:36 symlink_target_dir
    lrwxr-xr-x  1 user  staff   18 Dec 29 19:36 symlink_to_dir -> symlink_target_dir
    -rwxr-xr-x@ 1 user  staff  763 Dec 15 16:28 unicode_composition_check.sh
    
    ./symlink_target_dir:
    total 0
    lrwxr-xr-x  1 user  staff  17 Dec 29 19:36 decomposed_ü -> ../decomposed_ü
    lrwxr-xr-x  1 user  staff  17 Dec 29 19:36 precomposed_ü -> ../precomposed_ü
    

    The file "unicode_composition_check.sh" is a script that creates the two "...ü" files, one name using NFD, the other NFC (the script is inadequately named, unfortunately).

    Input is:

    /Volumes/Canonical_Normalize_Test/symlink_to_dir/precomposed_\U00fc
    

    (I.e. the path includes a directory symlink and uses the actual file's unicode composition, i.e. the target file name's "ü" is precomposed.)

    Method                         Result
    fileSystemRepresentation       /Volumes/Canonical_Normalize_Test/symlink_to_dir/precomposed_u\U0308
    URLByStandardizingPath         /Volumes/Canonical_Normalize_Test/symlink_to_dir/precomposed_u\U0308
    NSURLCanonicalPathKey          /Volumes/Canonical_Normalize_Test/symlink_target_dir/precomposed_u\U0308
    URLByResolvingSymlinksInPath   /Volumes/Canonical_Normalize_Test/precomposed_u\U0308

    We see that each method gives a different result:

    1. They all appear to normalize the path into NFD, i.e. the "ü" gets decomposed in all cases. That's necessary and normal for regular case-insensitive volumes, as the lookup for file names is normalization-insensitive. However, on case-sensitive volumes the composition must not be changed. I have not tested this, but I assume that all of the above functions detect the volume's case sensitivity mode and behave accordingly.

    2. Only NSURLCanonicalPathKey gives the result we need if we want to re-identify the target item later by path, regardless of which Unicode composition is used and whether the path includes symlinks to a directory: it resolves the directory symlink but not the final symlink inside symlink_target_dir. If it did resolve the final path element (as URLByResolvingSymlinksInPath does), you would not be able to target symlink files.

    3. NSString's fileSystemRepresentation does not alter the path (but normalizes it) whereas NSURL's URLByStandardizingPath alters the path in some cases (e.g. by removing "/private" from certain root folders).

    4. Only NSURLCanonicalPathKey will fix upper/lower case based on the actual on-disk path. For example, a URL created from "/applications" will not be turned into the actual "/Applications" path by any of the other functions.
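If you want to check these four observations yourself, a Swift sketch along the following lines should do. `escaped(_:)` and `report(_:)` are hypothetical helpers; the escaping mimics the \U0308 style of the table above so that NFC and NFD are visible, and "/applications" is the example path from point 4.

```swift
import Foundation

/// Renders non-ASCII scalars as \u{XXXX} so that NFC and NFD forms can be told apart.
func escaped(_ text: String) -> String {
    return text.unicodeScalars.map { scalar -> String in
        scalar.isASCII ? String(scalar) : "\\u{" + String(scalar.value, radix: 16) + "}"
    }.joined()
}

/// Hypothetical helper: prints what each of the four methods returns for `path`.
func report(_ path: String) {
    let url = URL(fileURLWithPath: path)

    // NSString's fileSystemRepresentation, copied out of the C buffer right away.
    let fsRep = String(cString: (path as NSString).fileSystemRepresentation)
    // NSURL's standardizingPath is the Swift name of URLByStandardizingPath.
    let standardized = (url as NSURL).standardizingPath?.path ?? "(nil)"
    let canonical = (try? url.resourceValues(forKeys: [.canonicalPathKey]))?.canonicalPath ?? "(unavailable)"
    let resolved = url.resolvingSymlinksInPath().path

    print("fileSystemRepresentation     ", escaped(fsRep))
    print("URLByStandardizingPath       ", escaped(standardized))
    print("NSURLCanonicalPathKey        ", escaped(canonical))
    print("URLByResolvingSymlinksInPath ", escaped(resolved))
}

report("/applications")
```

Passing a path on the test volume instead of "/applications" should show the symlink and normalization differences from the table.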

    Conclusion

    If you need to re-identify the path later, no matter which representation (normalization, symlinks to directories) is used, you have two options. Use NSURLCanonicalPathKey if you need to retain the actual item, even if it is a symlink. Use URLByResolvingSymlinksInPath if you always want to identify the target of any symlink given to you.

    Note, however (see first example), that if you use URLByResolvingSymlinksInPath, "/private/var/tmp" etc. will be turned into "/var/tmp" etc., which is unusual because the result then still contains a symlink (i.e. "/var").

    Also keep in mind that the case may not be correct unless you get the canonical path. To compensate for that when comparing paths, you first have to check whether the path is on a case-insensitive volume, so that you use the correct comparison options. As an added complication, simply comparing paths with the "case insensitive" option may not be correct for some rare scripts on HFS+ volumes, because HFS+ uses an older Unicode standard with somewhat different rules than current macOS versions.
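A minimal Swift sketch of the "check the volume first" step, assuming NSURLVolumeSupportsCaseSensitiveNamesKey is the right property to ask for this. `pathsMatch(_:_:onVolumeOf:)` is a hypothetical helper, treating an unknown volume as case-sensitive is my own assumption, and the sketch does not address the HFS+ caveat above.

```swift
import Foundation

/// Hypothetical helper: compares two path strings using the case rule of the
/// volume that contains `url`.
func pathsMatch(_ lhs: String, _ rhs: String, onVolumeOf url: URL) -> Bool {
    let values = try? url.resourceValues(forKeys: [.volumeSupportsCaseSensitiveNamesKey])
    // Assumption: if the volume property is unavailable, use the stricter rule.
    let caseSensitive = values?.volumeSupportsCaseSensitiveNames ?? true
    let options: String.CompareOptions = caseSensitive ? [] : [.caseInsensitive]
    return lhs.compare(rhs, options: options) == .orderedSame
}

let root = URL(fileURLWithPath: "/")
print(pathsMatch("/Applications", "/applications", onVolumeOf: root))
```

The printed result depends on how the boot volume is formatted, which is the point of asking the volume instead of hard-coding an option.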

    Lastly, if you just want to see if two paths point to the same file, it's safer to use other means that do not rely on paths. See this answer. And if you need to persistently remember file locations, it's best to use bookmarks, so that they are even found if the user has renamed or moved the file in the meantime.
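As a sketch of both suggestions in Swift: the first function compares device and inode numbers via POSIX stat(), which is one path-independent check and not necessarily the one the linked answer uses; the other two wrap Foundation's bookmark calls. All three function names are hypothetical.

```swift
import Foundation

/// Hypothetical helper: true if both paths resolve to the same file system object.
/// stat() follows symlinks, so "/etc" and "/private/etc" name the same item here.
func refersToSameFile(_ a: String, _ b: String) -> Bool {
    var infoA = stat()
    var infoB = stat()
    guard stat(a, &infoA) == 0, stat(b, &infoB) == 0 else { return false }
    return infoA.st_dev == infoB.st_dev && infoA.st_ino == infoB.st_ino
}

/// Hypothetical helper: creates bookmark data that can be stored, e.g. in a database.
func makeBookmark(for url: URL) throws -> Data {
    return try url.bookmarkData()
}

/// Hypothetical helper: resolves stored bookmark data back into a URL.
/// `isStale` tells the caller that the bookmark should be re-created and stored again.
func resolveBookmark(_ data: Data) throws -> (url: URL, isStale: Bool) {
    var stale = false
    let url = try URL(resolvingBookmarkData: data, bookmarkDataIsStale: &stale)
    return (url, stale)
}

print(refersToSameFile("/etc", "/private/etc"))
```

Sandboxed apps would presumably need security-scoped bookmark options, which this sketch leaves out.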


    Disclaimer: All of these findings are empirical, tested on macOS 10.13.6 and 11.1 (and the systems in between), so you may want to double-check them and leave a comment if you get different results.