
Checking CharacterSet for single UnicodeScalar yields strange behaviour


While working with CharacterSet I've come across an interesting problem. From what I have gathered so far, CharacterSet is based around UnicodeScalar: you can initialise it with scalars and check whether a scalar is contained within the set. Querying the set to find out whether it contains a Character, whose glyph could be composed of several Unicode scalar values, doesn't make sense.

My problem arises when I test with the 😆 emoji, which is a single Unicode scalar value (128518 in decimal, U+1F606). As this is a single Unicode scalar value I would have thought it would work, and here are the results:

"😆" == UnicodeScalar(128518)! // true

// A few variations to show exactly what is being set up
let supersetA = CharacterSet(charactersIn: "😆")
let supersetB = CharacterSet(charactersIn: "A😆")
let supersetC = CharacterSet(charactersIn: UnicodeScalar(128518)!...UnicodeScalar(128518)!)
let supersetD = CharacterSet(charactersIn: UnicodeScalar(65)...UnicodeScalar(65)).union(CharacterSet(charactersIn: UnicodeScalar(128518)!...UnicodeScalar(128518)!))

supersetA.contains(UnicodeScalar(128518)!) // true
supersetB.contains(UnicodeScalar(128518)!) // false
supersetC.contains(UnicodeScalar(128518)!) // true
supersetD.contains(UnicodeScalar(128518)!) // false

As you can see, the check works if the CharacterSet contains a single scalar value (perhaps due to an optimisation) but in any other circumstance it doesn't work as expected.

I cannot find any information about the lower-level implementation of CharacterSet or whether it works in a certain encoding (e.g. UTF-16, like NSString), but as the API deals a lot with UnicodeScalar I'm surprised it's failing like this, and I'm unsure why it's happening or how to investigate further.

Can anyone shed any light on why this may be?


Solution

  • The source code to CharacterSet is available, actually. The source for contains is:

    fileprivate func contains(_ member: Unicode.Scalar) -> Bool {
        switch _backing {
        case .immutable(let cs):
            return CFCharacterSetIsLongCharacterMember(cs, member.value)
        case .mutable(let cs):
            return CFCharacterSetIsLongCharacterMember(cs, member.value)
        }
    }
    

    So it basically just calls through to CFCharacterSetIsLongCharacterMember. The source code for that is also available, although only for Yosemite (the versions for El Cap and Sierra both say "Coming Soon"). However, the Yosemite code seemed to match what I was seeing in the disassembly on Sierra. Anyway, the code for that looks like this:

    Boolean CFCharacterSetIsLongCharacterMember(CFCharacterSetRef theSet, UTF32Char theChar) {
        CFIndex length;
        UInt32 plane = (theChar >> 16);
        Boolean isAnnexInverted = false;
        Boolean isInverted;
        Boolean result = false;
    
        CF_OBJC_FUNCDISPATCHV(__kCFCharacterSetTypeID, Boolean, (NSCharacterSet *)theSet, longCharacterIsMember:(UTF32Char)theChar);
    
        __CFGenericValidateType(theSet, __kCFCharacterSetTypeID);
    
        if (plane) {
            CFCharacterSetRef annexPlane;
    
            if (__CFCSetIsBuiltin(theSet)) {
                isInverted = __CFCSetIsInverted(theSet);
                return (CFUniCharIsMemberOf(theChar, __CFCSetBuiltinType(theSet)) ? !isInverted : isInverted); 
            }
    
            isAnnexInverted = __CFCSetAnnexIsInverted(theSet);
    
            if ((annexPlane = __CFCSetGetAnnexPlaneCharacterSetNoAlloc(theSet, plane)) == NULL) {
                if (!__CFCSetHasNonBMPPlane(theSet) && __CFCSetIsRange(theSet)) {
                    isInverted = __CFCSetIsInverted(theSet);
                    length = __CFCSetRangeLength(theSet);
                    return (length && __CFCSetRangeFirstChar(theSet) <= theChar && theChar < __CFCSetRangeFirstChar(theSet) + length ? !isInverted : isInverted);
                } else {
                    return (isAnnexInverted ? true : false);
                }
            } else {
                theSet = annexPlane;
                theChar &= 0xFFFF;
            }
        }
    
        isInverted = __CFCSetIsInverted(theSet);
    
        switch (__CFCSetClassType(theSet)) {
            case __kCFCharSetClassBuiltin:
                result = (CFUniCharIsMemberOf(theChar, __CFCSetBuiltinType(theSet)) ? !isInverted : isInverted);
                break;
    
            case __kCFCharSetClassRange:
                length = __CFCSetRangeLength(theSet);
                result = (length && __CFCSetRangeFirstChar(theSet) <= theChar && theChar < __CFCSetRangeFirstChar(theSet) + length ? !isInverted : isInverted);
                break;
    
            case __kCFCharSetClassString:
                result = ((length = __CFCSetStringLength(theSet)) ? (__CFCSetBsearchUniChar(__CFCSetStringBuffer(theSet), length, theChar) ? !isInverted : isInverted) : isInverted);
                break;
    
            case __kCFCharSetClassBitmap:
                result = (__CFCSetCompactBitmapBits(theSet) ? (__CFCSetIsMemberBitmap(__CFCSetBitmapBits(theSet), theChar) ? true : false) : isInverted);
                break;
    
            case __kCFCharSetClassCompactBitmap:
                result = (__CFCSetCompactBitmapBits(theSet) ? (__CFCSetIsMemberInCompactBitmap(__CFCSetCompactBitmapBits(theSet), theChar) ? true : false) : isInverted);
                break;
    
            default:
                CFAssert1(0, __kCFLogAssertion, "%s: Internal inconsistency error: unknown character set type", __PRETTY_FUNCTION__); // We should never come here
                return false; // To make compiler happy
        }
    
        return (result ? !isAnnexInverted : isAnnexInverted);
    }
    

    So we can follow along, and figure out what's going on. Unfortunately we have to bust out our x86_64 assembly skills to do it. But fear not, for I have done this for you already, because apparently this is what I do for fun on a Friday night.

    A helpful thing to have is the data structure:

    struct __CFCharacterSet {
        CFRuntimeBase _base;
        CFHashCode _hashValue;
        union {
            struct {
                CFIndex _type;
            } _builtin;
            struct {
                UInt32 _firstChar;
                CFIndex _length;
            } _range;
            struct {
                UniChar *_buffer;
                CFIndex _length;
            } _string;
            struct {
                uint8_t *_bits;
            } _bitmap;
            struct {
                uint8_t *_cBits;
            } _compactBitmap;
       } _variants;
       CFCharSetAnnexStruct *_annex;
    };
    

    We'll need to know what the heck CFRuntimeBase is, too:

    typedef struct __CFRuntimeBase {
        uintptr_t _cfisa;
        uint8_t _cfinfo[4];
    #if __LP64__
        uint32_t _rc;
    #endif
    } CFRuntimeBase;
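
To see where each field lands in a 48-byte dump of this structure, the layout can be mirrored with fixed-width types. This is only a sketch assuming a 64-bit (LP64) target; the `Mirror*` names and the `k...Offset` constants are hypothetical stand-ins rather than CoreFoundation declarations, pointers are modelled as `uint64_t`, and only the range and string variants are included.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The sketch only holds for LP64, so refuse to compile elsewhere. */
static_assert(sizeof(void *) == 8, "sketch assumes a 64-bit (LP64) target");

/* Hypothetical mirror of CFRuntimeBase on LP64. */
typedef struct {
    uint64_t cfisa;      /* uintptr_t _cfisa */
    uint8_t  cfinfo[4];  /* uint8_t _cfinfo[4] */
    uint32_t rc;         /* uint32_t _rc (present because __LP64__ is set) */
} MirrorRuntimeBase;

/* Hypothetical mirror of struct __CFCharacterSet (range and string variants only). */
typedef struct {
    MirrorRuntimeBase base;
    uint64_t hashValue;                                   /* CFHashCode _hashValue */
    union {
        struct { uint32_t firstChar; int64_t length; } range;
        struct { uint64_t buffer;    int64_t length; } string;
    } variants;
    uint64_t annex;                                       /* CFCharSetAnnexStruct *_annex */
} MirrorCharacterSet;

/* Zero-based byte offsets of the fields of interest. */
enum {
    kCfinfoOffset    = offsetof(MirrorCharacterSet, base.cfinfo),
    kHashOffset      = offsetof(MirrorCharacterSet, hashValue),
    kFirstCharOffset = offsetof(MirrorCharacterSet, variants.range.firstChar),
    kLengthOffset    = offsetof(MirrorCharacterSet, variants.range.length),
    kAnnexOffset     = offsetof(MirrorCharacterSet, annex)
};
```

Under these assumptions `_cfinfo` starts at zero-based offset 8 (the ninth byte when counting from one), the variant union starts at offset 24, `_annex` sits at offset 40, and the whole structure is 48 bytes.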
    

    And guess what! There are also some constants that we'll need.

    enum {
            __kCFCharSetClassTypeMask = 0x0070,
                __kCFCharSetClassBuiltin = 0x0000,
                __kCFCharSetClassRange = 0x0010,
                __kCFCharSetClassString = 0x0020,
                __kCFCharSetClassBitmap = 0x0030,
                __kCFCharSetClassSet = 0x0040,
                __kCFCharSetClassCompactBitmap = 0x0040,
        // irrelevant stuff redacted
    };
    

    We can then break on CFCharacterSetIsLongCharacterMember and log the structure:

    supersetA.contains(UnicodeScalar(128518)!)
    
    (lldb) po [NSData dataWithBytes:$rdi length:48]
    <21b3d2ad ffff1d00 90190000 02000000 00000000 00000000 06f60100 00000000 01000000 00000000 00000000 00000000>
    

    Based on the structs above, we can figure out what this character set is made of. In this case, the relevant part is the _cfinfo field of CFRuntimeBase, which occupies bytes 9-12 (counting from 1). The first of these bytes, 0x90, contains the type information for the character set. It needs to be ANDed with __kCFCharSetClassTypeMask, which gets us 0x10, which is __kCFCharSetClassRange.

    For this line:

    supersetB.contains(UnicodeScalar(128518)!)
    

    the structure is:

    (lldb) po [NSData dataWithBytes:$rdi length:48]
    <21b3d2ad ffff1d00 a0190000 02000000 00000000 00000000 9066f000 01000000 02000000 00000000 00000000 00000000>
    

    This time byte 9 is 0xa0, which ANDed with the mask is 0x20, __kCFCharSetClassString.
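
The masking can be replayed on the two dumps directly. The sketch below copies the dumped bytes verbatim and decodes them at offsets that are assumed from the struct definitions under an LP64, little-endian layout (`_cfinfo` at offset 8, the variant at 24, `_annex` at 40); the helper names are made up for illustration and are not CoreFoundation functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    kClassTypeMask = 0x0070,  /* value of __kCFCharSetClassTypeMask */
    kClassRange    = 0x0010,  /* value of __kCFCharSetClassRange */
    kClassString   = 0x0020   /* value of __kCFCharSetClassString */
};

/* Dump taken for supersetA. */
const uint8_t dumpA[48] = {
    0x21,0xb3,0xd2,0xad, 0xff,0xff,0x1d,0x00,  /* _cfisa */
    0x90,0x19,0x00,0x00, 0x02,0x00,0x00,0x00,  /* _cfinfo, _rc */
    0x00,0x00,0x00,0x00, 0x00,0x00,0x00,0x00,  /* _hashValue */
    0x06,0xf6,0x01,0x00, 0x00,0x00,0x00,0x00,  /* _range._firstChar + padding */
    0x01,0x00,0x00,0x00, 0x00,0x00,0x00,0x00,  /* _range._length */
    0x00,0x00,0x00,0x00, 0x00,0x00,0x00,0x00   /* _annex */
};

/* Dump taken for supersetB. */
const uint8_t dumpB[48] = {
    0x21,0xb3,0xd2,0xad, 0xff,0xff,0x1d,0x00,  /* _cfisa */
    0xa0,0x19,0x00,0x00, 0x02,0x00,0x00,0x00,  /* _cfinfo, _rc */
    0x00,0x00,0x00,0x00, 0x00,0x00,0x00,0x00,  /* _hashValue */
    0x90,0x66,0xf0,0x00, 0x01,0x00,0x00,0x00,  /* _string._buffer */
    0x02,0x00,0x00,0x00, 0x00,0x00,0x00,0x00,  /* _string._length */
    0x00,0x00,0x00,0x00, 0x00,0x00,0x00,0x00   /* _annex */
};

/* Read little-endian values byte by byte so host endianness does not matter. */
uint64_t read_le(const uint8_t *p, size_t count) {
    uint64_t value = 0;
    assert(p != NULL && count <= 8);
    for (size_t i = 0; i < count; i++) {
        value |= (uint64_t)p[i] << (8 * i);
    }
    return value;
}

/* Class type: first _cfinfo byte ANDed with the class-type mask. */
unsigned class_type(const uint8_t *dump)       { return dump[8] & kClassTypeMask; }
/* First scalar of a range set (only meaningful when the class is range). */
uint32_t range_first_char(const uint8_t *dump) { return (uint32_t)read_le(dump + 24, 4); }
/* Number of scalars in a range set. */
uint64_t range_length(const uint8_t *dump)     { return read_le(dump + 32, 8); }
/* True when the trailing _annex pointer is NULL. */
int annex_is_null(const uint8_t *dump)         { return read_le(dump + 40, 8) == 0; }
```

With these helpers, `class_type(dumpA)` gives 0x10 and `class_type(dumpB)` gives 0x20. Read this way, the first dump also decodes to a range starting at 0x1F606 (128518) with length 1, and `annex_is_null` holds for both dumps.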

    At this point the Monty Python cast are screaming "Get On With It!", so let's go through the source for CFCharacterSetIsLongCharacterMember and see what's going on.

    Skipping past all the CF_OBJC_FUNCDISPATCHV crap, we get to this line:

    if (plane) {
    

    This evaluates to true in both cases, because the emoji's scalar value is above 0xFFFF: 128518 >> 16 is 1. Next test:

    if (__CFCSetIsBuiltin(theSet)) {
    

    This evaluates to false in both cases, since neither type was __kCFCharSetClassBuiltin, so we skip that block.

    isAnnexInverted = __CFCSetAnnexIsInverted(theSet);
    

    In both cases, the _annex pointer was null (see all the zeros at the end of the structure), so this is false.

    This test will be true for the same reason:

    if ((annexPlane = __CFCSetGetAnnexPlaneCharacterSetNoAlloc(theSet, plane)) == NULL) {
    

    taking us to:

    if (!__CFCSetHasNonBMPPlane(theSet) && __CFCSetIsRange(theSet)) {
    

    The __CFCSetHasNonBMPPlane macro checks _annex, so it evaluates to false (and its negation to true). The emoji, of course, is not in the BMP (Basic Multilingual Plane), so this actually seems wrong for both cases, even the one that was returning the correct result.

    __CFCSetIsRange checks whether our type is __kCFCharSetClassRange, which is only true for the first set (supersetA). So this is our point of divergence. The second invocation, the one on supersetB that produces the incorrect result, returns on the next line:

    return (isAnnexInverted ? true : false);
    

    And since the annex is NULL, causing isAnnexInverted to be false, this returns false.

    As for how to fix it... well, I can't. But now we know why it happened. From what I can tell, the main problem is that the _annex field isn't being filled when the character set is created, and since the annex seems to be used to keep track of characters in non-BMP planes, I think it ought to be present for both of the character sets. Incidentally, this information will probably be helpful in a bug report should you decide to file one (I'd file it against CoreFoundation, since that's where the actual issue is).
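
The walkthrough above can be condensed into a small model of the non-BMP branch. This is a deliberately stripped-down sketch, not the CoreFoundation code: `ModelSet`, `model_contains_non_bmp` and the two sample sets are hypothetical names, and the model assumes a non-builtin, non-inverted set whose `_annex` pointer is NULL, as in both dumps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two class types seen in the dumps. */
typedef enum { MODEL_CLASS_RANGE, MODEL_CLASS_STRING } ModelClass;

/* Hypothetical stand-in for a character set with no annex. */
typedef struct {
    ModelClass cls;
    uint32_t   first_char;  /* used by MODEL_CLASS_RANGE only */
    uint32_t   length;      /* used by MODEL_CLASS_RANGE only */
} ModelSet;

/* Models only the `if (plane)` branch of the membership test. */
bool model_contains_non_bmp(const ModelSet *set, uint32_t ch) {
    assert(set != NULL);
    assert((ch >> 16) != 0);  /* plane must be non-zero to reach this branch */

    /* With no annex there is no per-plane set, so the lookup ends up here. */
    if (set->cls == MODEL_CLASS_RANGE) {
        /* Range sets are still checked against first_char and length. */
        return set->length != 0
            && set->first_char <= ch
            && ch < set->first_char + set->length;
    }
    /* Any other class falls back to isAnnexInverted, which is false here. */
    return false;
}

/* supersetA: a one-element range starting at U+1F606. */
const ModelSet model_superset_a = { MODEL_CLASS_RANGE, 128518, 1 };
/* supersetB: a string-class set; its contents are never consulted on this path. */
const ModelSet model_superset_b = { MODEL_CLASS_STRING, 0, 0 };
```

In this model, `model_contains_non_bmp(&model_superset_a, 128518)` is true and `model_contains_non_bmp(&model_superset_b, 128518)` is false, which matches the results seen for supersetA and supersetB. The string set's contents never come into play, which is why adding the emoji to it makes no difference.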