Search code examples
javastringhashmap

Lookup substring in map, case-insensitive


I wrote a small lexer that transforms a charbuffer into a stream of tokens. One type of tokens are identifiers, which can either be "raw" or a keyword. To test for the latter, I have a map populated with all keywords.

Map<String, MyType> lookup = new HashMap<>();
lookup.put("RETURN", KEYWORD_RETURN);
[...]

The map is populated with all uppercase strings.

Now, what I get from my input charbuffer is simply an offset and the length, where I can find my identifier (which needn't be in uppercase there).

The obvious solution looks somewhat like this.

bool lookupIdentifier(CharBuffer buffer, int offset, int length, Map<String, MyType> lookupTable) {
    int current = buffer.position();
    buffer.rewind();
    String toCheck = buffer.subSequence(offset, offset + length).toString().toUpperCase();
    buffer.position(current);
    return lookupTable.containsKey(toCheck);
}

There are around 50 entries in the map. Is a TreeMap with a case-insensitive comparator a good alternative to the O(1) HashMap lookup?

What I dislike about my approach is that the creation of the toCheck string allocates. Is there a way to reuse the substring in the CharBuffer for lookup?


Solution

  • You can avoid expensive string construction by using a CharBuffer as key type:

    Map<CharBuffer, MyType> lookup = new TreeMap<>(Comparator
        .comparingInt(CharBuffer::remaining)
        .thenComparing((cb1,cb2) -> {
            for(int p1 = cb1.position(), p2 = cb2.position(); p1 < cb1.limit(); p1++, p2++) {
                char c1 = cb1.get(p1), c2 = cb2.get(p2);
                if(c1 == c2) continue;
                c1 = Character.toUpperCase(c1);
                c2 = Character.toUpperCase(c2);
                if(c1 != c2) return Integer.compare(c1, c2);
            }
            return 0;
        }));
    lookup.put(CharBuffer.wrap("RETURN"), MyType.KEYWORD_RETURN);
    
    boolean lookupIdentifier(
        CharBuffer buffer, int offset, int length, Map<CharBuffer, MyType> lookupTable) {
    
        int currentPos = buffer.position(), currLimit = buffer.limit();
        buffer.clear().position(offset).limit(offset + length);
        boolean result = lookupTable.containsKey(buffer);
        buffer.clear().position(currentPos).limit(currLimit);
        return result;
    }
    

    The comparator uses a cheap length comparison before performing a case insensitive character comparison. This assumes that you stay with keywords like RETURN, which have a simple case mapping.

    For a map with 50 keywords, using log₂ comparisons for a lookup might still yield reasonable performance. Mind that each comparison stops at the first mismatch.


    You can use hashing with a dedicated wrapper object:

    final class LookupKey {
        final CharBuffer cb;
        LookupKey(CharBuffer cb) {
            this.cb = cb;
        }
        @Override public int hashCode() {
            int code = 1;
            for(int p = cb.position(); p < cb.limit(); p++) {
                code = Character.toUpperCase(cb.get(p)) + code * 31;
            }
            return code;
        }
        @Override public boolean equals(Object obj) {
            if(!(obj instanceof LookupKey)) return false;
            final LookupKey other = (LookupKey)obj;
            CharBuffer cb1 = this.cb, cb2 = other.cb;
            if(cb1.remaining() != cb2.remaining()) return false;
            for(int p1 = cb1.position(), p2 = cb2.position(); p1 < cb1.limit(); p1++, p2++) {
                char c1 = cb1.get(p1), c2 = cb2.get(p2);
                if(c1 == c2) continue;
                c1 = Character.toUpperCase(c1);
                c2 = Character.toUpperCase(c2);
                if(c1 != c2) return false;
            }
            return true;
        }
    }
    
    Map<LookupKey, MyType> lookup = new HashMap<>();
    lookup.put(new LookupKey(CharBuffer.wrap("RETURN")), MyType.KEYWORD_RETURN);
    
    boolean lookupIdentifier(
        CharBuffer buffer, int offset, int length, Map<LookupKey, MyType> lookupTable) {
    
        int currentPos = buffer.position(), currLimit = buffer.limit();
        buffer.clear().position(offset).limit(offset + length);
        boolean result = lookupTable.containsKey(new LookupKey(buffer));
        buffer.clear().position(currentPos).limit(currLimit);
        return result;
    }
    

    The construction of a lightweight object like the LookupKey, which unlike String construction doesn’t need to copy character contents, is negligible. But note that hashing, unlike a comparator, has to process all characters upfront, which can turn out to be more expensive than the log₂ comparisons of a small TreeMap.

    If these keywords are unlikely to change, an explicit lookup code, i.e. a switch over invariant properties of the key strings, can be even more efficient. E.g. start with switching over the length, if most keywords differ in length, then over a character which is different for most keywords (include case labels for uppercase and lowercase variant). Another alternative is a hierarchical lookup structure for these properties.