unicode glib vala

Accessing a string out of bounds does not trigger any valgrind/ASAN/UBSAN warnings


I have this code:

static int main(string[] args) {
        info(escape_latex(args[1]));
        return 0;
}
string escape_latex(string input) {
        var builder = new StringBuilder.sized(input.length + 20);
        var map = new Gee.HashMap<string, string>();
        // ...<Snip>...
        // Fix for some weird unicode bugs
        map["\xff\xbf\xbf\xbf\xbf\xbf"] = "";
        info("Len: %d", input.char_count());
        for(var i = 0; i < input.char_count(); i++) {
                var ic = input.get_char(i);
                var as_string = ic.to_string();
                info("%d %s", i, as_string);
                if(map.has_key(as_string)) {
                        builder.append(map[as_string]);
                } else {
                        builder.append_unichar(ic);
                }
        }
        return builder.str;
}

If I pass "foo123", I get the expected output "foo123". But if I pass "Geldbeutel+Schlüsselanhänger", I get the output "Geldbeutel+Schl?sselanh?ng" (the last two characters are missing).

Then I changed the for-loop to for(var i = 0; i <= input.char_count(); i++) {

For "foo123", I still get the expected output. For "Geldbeutel+Schlüsselanhänger", I get "Geldbeutel+Schl?sselanh?nge". (Valgrind, ASAN and UBSAN don't report anything.)

Then I changed the for-loop to for(var i = 0; i <= input.char_count() + 1; i++) {

"foo123" now comes out as "foo123G", because I run over into other memory, but "Geldbeutel+Schlüsselanhänger" gives the complete output "Geldbeutel+Schl?sselanh?nger".

For this last loop variant and the second input, the log output is:

** INFO: 19:41:57.903: a.vala:23: Len: 28
** INFO: 19:41:57.903: a.vala:29: 0 G
** INFO: 19:41:57.903: a.vala:29: 1 e
** INFO: 19:41:57.903: a.vala:29: 2 l
** INFO: 19:41:57.903: a.vala:29: 3 d
** INFO: 19:41:57.903: a.vala:29: 4 b
** INFO: 19:41:57.903: a.vala:29: 5 e
** INFO: 19:41:57.903: a.vala:29: 6 u
** INFO: 19:41:57.903: a.vala:29: 7 t
** INFO: 19:41:57.903: a.vala:29: 8 e
** INFO: 19:41:57.903: a.vala:29: 9 l
** INFO: 19:41:57.903: a.vala:29: 10 +
** INFO: 19:41:57.903: a.vala:29: 11 S
** INFO: 19:41:57.903: a.vala:29: 12 c
** INFO: 19:41:57.903: a.vala:29: 13 h
** INFO: 19:41:57.903: a.vala:29: 14 l
** INFO: 19:41:57.903: a.vala:29: 15 ?
** INFO: 19:41:57.903: a.vala:29: 17 s
** INFO: 19:41:57.903: a.vala:29: 18 s
** INFO: 19:41:57.903: a.vala:29: 19 e
** INFO: 19:41:57.903: a.vala:29: 20 l
** INFO: 19:41:57.903: a.vala:29: 21 a
** INFO: 19:41:57.903: a.vala:29: 22 n
** INFO: 19:41:57.903: a.vala:29: 23 h
** INFO: 19:41:57.903: a.vala:29: 24 ?
** INFO: 19:41:57.903: a.vala:29: 26 n
** INFO: 19:41:57.903: a.vala:29: 27 g
** INFO: 19:41:57.903: a.vala:29: 28 e
** INFO: 19:41:57.903: a.vala:29: 29 r            // <- Here, I access an invalid index, but it works
** INFO: 19:41:57.903: a.vala:2: Geldbeutel+Schl?sselanh?nger

It seems to be related to Unicode, but I can't find a way to make this function work.


Solution

  • This has to do with the locale: by default, the C runtime environment uses US-ASCII. You can switch to the user's preferred locale by passing an empty string to Intl.setlocale() for LocaleCategory.ALL. Both are also the default parameter values, so a plain Intl.setlocale(); will work:

    static int main(string[] args) {
            Intl.setlocale();
            print(escape_latex(args[1]) + "\n");
            return 0;
    }
    string escape_latex(string input) {
            var builder = new StringBuilder.sized(input.length + 20);
            var map = new Gee.HashMap<string, string>();
            // ...<Snip>...
            // Fix for some weird unicode bugs
            map["\xff\xbf\xbf\xbf\xbf\xbf"] = "";
            info("Len: %d", input.char_count());
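            // get_char() takes a byte offset, not a character index, so
            // walk the string with get_next_char(), which advances i
            // by one whole UTF-8 character per iteration.
            unichar ic;
            int i = 0;
            while(input.get_next_char(ref i, out ic)) {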
                    var as_string = ic.to_string();
                    info("%d %s", i, as_string);
                    if(map.has_key(as_string)) {
                            builder.append(map[as_string]);
                    } else {
                            builder.append_unichar(ic);
                    }
            }
            return builder.str;
    }
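
Separately from the locale, the loop in the question mixes two kinds of index: char_count() counts characters, while string.get_char() interprets its argument as a byte offset. In UTF-8, ü and ä take two bytes each, which would account for the two missing trailing characters. The following is a minimal sketch, not the original poster's code: copy_by_char is a hypothetical helper name, and the Gee map lookup is left out to keep the example self-contained. It walks the string with get_next_char(), which advances a byte offset by one character per call.

```vala
// Hypothetical helper: copies a string one character at a time.
string copy_by_char(string input) {
        // input.length is the length in bytes, not in characters.
        var builder = new StringBuilder.sized(input.length);
        unichar c;
        int offset = 0; // byte offset, advanced by get_next_char()
        while (input.get_next_char(ref offset, out c)) {
                builder.append_unichar(c);
        }
        return builder.str;
}

int main(string[] args) {
        // Use the user's locale so non-ASCII output is printed correctly.
        Intl.setlocale();
        var s = "Geldbeutel+Schlüsselanhänger";
        // The two numbers differ whenever the string contains non-ASCII characters.
        print("bytes: %d, chars: %d\n", s.length, s.char_count());
        print("%s\n", copy_by_char(s));
        return 0;
}
```

If a character index is really needed, string.index_of_nth_char() converts it to a byte offset, but that rescans the string from the start on each call, so the get_next_char() loop is usually the better fit for a single pass.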