
C, trimming strings, and wide characters


Briefly, I'm parsing HTTP headers, received from libcurl, in an environment where I need wide characters. The headers arrive to me as char * strings, in the general format

name: value

I'm separating this into two strings by writing a null into the position of the colon, and then trimming:

        int offset = index_of( ':', s );

        if ( offset != -1 ) {
            s[offset] = ( char ) 0;
            char *name = trim( s );
            char *value = trim( &s[++offset] );

The trim function I'm using is one I wrote myself:

char *trim( char *s ) {
    int i;

    for ( i = strlen( s ); ( isblank( s[i] ) || iscntrl( s[i] ) ) && i >= 0;
          i-- ) {
        s[i] = '\0';
    }
    for ( i = 0; ( isblank( s[i] ) || iscntrl( s[i] ) ) && s[i] != '\0'; i++ );

    return ( char * ) &s[i];
}

I'm aware of this answer and have tried the trim functions recommended by it, but they didn't solve my problem so for the time being I've gone back to my own.

I then feed the trimmed strings to the mbstowcs function:

struct cons_pointer add_meta_string( struct cons_pointer meta, wchar_t *key,
                                     char *value ) {
    wchar_t buffer[strlen( value ) + 1];
    /* \todo something goes wrong here: I sometimes get junk characters on the
     * end of the string. */
    mbstowcs( buffer, value, strlen( value ) );
    return make_cons( make_cons( c_string_to_lisp_keyword( key ),
                                 c_string_to_lisp_string( buffer ) ), meta );
}

The junk character I get seems always to be the same one:

:: (inspect (assoc :owner (meta l)))

    STRG (1196577875) at page 7, offset 797 count 2
        String cell: character 's' (115) next at page 7 offset 798, count 2
         value: "simon翾"
"simon翾"
:: (inspect (cdr (cdr (cdr (cdr (cdr (assoc :owner (meta l)))))))))

    STRG (1196577875) at page 7, offset 802 count 2
        String cell: character '翾' (32766) next at page 0 offset 0, count 2
         value: "翾"

32766 is one less than the highest signed 16-bit number (32767), which is probably significant. It implies to me that mbstowcs is reading off the end of the string, which in turn implies that strlen may be returning a spurious value.

I am able to read wide characters from the stream:

:: (assoc :x-lambda (meta l))

"λάμβδα"

I am by no means a C expert; this is the first significant C project I've done in almost 30 years, so I may be missing something very obvious. Any help is greatly appreciated. Full source code, if you're interested, is here.


Solution

  • Off by 1

    mbstowcs() converts arrays and writes at most the number of wide characters given by its third argument. If the result is also to include a null character, account for that in the length passed to the function.

    // mbstowcs( buffer, value, strlen( value ) );
    mbstowcs( buffer, value, strlen( value ) + 1);
    

    Lack of a null character in buffer is likely messing up the following make_cons().
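To make the termination rule concrete, here is a minimal sketch of a wrapper that always leaves the destination null-terminated. The name to_wide and its signature are hypothetical and not part of the original code; the sketch relies on mbstowcs() returning the number of wide characters written, excluding the terminator, and on that count equalling the limit when no terminator was stored.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not from the original project): converts the
 * multibyte string src into dest, which has room for `size` wide
 * characters, and always leaves dest null-terminated.
 * Returns the number of wide characters stored, excluding the
 * terminator, or (size_t) -1 if src holds an invalid sequence. */
size_t to_wide( wchar_t *dest, size_t size, const char *src ) {
    if ( size == 0 ) {
        return ( size_t ) -1;   /* no room even for the terminator */
    }

    size_t n = mbstowcs( dest, src, size );

    if ( n == ( size_t ) -1 ) {
        dest[0] = L'\0';        /* invalid input: hand back an empty string */
        return n;
    }
    if ( n == size ) {
        /* mbstowcs filled the buffer and wrote no terminator:
         * sacrifice the last character to make room for one */
        dest[size - 1] = L'\0';
        n = size - 1;
    }
    return n;
}
```

With a buffer declared as wchar_t buffer[strlen( value ) + 1], calling to_wide( buffer, strlen( value ) + 1, value ) has the same effect as the corrected mbstowcs() call above, since a multibyte string never converts to more wide characters than it has bytes.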


    Other

    for ( i = strlen( s ); ( isblank( s[i] ) || iscntrl( s[i] ) ) && i >= 0; i-- ) is broken: when the whole string is trimmed away, s[-1] is read before i >= 0 is checked. Do the i >= 0 test before s[i].

    Note that the is...(int ch) functions expect ch to be in the range of unsigned char, or EOF. This code is undefined behavior when s[i] < 0. The usual fix: is...((unsigned char) s[i]).
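Putting both points together, one possible rewrite of trim() looks like the sketch below. It is an illustration rather than a drop-in replacement taken from the project: it checks the index before reading the character, casts to unsigned char before calling the classifiers, and keeps the original behavior of modifying the string in place and returning a pointer into it.

```c
#include <ctype.h>
#include <string.h>

/* Sketch of a trim that tests the bounds before reading a character
 * and casts to unsigned char before calling the <ctype.h> functions.
 * Modifies s in place and returns a pointer into s. */
char *trim( char *s ) {
    size_t len = strlen( s );

    /* strip trailing blanks and control characters */
    while ( len > 0 ) {
        unsigned char c = ( unsigned char ) s[len - 1];
        if ( !isblank( c ) && !iscntrl( c ) ) {
            break;
        }
        s[--len] = '\0';
    }

    /* skip leading blanks and control characters; the '\0' test must
     * come first because iscntrl( '\0' ) is true */
    while ( *s != '\0'
            && ( isblank( ( unsigned char ) *s )
                 || iscntrl( ( unsigned char ) *s ) ) ) {
        s++;
    }

    return s;
}
```

Using an unsigned size_t length and comparing len > 0 before indexing avoids the negative index entirely, so there is no i >= 0 test to get in the wrong order.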