Search code examples
calphabetstring.h

Remove all non alphabet characters from a string in C -- possible compiler issue


I'm writing a function in C that will take a string and remove all characters that are not a lowercase alphabetic character. I have this code written so far:

void strclean(char* str) {
   while (*str) {
      if (!(*str >= 'a' && *str <= 'z')) {
         strcpy(str, str + 1);
         str--;
      }
      str++;
   }
}

When I pass it the string "hello[][]world" the function seems to mostly work, except the output is:

hellowoldd

When I make it print after every line that it enters the if statement, here is the output I receive:

hello][]woldd
hello[]woldd
hello]woldd
hellowoldd

It seems to be really close but I can't understand why it would be producing this output! The weirdest part is I have given the code to two other friends and it works fine on their computers. We are all running the same version of Linux (ubuntu 14.04.3), and are all using gcc to compile.

I'm not sure if there is an issue with the code that would cause inconsistency of the output, or if it's a compiler issue that's creating the problem. Maybe it has to do with strcpy on my machine compared to theirs?


Solution

  • The strcpy function is not guaranteed to work if the ranges overlap, as they do in your case. From C11 7.24.2.3 The strcpy function /2 (my emphasis):

    The strcpy function copies the string pointed to by s2 (including the terminating null character) into the array pointed to by s1. If copying takes place between objects that overlap, the behavior is undefined.

    You can use something like memmove, which does work with overlapping ranges, as per C11 7.24.2.2 The memmove function /2:

    The memmove function copies n characters from the object pointed to by s2 into the object pointed to by s1. Copying takes place as if the n characters from the object pointed to by s2 are first copied into a temporary array of n characters that does not overlap the objects pointed to by s1 and s2, and then the n characters from the temporary array are copied into the object pointed to by s1.


    But there's a better solution that's O(n) rather than O(n2) in time complexity, while still being overlap-safe:

    void strclean (char* src) {
        // Run two pointers in parallel.
    
        char *dst = src;
    
        // Process every source character.
    
        while (*src) {
            // Only copy (and update destination pointer) if suitable.
            // Update source pointer always.
    
            if (islower(*src)) *dst++ = *src;
            src++;
        }
    
        // Finalise destination string.
    
        *dst = '\0';
    }
    

    You'll notice I'm also using islower() (from ctype.h) to detect lower case alphabetic characters. This is more portable since the C standard does not mandate that the alpha characters have consecutive code points (the digits are the only ones guaranteed to be consecutive).

    There's also no separate need to check for isalpha() since, as per C11 7.4.1.2 The isalpha function /2, islower() == true implies isalpha() == true:

    The isalpha function tests for any character for which isupper or islower is true, or ...