Search code examples
cperformanceloopsprintfstrcat

strcat() vs sprintf() inside a loop


I have a program that's removes every variables inside a string. Theses variables start with '$'. So, for, example, if I give a string like [1,2,$1,$2], it should return just [1,2].

However, which loop is better for perfomance?

This:

while (token != NULL)
{
    if (*token != '$')
    {
        sprintf(dst, "%s,%s", dst, token);
    }
    token = strtok(NULL, "], ");
}

or this:

while (token != NULL)
{
    if (*token != '$')
    {
        strcat(dst, token);
        strcat(dst, ",");
    }
    token = strtok(NULL, "], ");
}

Solution

    1. strtok is destructive, so the input string will not be usable after this code is executed. That being the case, you might as well do the transformation in place. This has a couple of advantages, one of which being that you do not need to allocate any memory (because the final string cannot be longer than the original string). This also requires a little bit of extra bookkeeping, but that provides another advantage: the resulting function's execution time is linear in the size of the input. Restarting the scan of the output buffer from the beginning on each iteration -- as both of your solutions do -- makes the function quadratic in the length of the string. [Note 1] The use of a quadratic algorithm is much more serious than minor differences in the cost of alternative standard library calls.

    2. As various people have mentioned, it is Undefined Behaviour to call sprintf in such a way that there is overlap between the output buffer and one of the strings to be printed. So your usage of sprintf is incorrect even though it might appear to work on some implementation.

    3. Neither strcat nor sprintf protects you from buffer overflow. You could use snprintf (by placing the new string at the end of the accumulated buffer, instead of overwriting the beginning of the buffer with itself on each iteration) or you could use strncat, but in both cases you'd need to do some extra bookkeeping.

    So here's a quick implementation of the algorithm proposed in the first point. Note that it does not call malloc() nor does it allocate any string on the stack. Also note that it uses memmove rather than memcpy to move the newly-discovered token forward in the string in order to avoid problems if the token and its destination were to overlap. (memmove allows an overlap; memcpy, strcpy and strcat do not.)

    /* Compresses the string str in place by removing leading and trailing separator
     * characters (which are the characters in 'fs') and replacing any interior
     * sequence of separator characters with a single instance of 'ofs'.
     * Returns 'str'.
     */
    char* compress(char* str, const char* fs, char ofs) {
      char* out = str;
      char* token = strtok(str, fs);
      while (token != NULL) {
        size_t tlen = strlen(token);
        memmove(out, token, tlen);
        out += tlen;
        *out++ = ofs;
        token = strtok(NULL, fs);
      }
      /* Overwrite the last delimiter (if there was one) with a NUL */
      if (out != str) --out;
      *out = 0;
      return str;
    }
    

    API notes:

    • Unlike the original, this does not discard tokens starting with a $. That would be trivial to add.

    • Also unlike the original, this function goes to the trouble of avoiding a trailing ,. Again, that would be easy to change if there were a good reason. (However, the trailing comma means that strings with just one token will end up being one character longer, so the in-place guarantee can't be made.)

    • I chose to return the address of the start of the compressed string (which is the same as the address of the input buffer) to be consistent with various standard C interfaces. However, in many cases it would be more useful to return out (which is the address of the trailing NUL), in order to allow further concatenation without having to compute the new length of the string. Alternatively, one could return the new length of the string, as sprintf does (return out - str;)

    • This API is honest about the fact that it destroys the original string (by overwriting it with the transformed one); functions which simply call strtok on their input but return a separate output can cause subtle bugs because it is not obvious to the caller that the original string is destroyed. While it is not possible to restore a string after strtok has been called on it, it is easy to convert an in-place algorithm into a non-destructive algorithm, by simply copying the original string:

      /* Returns freshly allocated memory; caller is responsible for freeing it */
      char* compress_and_copy(const char* str, const char* fs, char ofs) {
        return compress(strdup(str), fs, ofs);
      }
      
    • It's of course possible that the original unsimplified function did not have the feature that it was guaranteed to produce a shorter string; for example, it might be expanding segments starting with a $ by replacing them with the value of a variable. In this case, it would be necessary to produce a new string.

      In some cases, maybe even most cases, the output will still be shorter than the input. But one should resist the temptation to do the transformation in place if possible and allocate a new string only if necessary. Although it might be more efficient, it complicates the rules for allocation ownership; you end up having to say "caller owns the returned string only if it is different from the original string", which is clumsy and accident-prone.

      So if that's the actual use case, the optimal solution (from the perspective of clean API design) is to use strspn() and strcspn() to non-destructively walk the original string. That's a little more work, because it needs even more bookkeeping; on the other hand, it avoids the need to recompute strlen(token) after the token has been identified.

    Notes:

    1. Technically, the time is proportional to the product of the length of the string and the number of tokens, but in the worst case (where the tokens are short) this is effectively O(n2).