Search code examples
cperformanceoptimizationstrcpy

Implementing a strcpy function in C


My task is like this: I need to implement the strcpy function under the following constraints:

  1. The function can have no more than seven statements.
  2. It should be as fast as possible.
  3. It should use the minimum amount of memory it can.
  4. In the function that will call my strcpy, the destination address will be held as follows: char* newDestination = NULL;
  5. The prototype of the strcpy function should be: void myStrcp(void** dst, void* src);

I came out with this solution which uses uint64_t to copy each iteration eight bytes. If so, my questions would be:

  1. Is there a better solution than mine - and if so please explain why it is better?
  2. Does it matter on which OS we are running the program (Windows Vs. Linux) and / or platform?

My solution (On Windows):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <conio.h>

void strCpy(void **dst, void *src);

int main()
{
    char *newLocation = NULL;

    strCpy((void **)&newLocation, "stringToBeCopied");
    printf("after my strcpy dst has the string: %s \n", newLocation);
    free(newLocation);
    getch();
    return 0;
}

void strCpy(void** dst, void* src)
{
    // Allocating memory for the dst string
    uint64_t i, length = strlen((char *)src), *locDst =
        (uint64_t *) malloc(length + 1), *locSrc = (uint64_t *) src;
    *dst = locDst;

    // Copy 8 Bytes each iteration
    for (i = 0; i < length / 8; *locDst++ = *locSrc++, ++i);

    // In case the length of the string is not alligned to 8 Bytes - copy the remainder
    // (last iteration)
    char *char_dst = (char *)locDst, *char_src = (char *)locSrc;

    for (; *char_src != '\0'; *char_dst++ = *char_src++);

    // NULL terminator
    *char_dst = '\0';
}

Solution

  • Vectorization is indeed the key. A better solution along the same thought would be using SSE/AVX for an even more efficient copy. This of course makes the program platform specific as you need to detect the max vectorization supported.

    Several issues you should also address:

    1. alignment of src/dst - if the chunk you copy (in your case above - a 64bit one) exceeds a cache line, then the HW would most likely incur an overhead in doing the copy due to cache line split. The overhead would probably become bigger in longer vectors (and is also more frequent there). You could therefore add a few initial checks to address this problem by copying a head in smaller chunks like you handle the tail.

    2. Can the src/dst regions collide? if so you need to provide a definition for correct functional behavior (it becomes less trivial in case of copying in chunks).

    3. Note the difference between strcpy and memcpy (see also here). This makes the vectorization much less trivial, so you need to define the requirement here. Currently your function might differ from what is expected in a classic strcpy, as you don't check for null bytes within each chunk. Not sure if that's an issue for you.

    4. Code size limitation is not very performance friendly (well, except when your bottleneck is instruction-cache capacity or branch predictability, but that's pretty advanced). The 7-statements limitation might mean you're overthinking this :)