My task is as follows: I need to implement the strcpy function under the following constraints:
strcpy, the destination address will be held as follows: char* newDestination = NULL;
strcpy function should be: void myStrcp(void** dst, void* src);
I came up with this solution, which uses uint64_t to copy eight bytes per iteration. If so, my questions would be:
Windows vs. Linux) and / or platform?
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <conio.h>

void strCpy(void **dst, void *src);

int main()
{
    char *newLocation = NULL;
    strCpy((void **)&newLocation, "stringToBeCopied");
    printf("after my strcpy dst has the string: %s \n", newLocation);
    free(newLocation);
    getch();
    return 0;
}

void strCpy(void** dst, void* src)
{
    // Allocating memory for the dst string
    uint64_t i, length = strlen((char *)src), *locDst =
        (uint64_t *) malloc(length + 1), *locSrc = (uint64_t *) src;
    *dst = locDst;

    // Copy 8 bytes each iteration
    for (i = 0; i < length / 8; *locDst++ = *locSrc++, ++i);

    // In case the length of the string is not aligned to 8 bytes - copy the remainder
    // (last iteration)
    char *char_dst = (char *)locDst, *char_src = (char *)locSrc;
    for (; *char_src != '\0'; *char_dst++ = *char_src++);

    // NULL terminator
    *char_dst = '\0';
}

Vectorization is indeed the key. A better solution along the same lines would be to use SSE/AVX for an even more efficient copy. This of course makes the program platform-specific, as you need to detect the widest vector extension the CPU supports.
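To make the SSE idea concrete, here is a minimal sketch of a 16-byte-per-iteration copy. The name copy_bytes_vec is hypothetical, and the sketch assumes compile-time detection through the __SSE2__ macro (which GCC and Clang define); it falls back to plain memcpy otherwise. Runtime detection of wider extensions such as AVX is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: copies n bytes from src to dst.
   Assumption: the two regions do not overlap. */
void copy_bytes_vec(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;

#if defined(__SSE2__)
    /* 16 bytes per iteration; the "u" load/store variants
       tolerate unaligned addresses. */
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_storeu_si128((__m128i *)(d + i), chunk);
    }
#endif

    /* Tail (or the whole buffer when SSE2 is not available). */
    memcpy(d + i, s + i, n - i);
}
```

For a string, the length would be found first (for example with strlen) and passed as n, including the terminator.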
Several issues you should also address:
Alignment of src/dst: if a chunk you copy (64 bits in your case) straddles a cache-line boundary, the hardware will most likely incur overhead from the cache-line split. The overhead probably grows with longer vectors, where splits are also more frequent. You could therefore add a few initial checks and copy a head in smaller chunks, the same way you handle the tail.
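The head/body/tail structure described above could look roughly like this. It is only a sketch: copy_aligned_head is a hypothetical name, it aligns on the destination pointer only (source and destination may have different alignments), and it assumes the regions do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies n bytes, aligning dst to 8 bytes first.
   Assumption: the two regions do not overlap. */
void copy_aligned_head(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: single bytes until the destination is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)d % sizeof(uint64_t)) != 0) {
        *d++ = *s++;
        --n;
    }

    /* Body: 8 bytes per iteration. Going through memcpy into a local
       avoids strict-aliasing and unaligned-source problems; compilers
       typically lower it to a single load and store. */
    while (n >= sizeof(uint64_t)) {
        uint64_t chunk;
        memcpy(&chunk, s, sizeof chunk);
        memcpy(d, &chunk, sizeof chunk);
        d += sizeof chunk;
        s += sizeof chunk;
        n -= sizeof chunk;
    }

    /* Tail: the remaining 0..7 bytes. */
    while (n > 0) {
        *d++ = *s++;
        --n;
    }
}
```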
Can the src/dst regions overlap? If so, you need to define the correct functional behavior (this becomes less trivial when copying in chunks).
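One possible definition is memmove-like behavior: the destination ends up holding what the source held before the call. A byte-wise sketch of that choice follows; copy_overlap_safe is a hypothetical name, and a chunked version would need the same direction check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper with memmove-like semantics: correct even
   when the two regions overlap. */
void copy_overlap_safe(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((uintptr_t)d < (uintptr_t)s) {
        /* dst starts before src: copy forwards so each source byte
           is read before it can be overwritten. */
        for (size_t i = 0; i < n; ++i)
            d[i] = s[i];
    } else if ((uintptr_t)d > (uintptr_t)s) {
        /* dst starts after src: copy backwards for the same reason. */
        for (size_t i = n; i > 0; --i)
            d[i - 1] = s[i - 1];
    }
    /* d == s: nothing to do. */
}
```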
Note the difference between strcpy and memcpy: strcpy has to stop at the first null byte, whereas memcpy copies a known length. This makes the vectorization much less trivial, so you need to define the requirement here. Currently your function differs from a classic strcpy in that it does not check for null bytes within each chunk; it relies on a preliminary strlen call instead, so the source is traversed twice. Not sure if that's an issue for you.
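For illustration, a per-chunk null check can be done with a well-known bit trick on a 64-bit word. The sketch below is a single-pass copy under stated assumptions: copy_until_nul and has_zero_byte are hypothetical names, src_cap is the size of the buffer holding src (so 8-byte reads never run past it), and dst is assumed to have room for src_cap + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the eight bytes in v is 0x00. */
static int has_zero_byte(uint64_t v)
{
    return ((v - UINT64_C(0x0101010101010101)) & ~v
            & UINT64_C(0x8080808080808080)) != 0;
}

/* Hypothetical single-pass copy. Assumptions: src_cap is the size of
   the buffer holding src, and dst has room for src_cap + 1 bytes.
   Returns the number of characters copied, excluding the terminator. */
size_t copy_until_nul(char *dst, const char *src, size_t src_cap)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;

    /* Copy whole chunks for as long as they contain no terminator. */
    while (i + sizeof(uint64_t) <= src_cap) {
        uint64_t chunk;
        memcpy(&chunk, src + i, sizeof chunk);
        if (has_zero_byte(chunk))
            break;
        memcpy(dst + i, &chunk, sizeof chunk);
        i += sizeof chunk;
    }

    /* Finish byte by byte, stopping at the terminator. */
    while (i < src_cap && src[i] != '\0') {
        dst[i] = src[i];
        ++i;
    }
    dst[i] = '\0';
    return i;
}
```

Passing the source capacity explicitly is a deliberate simplification; a real strcpy does not know it and has to rely on alignment guarantees to keep wide reads safe.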
A code-size limitation is not very performance-friendly (well, except when your bottleneck is instruction-cache capacity or branch predictability, but that's pretty advanced). The 7-statement limitation might mean you're overthinking this :)