Let's start with this:
I have a 16-byte block of memory and I need to copy only the even bytes to an 8-byte block of memory.
My current algorithm is doing something like this:
unsigned int source_size = 16, destination_size = 8, i;
unsigned char * source = new unsigned char[source_size];
unsigned char * destination = new unsigned char[destination_size];
// fill source
for (i = 0; i < source_size; ++i)
{
    source[i] = 0xf + i;
}
// source :
// 0f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e
// copy
for (i = 0; i < destination_size; ++i)
{
    destination[i] = source[i * 2];
}
// destination :
// 0f 11 13 15 17 19 1b 1d
It's just an example, because I would like to know if there's a better method to do this when I need to get every 3rd byte or every 4th byte, not just even bytes.
I know I can achieve this with a loop, but I need to optimize it. I don't really know how to use SSE, so I don't know whether it can be used in this case, but something like memcpy magic would be great.
I also thought about using a macro to get rid of the loop since the size of the source and the destination are both constant, but that doesn't look like a big deal.
It may help you think outside the box to know that this is for extracting the Y, Cb and Cr bytes from a YUYV pixel format. I should also emphasize that I'm doing this to get rid of libswscale.
Unfortunately, you can't do this with memcpy() tricks alone. Modern processors have 64-bit registers, and that is generally the most efficient size for memory transfers. Modern compilers typically try to optimize memcpy() calls into 64-bit (or 32-bit, or even 128-bit) transfers.
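Since memcpy() only moves contiguous bytes, a strided copy still needs a loop of some kind. As a minimal sketch, here is a plain C++ version where the stride is a template parameter, so the compiler knows it at compile time and can unroll or vectorize if it chooses to. The name copy_strided is hypothetical, made up for this illustration, and whether it beats your current loop depends on your compiler and flags.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from any library): copies every Stride-th byte
// of src, starting at index 0, into count consecutive bytes of dst.
// src must hold at least (count - 1) * Stride + 1 readable bytes.
template <std::size_t Stride>
void copy_strided(const unsigned char* src, unsigned char* dst, std::size_t count)
{
    static_assert(Stride > 0, "stride must be at least 1");
    assert(src != nullptr && dst != nullptr);
    for (std::size_t i = 0; i < count; ++i)
    {
        dst[i] = src[i * Stride];
    }
}
```

With the buffers from the question, copy_strided<2>(source, destination, 8) does the same thing as the original loop, and copy_strided<4>(source, destination, 4) picks every 4th byte.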
But in your case you need 'strange' 24- or 16-bit transfers. That is exactly why we have SSE, NEON and other processor extensions, and why they are so widely used in video processing.
So in your case, you should use one of the SSE-optimized libraries or write your own assembly code to do these memory transfers.
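To give an idea of what that can look like, here is a rough sketch for the 16-byte case from the question. It uses SSE2 compiler intrinsics from <emmintrin.h> rather than hand-written assembly, which is a swap on my part, and it falls back to the plain loop when SSE2 is not available. The function name extract_even_bytes16 and the HAVE_SSE2 macro are hypothetical, and I have not benchmarked it, so treat it as a starting point only.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2 1     // hypothetical macro, local to this sketch
#endif

// Hypothetical helper: copies bytes 0, 2, 4, ..., 14 of a 16-byte block
// into an 8-byte block. Neither pointer needs to be aligned.
inline void extract_even_bytes16(const unsigned char* src, unsigned char* dst)
{
    assert(src != nullptr && dst != nullptr);
#ifdef HAVE_SSE2
    // Load all 16 source bytes (unaligned load).
    const __m128i in = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    // Treat the data as eight 16-bit lanes and keep only the low byte of
    // each lane, which is the even-indexed byte on little-endian x86.
    const __m128i even = _mm_and_si128(in, _mm_set1_epi16(0x00ff));
    // Narrow the eight 16-bit lanes to eight bytes. Every lane is in
    // 0..255, so the unsigned saturation never changes a value.
    const __m128i packed = _mm_packus_epi16(even, even);
    // Store the low 8 bytes of the result.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(dst), packed);
#else
    // Portable fallback: the same loop as in the question.
    for (std::size_t i = 0; i < 8; ++i)
    {
        dst[i] = src[i * 2];
    }
#endif
}
```

Calling extract_even_bytes16(source, destination) on the buffers from the question should give the same destination as the original loop. For the odd bytes, replacing the mask step with _mm_srli_epi16(in, 8) shifts the high byte of each lane down instead. Strides of 3 do not map onto 16-bit lanes this way and would need a byte shuffle such as SSSE3's _mm_shuffle_epi8, which is not shown here.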