Multiplying and adding float numbers

I have a task to convert some c++ code to asm and I wonder if what I am thinking makes any sense. First I would convert integers to floats. I would like to get array data to sse register, but here is problem, because I want only 3 not 4 integers, is there way to overcome that? Then I would convert those integers to floats using CVTDQ2PS and I would save those numbers in memory. For the const numbers like 0.393 I would make 3 vectors of floats and then I would do same operation three times so I will think about sepiaRed only. For that I would get my converted integers into sse register and I would multiply those numbers which would give me the result in xmm0 register. Now how can I add them together?

I guess my two questions are: how can I get 3 items from array to sse register, so that I can avoid any problems. And then how can I add three numbers in xmm0 register together.

    tmpGreen = (float)pixels[i + 1];
    tmpRed = (float)pixels[i + 2];
    tmpBlue = (float)pixels[i];

    sepiaRed = (int)(0.393 * tmpRed + 0.769 * tmpGreen + 0.189 * tmpBlue); //red
    sepiaGreen = (int)(0.349 * tmpRed + 0.686 * tmpGreen + 0.168 * tmpBlue); //green
    sepiaBlue = (int)(0.272 * tmpRed + 0.534 * tmpGreen + 0.131 * tmpBlue); //blue

Solution

You can't easily horizontally add 3 numbers together; Fastest way to do horizontal SSE vector sum (or other reduction)

What you can efficiently do is map 4 pixels in parallel, with vectors of 4 reds, 4 greens, and 4 blues. (Which you'd want to load from planar, not interleaved, pixel data. struct of arrays, not array of structs.)

You might be able to get some benefit for doing a single pixel at once, though, if you just load 4 ints with movdqu and use a multiplier of 0.0 for the high element after cvtdq2ps. Then you can do a normal horizontal sum of 4 elements instead of having to adjust it. (Hmm, although doing 3 would let you do the 2nd shuffle in parallel with the first add, instead of after.)

Using SIMD inefficiently loses some of the benefit; see guides in https://stackoverflow.com/tags/sse/info especially https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/ re: how people often try to use one SIMD vector to hold one x,y,z geometry vector, and then find that SIMD didn't help much.