I want to learn SIMD programming, and I have run into an interesting situation in my code.
I simply want to measure how long my code takes to run: I apply a basic function to an array of a given size.
First I use a version of the function written with SIMD intrinsics, then I use the usual scalar approach, and I compare the running times of the two implementations of the same function.
I defined performance as (time without sse) / (time using sse).
But when my size is 8, the performance is 1.3; when size = 512 the performance is 3; when size = 1000 it is 4; and when size = 4000 it is 5.
I don't understand why the performance increases as the array size increases.
My code:
#include <immintrin.h>   // AVX intrinsics (_mm256_*)
#include <iostream>
#include <cstdlib>       // system()

void init(double* v, size_t size) {
    for (int i = 0; i < size; ++i) {
        v[i] = i / 10.0;
    }
}

void sub_func_sse(double* v, int start_idx) {
    __m256d vector = _mm256_loadu_pd(v + start_idx);
    __m256d base = _mm256_set_pd(2.0, 2.0, 2.0, 2.0);
    for (int i = 0; i < 128; ++i) {
        vector = _mm256_mul_pd(vector, base);
    }
    _mm256_storeu_pd(v + start_idx, vector);
}

void sub_func(double& item) {
    for (int k = 0; k < 128; ++k) {
        item *= 2.0;
    }
}

int main() {
    const size_t size = 8;
    double* v = new double[size];
    init(v, size);
    const int num_repeat = 2000; // I repeat the measurements because I want
                                 // an average time - it gives clearer information.
    double total_time_sse = 0;
    for (int p = 0; p < num_repeat; ++p) {
        init(v, size);
        TimerHc t; // TimerHc is a simple high-resolution timer (not shown here)
        t.restart();
        for (int i = 0; i < size; i += 8) {
            sub_func_sse(v, i);
        }
        total_time_sse += t.toc();
    }
    double total_time = 0;
    for (int p = 0; p < num_repeat; ++p) {
        init(v, size);
        TimerHc t;
        t.restart();
        for (int i = 0; i < size; ++i) {
            sub_func(v[i]);
        }
        total_time += t.toc();
    }
    std::cout << "time using sse = " << total_time_sse / num_repeat << std::endl <<
        "time without sse = " << total_time / num_repeat << std::endl;
    system("pause");
}
You defined performance as (time without sse) / (time using sse); what you are measuring there is the speedup.
The speedup you can expect from applying parallelization is modelled by Amdahl's law. It relates the savings in those parts that can be made faster (by parallelization or other means) to the total speedup. Amdahl's law can be rather intimidating, because it basically says that making parts faster will not always gain you a proportional total speedup: the limit on the achievable speedup is determined by the fraction of the workload that can be parallelized.
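As a reference point, the law itself is a single formula: if a fraction p of the work can be sped up by a factor s, the total speedup is 1 / ((1 - p) + p / s). A minimal sketch of that formula follows; the values of p and s are made-up illustration numbers, not measurements of your benchmark:

    #include <iostream>

    // Amdahl's law: total speedup when a fraction p of the work is
    // accelerated by a factor s and the remaining (1 - p) is not.
    double amdahl_speedup(double p, double s) {
        return 1.0 / ((1.0 - p) + p / s);
    }

    int main() {
        // Example with invented numbers: 75% of the work vectorized 4x.
        std::cout << amdahl_speedup(0.75, 4.0) << '\n';  // ~2.29
        // Even with an arbitrarily large s, the cap is 1 / (1 - p) = 4.
        std::cout << amdahl_speedup(0.75, 1e9) << '\n';  // ~4.0
    }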
Gustafson's law takes a different point of view. In a nutshell, it states that you just have to increase the workload to make efficient use of parallelization. A larger total workload typically suffers relatively less from the parallelization overhead and from the non-parallel part of the computation, and hence (in terms of Amdahl's law) makes more efficient use of parallelism.
...and in some sense, that's what you are observing here. The bigger your array, the more impact parallelization has.
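To make that hand-waving a little more concrete, here is a toy model of your measurement: assume every timed run pays some fixed overhead (timer calls, loop setup, function calls) that both versions share, plus a per-element cost that differs between the scalar and SIMD versions. All constants below are invented for illustration; they are not measured from your code.

    #include <iostream>

    // Toy model: measured time = fixed overhead + n * cost_per_element.
    // The constants are illustrative assumptions, not real measurements.
    double modelled_speedup(double n) {
        const double overhead = 400.0;  // fixed cost paid by every timed run
        const double scalar   = 8.0;    // assumed scalar cost per element
        const double simd     = 1.5;    // assumed SIMD cost per element
        return (overhead + n * scalar) / (overhead + n * simd);
    }

    int main() {
        const double sizes[] = {8.0, 512.0, 1000.0, 4000.0};
        for (double n : sizes) {
            std::cout << "n = " << n << "  ->  speedup ~ "
                      << modelled_speedup(n) << '\n';
        }
        // The ratio only approaches the pure per-element ratio (~5.3 here)
        // for large n, which is the same qualitative trend you observed.
    }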
PS: This is just some handwaving to explain why the effect you see is not too surprising. Luckily there is another answer which addresses your specific benchmark in more detail.