I have the following function using Intel intrinsics:
#include <pmmintrin.h>  // SSE3: _mm_loaddup_pd, _mm_addsub_pd

int c_lattice_worker( int lm, double* inArr, double* outArr, int arrLen,
                      double sin_, double cos_ ) {
    __m128d _msin, _mcos;
    __m128d _m0, _m1;
    _msin = _mm_loaddup_pd( &sin_ );  // broadcast sin_ to both lanes
    _mcos = _mm_loaddup_pd( &cos_ );  // broadcast cos_ to both lanes
    for ( int xnc = lm; xnc < (arrLen - (lm << 1)); xnc += 2 ) {
        _m0 = _mm_load_pd( &inArr[ xnc ] );      // aligned 16-byte load
        _m1 = _mm_shuffle_pd( _m0, _m0, 0x1 );   // swap the pair of doubles
        _m0 = _mm_mul_pd( _msin, _m0 );
        _m1 = _mm_mul_pd( _mcos, _m1 );
        _m0 = _mm_addsub_pd( _m0, _m1 );
        _mm_store_sd( &outArr[ xnc + 1 ], _m0 ); // segfault here if lm == 1
        _m1 = _mm_shuffle_pd( _m0, _m0, 0x1 );
        _mm_store_sd( &outArr[ xnc ], _m1 );     // segfault here if lm == 1
    }
    // flipping the lm modifier
    return 1 - lm;
}
Arrays inArr and outArr have even length, and lm is either 0 or 1. If lm is 0 everything works correctly, but if lm is 1 the _mm_store_sd calls cause the program to segfault (or, to put it differently, commenting out both of those lines makes the segfault go away). For lm == 1 the xnc index is not aligned to 16 bytes, but according to Intel's documentation 16-byte alignment is not required for _mm_store_sd, only for _mm_store_pd. I am clueless. Any suggestions?
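To make the alignment arithmetic concrete, here is a minimal sketch, assuming the arrays come from a 16-byte-aligned allocator (aligned_alloc is just an illustrative stand-in for however the real arrays are allocated). Since sizeof(double) is 8, &inArr[xnc] is 16-byte aligned only for even xnc:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main( void ) {
    // C11 aligned_alloc: base pointer guaranteed 16-byte aligned
    double* inArr = aligned_alloc( 16, 8 * sizeof( double ) );
    for ( int xnc = 0; xnc < 4; xnc++ ) {
        uintptr_t addr = (uintptr_t)&inArr[ xnc ];
        // prints 0, 8, 0, 8: odd indices are only 8-byte aligned
        printf( "xnc = %d, addr %% 16 = %u\n", xnc, (unsigned)( addr % 16 ) );
    }
    free( inArr );
    return 0;
}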
It turns out that:
- I can use _mm_storeu_pd to store two packed 64-bit floats to an unaligned memory address.
- But when I do this I must also use _mm_loadu_pd to load from an unaligned memory address.

So in fact _mm_load_pd was causing the segfault, but when I commented out the store operations the load became dead code and was optimized away, which is why the crash seemed to come from the stores.
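For reference, a sketch of the fixed function: it switches to the unaligned load and folds the two _mm_store_sd calls into a single _mm_storeu_pd (that folding is my own simplification; keeping the two single-double stores would work just as well):

#include <pmmintrin.h>  // SSE3: _mm_loaddup_pd, _mm_addsub_pd

int c_lattice_worker( int lm, double* inArr, double* outArr, int arrLen,
                      double sin_, double cos_ ) {
    __m128d _msin = _mm_loaddup_pd( &sin_ );  // broadcast sin_ to both lanes
    __m128d _mcos = _mm_loaddup_pd( &cos_ );  // broadcast cos_ to both lanes
    for ( int xnc = lm; xnc < (arrLen - (lm << 1)); xnc += 2 ) {
        // unaligned load: safe even when lm == 1 makes xnc odd
        __m128d _m0 = _mm_loadu_pd( &inArr[ xnc ] );
        __m128d _m1 = _mm_shuffle_pd( _m0, _m0, 0x1 );  // swap the pair
        _m0 = _mm_mul_pd( _msin, _m0 );
        _m1 = _mm_mul_pd( _mcos, _m1 );
        _m0 = _mm_addsub_pd( _m0, _m1 );
        // swap back and store both results with one unaligned store,
        // matching the original order: outArr[xnc] gets the high lane
        _mm_storeu_pd( &outArr[ xnc ], _mm_shuffle_pd( _m0, _m0, 0x1 ) );
    }
    // flipping the lm modifier
    return 1 - lm;
}

Note that _mm_loaddup_pd and _mm_addsub_pd are SSE3, so with GCC or Clang this needs -msse3 (or a -march setting that implies it).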