This question is about having to choose between better performance and clearer code (better abstraction) when dealing with arrays. I have tried to distill it down to a toy example.
C++ is particularly good at allowing abstractions without hurting performance. The question is whether this is possible in examples similar to the one below.
Consider a trivial arbitrary-size matrix class that uses contiguous row-major storage:
#include <cmath>
#include <cassert>

class Matrix {
    int nrow, ncol;
    double *data;
public:
    Matrix(int nrow, int ncol) : nrow(nrow), ncol(ncol), data(new double[nrow*ncol]) { }
    ~Matrix() { delete [] data; }
    // (copy/assignment omitted for brevity; this is only a toy example)

    int rows() const { return nrow; }
    int cols() const { return ncol; }

    // flat access into the contiguous storage
    double & operator [] (int i) { return data[i]; }
    // 2D access, row-major
    double & operator () (int i, int j) { return data[i*ncol + j]; }
};
It has a 2D indexing operator () to make it easy to work with. It also has operator [] for contiguous access, but a better-abstracted matrix class may not provide this.
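For illustration, the two access styles refer to the same storage; with a 3-by-2 matrix, mat(1, 0) and mat[1*2 + 0] are the same element:

Matrix mat(3, 2);        // 3 rows, 2 columns, row-major
mat(1, 0) = 5.0;         // 2D indexing
assert(mat[2] == 5.0);   // flat indexing: row 1, column 0 is element 1*2 + 0 = 2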
Let's implement a function that takes an n-by-2 matrix, essentially a list of 2D vectors, and normalizes each vector in-place.
The clear way:
inline double veclen(double x, double y) {
    return std::sqrt(x*x + y*y);
}

void normalize(Matrix &mat) {
    assert(mat.cols() == 2); // some kind of check for correct input
    for (int i = 0; i < mat.rows(); ++i) {
        double norm = veclen(mat(i,0), mat(i,1));
        mat(i,0) /= norm;
        mat(i,1) /= norm;
    }
}
The fast, but less clear way:
void normalize2(Matrix &mat) {
    assert(mat.cols() == 2);
    for (int i = 0; i < mat.rows(); ++i) {
        double norm = veclen(mat[2*i], mat[2*i+1]);
        mat[2*i]   /= norm;
        mat[2*i+1] /= norm;
    }
}
The second version (normalize2) has the potential to be faster because it makes explicit that no iteration of the loop accesses data written by an earlier iteration. It can therefore potentially make better use of SIMD instructions. Looking at godbolt, this seems to be what happens (unless I'm misreading the assembly).

In the first version (normalize), the compiler can't know that the input matrix is not n-by-1, which would lead to overlapping array accesses: with ncol == 1, mat(i,1) and mat(i+1,0) refer to the same element.
Question: Is it possible to somehow tell the compiler that the input matrix really is n-by-2 in normalize(), so that it can optimize to the same level as it does in normalize2()?
Addressing the comments:
John Zwinck: I went and did the benchmark. normalize2() is considerably faster (1.3 seconds vs 2.4 seconds for normalize()), but only if I remove the assert macros or if I define NDEBUG. That is a rather counterintuitive effect of -DNDEBUG, isn't it? It reduces performance instead of improving it.
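(For reference, a minimal sketch of the kind of timing harness used; the matrix size and repetition count here are assumptions for illustration, not the exact numbers from my benchmark.)

#include <chrono>
#include <cstdio>

int main() {
    const int n = 1000000;           // assumed number of 2D vectors
    Matrix mat(n, 2);
    for (int i = 0; i < n; ++i) {    // fill with arbitrary non-zero data
        mat(i, 0) = 1.0 + i;
        mat(i, 1) = 2.0 + i;
    }

    auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 100; ++rep)   // assumed repetition count
        normalize(mat);                   // or normalize2(mat)
    auto t1 = std::chrono::steady_clock::now();

    std::printf("%.3f s\n", std::chrono::duration<double>(t1 - t0).count());
}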
Max: The evidence is both the godbolt output I linked to and the benchmark above. I am also interested in the case where these two functions cannot be inlined (e.g. because they live in a separate translation unit).
Jarod42 and bolov: This is the answer I was looking for, confirmed by the benchmark mentioned in the first point. Still, it is important to know about this in case one implements one's own assert (which is exactly what I do in my application).
An answer that is acceptable to me was essentially given by @bolov and @Jarod42 in the comments. Since they did not post it, I will do so myself.
To let the compiler know that the matrix is of size n-by-2, it is sufficient to add code to the beginning of the function that makes the rest of the code unreachable when the matrix size is not correct.
For example, adding

if (mat.cols() != 2)
    throw std::runtime_error("Input array is not of expected shape.");

to the beginning of normalize() allows it to run exactly as fast as normalize2() (1.3 instead of 2.4 seconds in my benchmark with clang 5.0).
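For concreteness, the modified function looks like this (std::runtime_error requires <stdexcept>):

#include <stdexcept>

void normalize(Matrix &mat) {
    // unreachable code below if the matrix is not n-by-2,
    // so the compiler may assume cols() == 2 from here on
    if (mat.cols() != 2)
        throw std::runtime_error("Input array is not of expected shape.");
    for (int i = 0; i < mat.rows(); ++i) {
        double norm = veclen(mat(i,0), mat(i,1));
        mat(i,0) /= norm;
        mat(i,1) /= norm;
    }
}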
We can also add an assert(mat.cols() == 2) instead, but this results in the counterintuitive effect that defining -DNDEBUG during compilation makes the function considerably slower (since it removes the assertion, and with it the information the compiler had about the matrix shape).
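Since I use my own assert macro in my application anyway, one possible way to keep the hint even in release builds (a sketch, not part of the discussion above; MY_ASSERT is a hypothetical name, and __builtin_unreachable() is a GCC/Clang builtin, so other compilers need a different spelling) is to mark the failure branch as unreachable when the check is compiled out:

#include <cstdlib>

// Hypothetical custom assert: aborts on failure in debug builds; in
// release builds (-DNDEBUG) it still tells the compiler that the
// condition holds by declaring the failure branch unreachable.
#ifndef NDEBUG
  #define MY_ASSERT(cond) do { if (!(cond)) std::abort(); } while (0)
#else
  #define MY_ASSERT(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#endif

void normalize3(Matrix &mat) {
    MY_ASSERT(mat.cols() == 2);  // the shape hint now survives -DNDEBUG
    for (int i = 0; i < mat.rows(); ++i) {
        double norm = veclen(mat(i,0), mat(i,1));
        mat(i,0) /= norm;
        mat(i,1) /= norm;
    }
}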