fftw: Why is my 2D DFT output not the same as the output from taking 1D DFT of each row?

For my project, I have to take the DFT of a large 2D input matrix, process it, then convert it back with an IDFT and compare the results to the input matrix. My problem is in the 2D DFT step. I wrote a test with a small simple data set, which I execute in main(). I use the Eigen library for matrices and vectors. The output is this:

Using testM in a myTransform object:
1 2 3
4 5 6
7 8 9
calculateDFT took 33 Microseconds
DFT
          (6,0)
(-1.5,0.866025)
(-1.5,0.866025)
         (15,0)
(-1.5,0.866025)
(-1.5,0.866025)
         (24,0)
(-1.5,0.866025)
(-1.5,0.866025)
IDFT
1 2 3 4 5 6 7 8 9
Using testM in a myTransform2D object:
1 2 3
4 5 6
7 8 9
Default myTransform object created
DFT2D
          (45,0)  (-4.5,-2.59808)          (45,-0)
          (45,0) (-13.5,-7.79423)          (45,-0)
          (45,0)           (0,-0)          (45,-0)
IDFT
27.5 -0.5 -1.5 -8.5  8.5  7.5   -7   10    9

In the snippets below, this->N = this->nRows * this->nCols. The results of Test 1 and Test 2 should be the same, but they are obvisouly different. I have read the documentation over and over and still can't find why it's going wrong. fftw does row-major multi-dimensional transforms, in is filled per row of the matrix. The transfer_output function does not do anything with the values themselves, only converts the standard array to an Eigen::Matrix. Where am I going wrong? Any help would be very much appreciated. I have also tried to find similar posts here but none had my problem as far as I could find.

void test()
{
    RowVectorXf test(9);
    test << 1, 2, 3, 4, 5, 6, 7, 8, 9;
    // Prep matrix 
    Map<Matrix<float, 3, 3, RowMajor>> testM(test.data()); // convert test to rowmajor matrix

    // Test 1: feed the matrix to a myTransform object and take 1D DFTs and 1D IDFTs 
    std::cout << "Using testM in a myTransform object:\n" << testM << std::endl;
    myTransform testX1D(testM, 0);
    testX1D.vectorise();
    testX1D.calculateDFT();
    testX1D.calculateIDFT();
    std::cout << "DFT" << std::endl << testX1D.dft << std::endl;
    std::cout << "IDFT" << std::endl << testX1D.idft << std::endl; // works, too.

    .. Test 2: Feed the matrix to a myTransform2D object and take the 2D DFT and IDFT.
    std::cout << "Using testM in a myTransform2D object:\n" << testM << std::endl;
    myTransform2D testX(testM, 0); // 2D version 
    testX.vectorise(); // stored in testX.m which will hold the same as test but in a colmajor vector.
    testX.calculateDFT(); // where it goes wrong?
    std::cout << "DFT2D" << std::endl << testX.dft2D << std::endl;
    testX.calculateIDFT();
    std::cout << "IDFT" << std::endl << testX.idft << std::endl;
}

This is how I calculate the DFTs in each case, using the fftw library (fftwf because I use single precision to save memory and the values of the non-test data are on the order of -10000 to 10000, so I think that is not an issue).

void myTransform::calculateDFT()
/// Calculates discrete fourier transform of vectorised data `m`.
/** uses the FFTW library (https://fftw.org). The dft is stored in myTransform::dft*/
{
    //std::cout << m << std::endl;
    fftwf_complex* out;
    fftwf_plan p;
    out = (fftwf_complex*)fftw_malloc(sizeof(fftwf_complex) * this->nCols);
    float* in = new float[static_cast<const float&>(this->nCols)];
    p = fftwf_plan_dft_r2c_1d(this->nCols, in, out, FFTW_ESTIMATE);
    // calculate DFT for each trace and assign it to a segment of this->dft
    unsigned int factor = 0;
    auto check = std::chrono::high_resolution_clock::now();
    for (int k = 0; k < this->nRows; k++)
    {
        factor = k * this->nCols;
        //TODO: if possible, fix this slow in[i] = ... part. 
        for (int i = 0; i < this->nCols; i++)
        {
            in[i] = this->m[factor + i];
        }
        p = fftwf_plan_dft_r2c_1d(this->nCols, in, out, FFTW_ESTIMATE);
        fftwf_execute(p);
        this->transfer_output(out, k); // does nothing but add the output to a vector dft. 
    }
    delete [] in;
    fftwf_free(out);
    fftwf_destroy_plan(p);
}

And for the 2D DFT case: Here I use std::complex as specified on fftw.org. I allocate nRows * (nCols/2 + 1) single-precision floats as instructed here. For the 1D case, this is done in the 1D transfer_output function where dft is filled with out[this->nCols - i] for i > this->nCols/2

void myTransform2D::calculateDFT()
/// Should calculate the DFT in 2D with fftwf_plan_dft_r2c_2d(n0, n1, *in, *out, flags).
{
    std::complex<float>* out;
    fftwf_plan p;
    out = (std::complex<float>*)fftwf_malloc(sizeof(std::complex<float>) * this->nRows * (this->nCols/2+1)); // Hermitian symmetry for r2c transforms
    float* in = new float[this->N];
    in = (float*)fftwf_malloc(sizeof(float) * this->N);
    p = fftwf_plan_dft_r2c_2d(this->nRows, this->nCols, in, reinterpret_cast<fftwf_complex*>(out), FFTW_ESTIMATE);
    // Fill input array
    for (int i = 0; i < this->nRows; i++)
    {
        int factor = i * this->nCols;
        for (int j = 0; j < this->nCols; j++)
        {
            in[factor + j] = this->m[factor + j];
        }
    }
    fftwf_execute(p);
    transfer_output(out);
    fftwf_free(in);
    fftwf_free(out);
    fftwf_destroy_plan(p);
}

I convert back to the time domain using the IDFT, again in both 1D and 2D. I don't know for sure if the 2D version works, as the DFT goes wrong. The 1D case works, so I show the 2D case only.

void myTransform2D::calculateIDFT()
/// Calculates inverse fourier transform of `this->dft2D`.
/** Also uses the FFTW library. Results might not be perfect as we use floats
    instead of doubles because of large data sizes. */
{
    float* out = new float[this->N];
    std::complex<float>* in; 
    fftwf_plan pr;
    in = (std::complex<float>*)fftwf_malloc(sizeof(std::complex<float>) * this->N);
    out = (float*)fftwf_malloc(sizeof(float) * this->N);
    pr = fftwf_plan_dft_c2r_2d(this->nRows, this->nCols, reinterpret_cast<fftwf_complex*>(in), out, FFTW_ESTIMATE);

    for (int i = 0; i < this->nRows; i++)
    {
        for (int j = 0; j < this->nCols; j++)
        {
            in[i * this->nCols + j] = this->dft2D(i, j);
        }
    }

    fftwf_execute(pr);

    for (int i = 0; i < this->N; i++)
    {
        this->idft[i] = out[i] / this->N; // fftw does unnormalized inverse transforms.
    }

    fftwf_free(out);
    fftwf_free(in);
    fftwf_destroy_plan(pr);
}

EDIT: removed some code as suggested EDIT2: removed image, added contents as text.

Solution

It is hard to answer this question without full reproducible example that anyone can compile and test. So I will give the code that performs 2D forward and backward transforms and reference results.

Forward.

const auto fft_size = n * (n / 2 + 1);
auto out = (std::complex<float>*)fftwf_malloc(sizeof(std::complex<float>) * fft_size);
auto p = fftwf_plan_dft_r2c_2d(n, n, in, (fftwf_complex*)(out), FFTW_ESTIMATE);
fftwf_execute(p);

for (std::size_t i = 0; i < fft_size; ++i)
    std::cout << *(out + i) << ' ';

For the matrix in the row-major order

1 2 3
4 5 6
7 8 9

the correct output is:

(45,0) (-4.5,2.59808) (-13.5,7.79423) (0,0) (-13.5,-7.79423) (0,0)

Backward.

auto in2 = (float*)fftwf_malloc(sizeof(float) * n * n);
auto p2 = fftwf_plan_dft_c2r_2d(n, n, (fftwf_complex*)(out), in2, FFTW_ESTIMATE);
fftwf_execute(p2);

for (std::size_t row = 0; row < n; ++row) {
    for (std::size_t col = 0; col < n; ++col)
        std::cout << *(in2 + col + row * n) / (n * n) << ' ';
    std::cout << std::endl;
}

This outputs the original matrix.

Note that the output size of the forward transform (fft_size) is n * (n / 2 + 1) and not n * n. In your output I see 9 complex entries instead of 6. The size of in in the function calculateIDFT() is also wrong, and the way you copy values into it is probably wrong, too.