Search code examples
assemblyx86masmmatrix-multiplication

Matrix Multiplication in Assembly


I'm writing some code for matrix multiplication in assembly language. I cannot use variables and only storage on the stack what i need. The algorithm seems working right, but i have problems with IMUL and MOV using registers in the last two blocks of code. I post my code here:

        unsigned int m = 3; // raws of mat1
        unsigned int n = 2; // columns of mat1
        unsigned int k = 4; // columns of mat2
        short int mat1[] = { -1,-2, 4,5, 4,-2 }; // first matrix
        short int mat2[] = { 2,0,4,6, 0,2,-1,3 }; // second matrix
        int mat3[1024]; // output matrix

        __asm {

            XOR EAX, EAX    //mat1 raws counter
            XOR EBX, EBX    //mat2 columns counter
            XOR EDX, EDX    //mat1 columns(equal to mat2 raws) counter
            XOR EDI, EDI    //will contain sum of multiplications to be copied into output matrix

            Loop1 :         //determinates the raws of output matrix: mat3
            XOR EBX, EBX    //at the end of first raw, column counter is resetted
                CMP m, EAX  //if loopped mat1 m-raws times...       
                JZ Finale   //...algortihm is over
                INC EAX     //increase mat1 raws counter 
                JMP Loop2

            Loop2 :             //determinates the columns of mat3
            XOR EDX, EDX        //at the end of the n-sums, mat1 column counter is resetted
                XOR EDI, EDI    //after sum of n-multiplications edi is resetted
                CMP k, EBX      //if multiplications/sums on this raw have been done...
                JZ Loop1        //...go to next output matrix raw
                INC EBX         //increase mat2 columns counter
                JMP Loop3

            Loop3 :         //determinates elements of mat3
            CMP n, EDX      //if the n-multiplacations/sums on first n-elements have been done...
                JZ Loop2    //...skip to next n-elements
                INC EDX     //increase counter of the elements that will be multiplicate
                JMP Stuffs  //go to operations code block

            Stuffs :                                        //here code generates mat3 elements
    #58     MOV SI, mat1[2 * ((EAX - 1) * 2 + (EDX - 1)]    //moves to SI the [m-raws/n-clomumn] element of mat1
    #59         IMUL SI, mat2[2 * ((EBX - 1) * 2 + (EDX - 1)]   //multiplicates(with sign) SI and [n-raws/k-column] element of mat2
                ADD DI, SI                                  //sum the result in edi
                CMP n, EDX                                  //check the sums
                JZ CopyResult                               //if n-sums have been done...
                JMP Loop3                                   //...go to copy result into mat3

            CopyResult :
    #66     MOV mat3[4 * ((EAX - 1) * 4 + (EBX - 1))], EDI  //copy value to output matrix mat3
                JMP Loop3                                   //go to next n-elements

            Finale :
    }

    {
        unsigned int i, j, h;

        printf("Output matrix:\n");
        for (i = h = 0; i < m; i++) {
            for (j = 0; j < k; j++, h++)
                printf("%6d ", mat3[h]);
            printf("\n");
        }

    }

In this code, compiler reports two types of error referring IMUL and MOV for mat1, mat2 and mat3. Here they are:

  • Line 58 - Error C2404 'EDX': illegal register in 'second operand'
  • Line 58 - Error C2430 more than one index register in 'second operand'

The same errors for lines 59 and 66, with EDX and EBX registers.

Is this algortihm basically good? (i tested setting manually some indexes, and then the last one, during debug and it was good but i couldn't totally test it).

I think that the first error depends on the second one, but if i cannot use this way the registers, what can i do to compute output?


Solution

  • Instead of trying to scale multiple registers by two in an addressing mode (which is impossible), just use add eax, 2 instead of inc eax.

    Also, since your output matrix uses 32-bit int, you should be doing 32-bit math. You're generating a value in DI and then storing that plus whatever garbage is in the high half of EDI with line #66.

    Something like movsx esi, word ptr [rowstart + column] /
    movsx eax, word ptr [offset_in_column + row] / imul eax, esi might work for (part of) the body of the inner loop. I'll let you sort out incrementing by columns in the first addressing-mode, and incrementing by rows in the 2nd addressing mode.


    I think your algorithm is probably sane, based on what I think you're trying to do. For each element of the output matrix, loop over a column in one matrix and a row in the other matrix. So you only ever store once to each element of the output matrix. Whether your loops actually do that or not, IDK: It hurts my brain how ugly the branching is. (look at optimizing compiler output for a loop sometime, then for a double or triple nested loop. e.g. on http://gcc.godbolt.org/).


    The other ways to nest the loops might be better or worse for performance with large matrices, but the only really good ways to do this (for large matrices) involve transposing one of the input matrices so you can loop over contiguous memory elements in both matrices at once (since transposing costs O(n^2) time, but speeds up the O(n^3) step which traverses the transposed array repeatedly, because it gives more cache hits).

    (Given how common floating point matmul is in scientific computing, this is a topic that's been studied extensively, with much effort put into experimental tuning of code. See various implementations of the DGEMM function in BLAS.)