
ojAlgo MatrixR032 creation from string, normalization, and cosine similarity calculation


I have two comma-delimited strings containing embeddings. Each element fits in a float, and each embedding is 128 elements long. I am following this linear algebra intro in the ojAlgo library. I'd like to convert the two strings to ojAlgo matrices, normalize them, and then compute their cosine similarity. I am testing with a single matrix first: the cosine similarity of a vector with itself should be 1.0.

    PhysicalStore.Factory<Double, Primitive32Store> storeFactory = Primitive32Store.FACTORY;
    String dummyMatrixValues = "0.47058824,0.5647059,0.54901963,0.54509807,0.54901963";
    Primitive32Store matrixR032 = storeFactory.rows(Arrays.stream(dummyMatrixValues.split(","))
            .mapToDouble(Double::parseDouble)
            .toArray());
    System.out.println("Primitive32Store : " + matrixR032);
    matrixR032.modifyAny(DataProcessors.STANDARD_SCORE);
    System.out.println("Primitive32Store - normalized : " + matrixR032);
    System.out.println(matrixR032);
    System.out.println("matrixR032 " + matrixR032.multiply(storeFactory.make(matrixR032.transpose())));

 [java] Primitive32Store : org.ojalgo.matrix.store.Primitive32Store < 1 x 5 >
 [java] { { 0.47058823704719543,    0.5647059082984924, 0.5490196347236633, 0.545098066329956,  0.5490196347236633 } }
 [java] Primitive32Store - normalized : org.ojalgo.matrix.store.Primitive32Store < 1 x 5 >
 [java] { { NaN,    NaN,    NaN,    NaN,    NaN } }
 [java] org.ojalgo.matrix.store.Primitive32Store < 1 x 5 >
 [java] { { NaN,    NaN,    NaN,    NaN,    NaN } }
 [java] matrixR032 org.ojalgo.matrix.store.Primitive32Store < 1 x 1 >
 [java] { { NaN } }
 [java] 

However, my normalization results in NaN, and the input numbers are given additional digits I did not specify.

  1. How can I ensure that when I convert from string->double[]->Primitive32Store additional digits are not added?
  2. How can I normalize my vector and compute its cosine similarity?

Update: when I switch to MatrixR064, the numbers no longer have seemingly random digits added to the end.


Solution

  • Primitive32Store matrixR032 = storeFactory.rows(Arrays.stream(dummyMatrixValues.split(","))
                .mapToDouble(Double::parseDouble)
                .toArray());
    

    is a somewhat messy way to do this; you don't really see what's going on. How about this way:

        String dummyMatrixValues = "0.47058824,0.5647059,0.54901963,0.54509807,0.54901963";
        String[] values = dummyMatrixValues.split(",");
    
        PhysicalStore.Factory<Double, Primitive32Store> factory = Primitive32Store.FACTORY;
    
        Primitive32Store vector = factory.make(values.length, 1);
    
        for (int i = 0; i < values.length; i++) {
            vector.set(i, 0, Double.parseDouble(values[i]));
        }
    
        vector.modifyAny(DataProcessors.STANDARD_SCORE);
    
        double norm = vector.norm();
        double dotp = vector.dot(vector);
        double similarity = dotp / (norm * norm);
    
        System.out.println("norm: " + norm);
        System.out.println("dotp: " + dotp);
        System.out.println("similarity: " + similarity);
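
For the original goal of comparing two different embedding strings, the same formula (dot product divided by the product of the two norms) can be sketched in plain Java without ojAlgo. This is only an illustration: the class name `CosineSketch` and the helpers `parse` and `cosine` are hypothetical names introduced here, not part of the library.

```java
import java.util.Arrays;

public class CosineSketch {

    // Hypothetical helper: parse a comma-delimited string into a double[].
    static double[] parse(String csv) {
        return Arrays.stream(csv.split(","))
                .map(String::trim)
                .mapToDouble(Double::parseDouble)
                .toArray();
    }

    // cosine(a, b) = (a . b) / (|a| * |b|)
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] v = parse("0.47058824,0.5647059,0.54901963,0.54509807,0.54901963");
        // A vector compared with itself should give 1.0, up to rounding.
        System.out.println("similarity: " + cosine(v, v));
    }
}
```

With two 128-element strings, `cosine(parse(first), parse(second))` would give the similarity between them; no prior normalization is needed because the formula already divides by both norms.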
    

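    I assume the "additional digits" are representation errors. The 32 in the class name Primitive32Store indicates that it uses 32-bit float, so each parsed double is rounded to the nearest float, and the extra digits show up when that float is printed as a double.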
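
The effect can be reproduced in plain Java, assuming the store simply narrows each double to a 32-bit float and widens it again for printing. The class name `FloatDigitsSketch` and the helper `throughFloat` are hypothetical, used only for this illustration.

```java
public class FloatDigitsSketch {

    // Narrow a parsed double to 32-bit float, then widen it back to double.
    // Assumption: this mimics what a float-backed store does with each element.
    static double throughFloat(String text) {
        float narrowed = (float) Double.parseDouble(text);
        return (double) narrowed;
    }

    public static void main(String[] args) {
        String text = "0.47058824";
        float asFloat = (float) Double.parseDouble(text);
        // Printed as a float, the value keeps its short decimal form.
        System.out.println("as float : " + asFloat);
        // Printed as a double, the float's exact binary value needs more digits.
        System.out.println("as double: " + throughFloat(text));
        // Parsed and kept as a double, no narrowing takes place.
        System.out.println("parsed   : " + Double.parseDouble(text));
    }
}
```

The stored value is the same in the first two lines; only the precision used to print it differs. That is consistent with the digits disappearing once a 64-bit store is used.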

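    The DataProcessors class assumes data is stored in columns; in your case that means 1 column and 5 rows. You did the opposite (transposed). With 1 row and 5 columns, each column holds a single value, so there is no spread to divide by, which would explain the NaN values.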
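
A plain-Java sketch of a column-wise standard score shows how a one-element column can end up as NaN. It assumes the population standard deviation (dividing by n), which may not be exactly what ojAlgo's DataProcessors uses; `StandardScoreSketch` and `standardScore` are hypothetical names.

```java
public class StandardScoreSketch {

    // Standard score of one column: (x - mean) / standard deviation.
    // Assumption: population standard deviation (divide by n).
    static double[] standardScore(double[] column) {
        int n = column.length;
        double mean = 0.0;
        for (double x : column) {
            mean += x;
        }
        mean /= n;

        double sumSq = 0.0;
        for (double x : column) {
            sumSq += (x - mean) * (x - mean);
        }
        double std = Math.sqrt(sumSq / n);

        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            // For a single-element column this is 0.0 / 0.0, which is NaN.
            scores[i] = (column[i] - mean) / std;
        }
        return scores;
    }

    public static void main(String[] args) {
        double[] single = standardScore(new double[] {0.47058824});
        System.out.println("one-element column: " + single[0]);

        double[] full = standardScore(
                new double[] {0.47058824, 0.5647059, 0.54901963, 0.54509807, 0.54901963});
        System.out.println("five-element column: " + java.util.Arrays.toString(full));
    }
}
```

Treating the five values as one column gives finite scores, while treating each value as its own column gives NaN for every entry, matching the shape of the output in the question.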