python, deep-learning, classification, dna-sequence

How can I use a classifier on data where each data point is a set of floating-point values?


I have data in this format:

[0.266465 0.9203907 1.007363 ... 0. 0.09623989 0.39632136]

That is the value in the first row, first column.

This is the value in the first row, second column:

[0.9042176 1.135085 1.2988662 ... 0. 0.13614458 0.28000486]

I have 2200 such rows, and I want to train a classifier to determine whether the two sets of values in each row are similar.

P.S. These are extracted feature vector values.


Solution

  • If you assume the relationship between the two extracted feature vectors is linear, you could try the Pearson correlation:

    import numpy as np
    from scipy.stats import pearsonr
    
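    # Two random vectors stand in for a pair of extracted feature vectors
    list1 = np.random.random(100)
    list2 = np.random.random(100)
    
    # Returns the correlation coefficient and the two-sided p-value
    pearsonr(list1, list2)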
    

    An example output is:

    (0.0746901299996632, 0.4601843257734832)
    

    The first value is the correlation coefficient (about 0.07) and the second is its p-value. With a p-value above 0.05, you cannot reject the null hypothesis that the two vectors are uncorrelated at significance level alpha = 5%. If the vectors are correlated, they are in a sense similar. More about the method here.

    I also came across normalized cross-correlation, which is used to measure similarity between images (I am not an expert, so better check this).
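
To apply this idea to many rows, something along these lines might work. It is only a sketch: the helper names `pearson_similarity` and `normalized_cross_correlation`, the synthetic `rows`, and the decision rule (positive correlation with a p-value below `alpha`) are all assumptions made for illustration, not part of the question's data or of any library. Note that the mean-subtracted, zero-lag form of normalized cross-correlation used here is numerically the same quantity as Pearson's r for 1-D vectors.

```python
import numpy as np
from scipy.stats import pearsonr


def pearson_similarity(vec_a, vec_b, alpha=0.05):
    """Return (r, p_value, is_similar) for two equal-length 1-D vectors."""
    r, p_value = pearsonr(vec_a, vec_b)
    # Hypothetical decision rule: positive and statistically significant correlation.
    is_similar = bool(r > 0 and p_value < alpha)
    return float(r), float(p_value), is_similar


def normalized_cross_correlation(vec_a, vec_b):
    """Zero-lag, mean-subtracted normalized cross-correlation of two 1-D vectors."""
    a = np.asarray(vec_a, dtype=float)
    b = np.asarray(vec_b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # A constant vector has zero variance; report 0.0 instead of dividing by zero.
    return float(a @ b / denom) if denom > 0 else 0.0


# Synthetic stand-in for the real data: each row holds two feature vectors.
rng = np.random.default_rng(0)
base = rng.random(100)
rows = [
    (base, base + 0.05 * rng.standard_normal(100)),  # near-duplicate pair
    (base, rng.random(100)),                         # unrelated pair
]

for vec_a, vec_b in rows:
    r, p_value, is_similar = pearson_similarity(vec_a, vec_b)
    ncc = normalized_cross_correlation(vec_a, vec_b)
    print(f"r={r:.3f}  p={p_value:.3g}  ncc={ncc:.3f}  similar={is_similar}")
```

With real data, `rows` would be replaced by the 2200 pairs of feature vectors, and the resulting scores could either be thresholded directly or used as input features for a classifier if labels are available.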