r, compare, col

Percentage of similarity between two columns


Let's say I have two columns:

A  B
1  1
2  2
3  4
4  4
5  4
6  6

Is there a way to calculate the percentage of similarity, so that in the example above we find that columns A and B are 67% the same?


Solution

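  • We could take the intersect of the elements in 'A' and 'B', get its length, and divide by the nrow of 'df1'. Note that intersect returns the unique values common to both columns, regardless of their position.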

    paste0(round(100*length(intersect(df1$A, df1$B))/nrow(df1)), "%")
    #[1] "67%"
    

    If the comparison is between corresponding elements, use == instead of intersect, sum the TRUE values from the logical output, and divide by the number of rows.

    paste0(round(100*with(df1, sum(A==B))/nrow(df1)), "%")
    #[1] "67%"
    

    Or just use mean, since the mean of a logical vector is the proportion of TRUE values

    paste0(round(100*with(df1, mean(A==B))), "%")
    #[1] "67%"
    

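    NOTE: This is one of those examples where all of the methods happen to give the same result. In general the intersect approach can differ from the element-wise comparison, because it counts unique shared values rather than matching positions.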
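
The snippets above assume a data frame called df1 that is never shown being built. A minimal sketch of the whole thing, assuming df1 holds the two columns from the question; the helper names pct_rowwise and pct_shared and the reordered data frame df2 are made up here for illustration and are not part of the original answer:

```r
# Rebuild the example data from the question (assumed layout of df1).
df1 <- data.frame(A = 1:6, B = c(1, 2, 4, 4, 4, 6))

# Position-wise similarity: percentage of rows where the two values are equal.
pct_rowwise <- function(x, y) round(100 * mean(x == y))

# Set-based similarity: unique shared values relative to the number of elements.
pct_shared <- function(x, y) round(100 * length(intersect(x, y)) / length(x))

pct_rowwise(df1$A, df1$B)  # 67
pct_shared(df1$A, df1$B)   # 67

# Hypothetical reordering of B: same set of values, different positions.
df2 <- data.frame(A = 1:6, B = c(6, 4, 4, 4, 2, 1))

pct_rowwise(df2$A, df2$B)  # 17
pct_shared(df2$A, df2$B)   # 67
```

With df2 only the fourth row matches position by position, while the set of shared values is unchanged, so the two measures diverge. Which one is appropriate depends on whether "similarity" means matching rows or overlapping values.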