python sequencematcher

Compare two dataframe columns with binary data


I have two columns with binary data (1s and 0s), and I want to compute the percent similarity between one column and the other. Because they are binary, the comparison must be based on the position of each cell, not on the overall count of 0s and 1s. For example:

column_1     column_2
   0            1
   1            1
   0            0
   1            0

In that case, both columns contain the same number of 0s and 1s (a 100% match by counts). However, taking the position of each value into account, only 2 of the 4 rows agree, so the match is just 50%. That last figure is the one I'm trying to compute.

I know I could do it with a loop, but for larger columns that could be slow.


Solution

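  • This builds a boolean vector that is True where column_1 equals column_2 and False elsewhere, sums it (each True counts as 1), and divides by the number of rows. The result is a fraction between 0 and 1; multiply by 100 for a percentage.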

    sim = sum( df.column_1 == df.column_2 ) / len(df.column_1)
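
As a rough, self-contained check, the sketch below rebuilds the four-row example from the question and applies the same element-wise comparison. The DataFrame construction and the names `matches` and `percent` are assumptions added for illustration. Using `.mean()` on the boolean Series is an equivalent alternative to summing and dividing, not the answer's original wording.

```python
import pandas as pd

# Example data copied from the question (assumed to live in a DataFrame named df).
df = pd.DataFrame({
    "column_1": [0, 1, 0, 1],
    "column_2": [1, 1, 0, 0],
})

# Element-wise comparison: True where both columns hold the same value in a row.
matches = df["column_1"] == df["column_2"]

# The mean of a boolean Series is the fraction of True values,
# i.e. sum(matches) / len(matches).
sim = matches.mean()
percent = 100 * sim

print(percent)  # 50.0
```

Both forms are vectorized, so no explicit Python loop over the rows is needed.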