pairwise similarity comparison matlab

I have a matrix A containing events and the probability that each one occurs. For example:

A= [1, 0.6; 5, 0.3; 4, 0.1]

Event 1 occurs with a probability of 60%, event 5 with 30% and event 4 with 10%.

Then I have a series of similar matrices (events-probabilities)

B = [1,0.5; 3,0.4; 2,0.1]
C = [2,0.9; 4,0.1; 3,0]
D = [1,0.6; 5,0.3; 4,0.1]

and I would like to find a vector showing the similarity of A with each of the other matrices.

SIM = [?,?,1]

The first 2 elements contain the similarity between A and B and between A and C. The 3rd element shows the similarity between A and D (1 because they are the same).

Do you have any suggestions on how to implement a function that does the pairwise comparisons between the matrices?

Thanks a lot!!!

Please also consider the case where A = [3,1;5,0;2,0] (which should be equivalent to A = [3,1;2,0;1,0], etc.).

Solution

Function for similarity calculation between `A` and `B`

function SIM = SIMcalc(A,B)

%// Get joint unique events for A and B
unq_events = unique([A(:,1);B(:,1)]).'; %//'

%// Presence of events across joint unique events
event_tagA = bsxfun(@eq,A(:,1),unq_events);
event_tagB = bsxfun(@eq,B(:,1),unq_events);

%// Probabilities corresponding to each joint event. Summing along
%// dimension 1 keeps this correct when A or B has a single row.
tagged_probA = sum(bsxfun(@times,A(:,2),event_tagA),1);
tagged_probB = sum(bsxfun(@times,B(:,2),event_tagB),1);

%// Set not-shared events as NaN
tagged_probA(~any(event_tagA,1))=nan;
tagged_probB(~any(event_tagB,1))=nan;

%// Get the similarity factor for each shared event. This is based on the
%// assumption that probabilities that are far apart must have a low
%// similarity factor. This factor is used later to scale the individual
%// probabilities for A and B.
sim_factor = 1-abs(tagged_probA-tagged_probB);
tagged_probA_sim_scaled = tagged_probA.*sim_factor;
tagged_probB_sim_scaled = tagged_probB.*sim_factor;

%// Get a concatenated matrix of scaled probabilities
tagged_probAB_sim_scaled = [tagged_probA_sim_scaled;tagged_probB_sim_scaled];

%// Get a hybrid array of probabilities based on the mean of probabilities
%// across A and B. Notice that for cases with identical probabilities, the
%// hybrid values would stay the same.
hybrid_probAB = mean(tagged_probAB_sim_scaled);

%// Get the sum of hybrid values. Notice that the sum would result in a
%// value of 1 when we have identical probabilities for identical events.
%// nansum requires the Statistics and Machine Learning Toolbox.
SIM = nansum(hybrid_probAB);

return;
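The question asks for a vector SIM with one entry per matrix compared against A. A minimal sketch of that outer loop follows. It assumes each matrix lists every event at most once and restates the SIMcalc arithmetic inline, so that the snippet runs on its own. The cell array name `others` is an illustrative choice, not something from the question.

```matlab
%// Matrices from the question
A = [1,0.6; 5,0.3; 4,0.1];
B = [1,0.5; 3,0.4; 2,0.1];
C = [2,0.9; 4,0.1; 3,0];
D = [1,0.6; 5,0.3; 4,0.1];

others = {B, C, D};              %// hypothetical container for the matrices to compare
SIM = zeros(1, numel(others));

for k = 1:numel(others)
    M = others{k};
    %// Rows of A and M that refer to the same event
    [~, ia, im] = intersect(A(:,1), M(:,1));
    pA = A(ia,2);
    pM = M(im,2);
    %// Same arithmetic as SIMcalc for shared events: mean of the two
    %// probabilities, scaled by 1 - |difference|, summed over shared events.
    %// With SIMcalc.m on the path this can be SIM(k) = SIMcalc(A, M);
    SIM(k) = sum((1 - abs(pA - pM)) .* (pA + pM) / 2);
end

disp(SIM)
```

For the question's data this measure gives roughly 0.495 for B, 0.1 for C and 1 for D.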

Sample inputs to test out the similarity calculations

%// Case 1 - First example from the question with D replacing B.
%// The SIM value must be 1 as mentioned in the question
disp('------------- Case 1 -----------------')
A= [1, 0.6; 5, 0.3; 4, 0.1]
B = [1,0.6; 5,0.3; 4,0.1]
SIM = SIMcalc(A,B)

%// Case 2 - Slight change to the first example with event 5 being
%// replaced by event 2 in B.
%// The SIM value must be less than 1
disp('------------- Case 2 -----------------')
A= [1, 0.6; 5, 0.3; 4, 0.1]
B = [1,0.6; 2,0.3; 4,0.1]
SIM = SIMcalc(A,B)

%// Case 3 - As stated by the OP in the comments, the SIM value must be 0
disp('------------- Case 3 -----------------')
A =[3,1;2,0;1,0]
B =[2,1;1,0;4,0]
SIM = SIMcalc(A,B)

%// Case 4 - As I asked and the OP confirmed, the SIM value must be 1
disp('------------- Case 4 -----------------')
A =[3,1;2,0;1,0]
B =[3,1;2,0;1,0]
SIM = SIMcalc(A,B)

%// Case 5 - Random case added on my own.
%// Event 3 is common between A and B with identical probabilities. Apart
%// from event 3, only event 2 is common, but its probabilities are far
%// apart, so the net SIM value must be slightly more than the identical
%// probability of event 3, i.e. slightly more than 0.55
disp('------------- Case 5 -----------------')
A =[3,0.55;2,0.95;1,0]
B =[3,0.55;2,0.05;4,0.4]
SIM = SIMcalc(A,B)

Results

------------- Case 1 -----------------
A =
    1.0000    0.6000
    5.0000    0.3000
    4.0000    0.1000
B =
    1.0000    0.6000
    5.0000    0.3000
    4.0000    0.1000
SIM =
     1
------------- Case 2 -----------------
A =
    1.0000    0.6000
    5.0000    0.3000
    4.0000    0.1000
B =
    1.0000    0.6000
    2.0000    0.3000
    4.0000    0.1000
SIM =
    0.7000
------------- Case 3 -----------------
A =
     3     1
     2     0
     1     0
B =
     2     1
     1     0
     4     0
SIM =
     0
------------- Case 4 -----------------
A =
     3     1
     2     0
     1     0
B =
     3     1
     2     0
     1     0
SIM =
     1
------------- Case 5 -----------------
A =
    3.0000    0.5500
    2.0000    0.9500
    1.0000         0
B =
    3.0000    0.5500
    2.0000    0.0500
    4.0000    0.4000
SIM =
    0.6000
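The question also asks that A = [3,1;5,0;2,0] be treated the same as A = [3,1;2,0;1,0]. With the measure above, an event listed with probability 0 still counts as shared if the other matrix lists it too, so the two spellings can score differently against a third matrix. One possible preprocessing step, which is not part of the original answer, is to drop zero-probability rows before comparing. The sketch below assumes unique events per matrix. The matrix `B` and the names `reps`, `score`, `simRaw` and `simClean` are made up for illustration.

```matlab
%// Hypothetical comparison matrix (not from the question)
B    = [3,0.5; 2,0.3; 1,0.2];
%// Two spellings of "event 3 is certain" from the question
reps = {[3,1; 5,0; 2,0], [3,1; 2,0; 1,0]};

%// Same arithmetic as SIMcalc for already-matched probabilities
score = @(pA, pB) sum((1 - abs(pA - pB)) .* (pA + pB) / 2);

simRaw   = zeros(1, 2);   %// zero-probability rows kept
simClean = zeros(1, 2);   %// zero-probability rows dropped first
for k = 1:2
    Ak = reps{k};
    [~, ia, ib] = intersect(Ak(:,1), B(:,1));
    simRaw(k) = score(Ak(ia,2), B(ib,2));

    Ak = Ak(Ak(:,2) > 0, :);          %// keep only events that can occur
    Bk = B(B(:,2) > 0, :);
    [~, ia, ib] = intersect(Ak(:,1), Bk(:,1));
    simClean(k) = score(Ak(ia,2), Bk(ib,2));
end

disp(simRaw)     %// the two spellings differ here
disp(simClean)   %// the two spellings agree here
```

Whether zero-probability events should be ignored this way depends on what the similarity is meant to capture, so treat the filter as an option rather than a required step.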

Explanation

Let's take case 5 to explain in detail how the final scalar value that measures the similarity between A and B is computed. It helps to run the code for this case and watch the values of the variables.

Inputs

A =
    3.0000    0.5500
    2.0000    0.9500
    1.0000         0
B =
    3.0000    0.5500
    2.0000    0.0500
    4.0000    0.4000

Step 1

Tag the probabilities of A and B with their events, so that an event missing from one matrix is marked as NaN for that matrix. This gives tagged_probA and tagged_probB, whose values are shown below:

                Event 1  Event 2  Event 3  Event 4
tagged_probA       0      0.95     0.55     NaN
tagged_probB      NaN     0.05     0.55     0.4
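The tagging step can also be reproduced on its own. The sketch below uses `ismember` instead of the `bsxfun` masks from the function, which is a swapped-in technique that assumes every event appears at most once per matrix. The helper names `inA`, `rowA`, `inB` and `rowB` are illustrative.

```matlab
%// Case 5 inputs
A = [3,0.55; 2,0.95; 1,0];
B = [3,0.55; 2,0.05; 4,0.4];

%// Joint unique events as a row vector: [1 2 3 4]
unq_events = unique([A(:,1); B(:,1)]).';

%// Start with NaN everywhere, then fill in the events each matrix has
tagged_probA = nan(size(unq_events));
tagged_probB = nan(size(unq_events));

%// inA marks which joint events occur in A, rowA gives the row in A
[inA, rowA] = ismember(unq_events, A(:,1));
[inB, rowB] = ismember(unq_events, B(:,1));

tagged_probA(inA) = A(rowA(inA), 2).';
tagged_probB(inB) = B(rowB(inB), 2).';

disp([tagged_probA; tagged_probB])
```

Note that event 1 stays at 0 for A rather than NaN: A lists it with probability 0, so it counts as present in A.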

Step 2

Take the absolute difference between the two probabilities of each event and subtract it from 1, so that a value closer to 1 means a higher degree of similarity. For event 3 in this example the result is 1. This is the basis of the similarity measure between A and B: identical probabilities give 1, and the value decreases as the probabilities move apart on the [0, 1] scale. Events missing from either matrix stay NaN. The result is stored in sim_factor:

sim_factor =
       NaN    0.1000    1.0000       NaN

Step 3

Scale the tagged probabilities for A and B using sim_factor. Thus, we have the tagged probabilities scaled according to the similarities between A and B. These are –

tagged_probA_sim_scaled =
       NaN    0.0950    0.5500       NaN
tagged_probB_sim_scaled =
       NaN    0.0050    0.5500       NaN

Step 4

Since the final value is supposed to be a single scalar, we first average the scaled probabilities of A and B for each event. Where the probabilities are identical, as for event 3 in this example, the average equals the original probability. Where they differ, the average is scaled down according to the dissimilarity between the A and B probabilities. This is hybrid_probAB, shown below:

hybrid_probAB =
       NaN    0.0500    0.5500       NaN

Step 5

Sum the non-NaN elements of hybrid_probAB to get the final scalar similarity value, which is less than 1 for this specific case. It would have been exactly 1 for a case with identical probabilities.
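Steps 2 to 5 can be checked by hand starting from the tagged probabilities of step 1. The sketch below folds the scaling and averaging into one expression, since the mean of the two scaled rows equals sim_factor times the mean of the two tagged rows. It sums the non-NaN entries with a logical mask in place of nansum. The variable `shared` is an illustrative name.

```matlab
%// Tagged probabilities from step 1 (case 5)
tagged_probA = [0    0.95  0.55  NaN];
tagged_probB = [NaN  0.05  0.55  0.4];

%// Step 2: 1 minus the absolute difference (NaN where an event is not shared)
sim_factor = 1 - abs(tagged_probA - tagged_probB);

%// Steps 3 and 4: scale both rows by sim_factor and average them
hybrid_probAB = sim_factor .* (tagged_probA + tagged_probB) / 2;

%// Step 5: sum only the shared (non-NaN) events
shared = ~isnan(hybrid_probAB);
SIM = sum(hybrid_probAB(shared));

disp(SIM)
```

The result matches the 0.6 reported for case 5 above, up to floating-point rounding.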

Concluding remarks