
pairwise similarity comparison matlab


I have a matrix A containing events and the probability that each of them occurs. For example:

A= [1, 0.6; 5, 0.3; 4, 0.1]

Event 1 occurs with a probability of 60%, event 5 with 30%, and event 4 with 10%.

Then I have a series of similar matrices (events-probabilities)

B = [1,0.5; 3,0.4; 2,0.1]
C = [2,0.9; 4,0.1; 3,0]
D = [1,0.6; 5,0.3; 4,0.1]

and I would like to find a vector showing the similarity of A with each of the other matrices.

SIM = [?,?,1]

The first two elements contain the similarity between A and B and between A and C. The third element shows the similarity between A and D (1 because they are the same).

Do you have any suggestions on how to implement a function that does the pairwise comparisons between the matrices?

Thanks a lot!!!

Please also consider the case where A = [3,1;5,0;2,0] (which would be equivalent to A = [3,1;2,0;1,0], etc.).


Solution

  • Function for similarity calculation between A and B

    function SIM = SIMcalc(A,B)
    
    %// Get joint unique events for A and B
    unq_events = unique([A(:,1);B(:,1)]).'; %//'
    
    %// Presence of events across joint unique events
    event_tagA = bsxfun(@eq,A(:,1),unq_events);
    event_tagB = bsxfun(@eq,B(:,1),unq_events);
    
    %// Probabilities corresponding to each joint event. The explicit
    %// dimension keeps the column-wise sum correct for single-row inputs.
    tagged_probA = sum(bsxfun(@times,A(:,2),event_tagA),1);
    tagged_probB = sum(bsxfun(@times,B(:,2),event_tagB),1);
    
    %// Set not-shared events as NaN
    tagged_probA(~any(event_tagA,1))=nan;
    tagged_probB(~any(event_tagB,1))=nan;
    
    %// Get the similarity factors for each shared event. This is based on the
    %// assumption that probabilities far apart must have a low shared
    %// similarity factor. This factor is used later on to scale the
    %// individual probabilities for A and B.
    sim_factor = 1-abs(tagged_probA-tagged_probB);
    tagged_probA_sim_scaled = tagged_probA.*sim_factor;
    tagged_probB_sim_scaled = tagged_probB.*sim_factor;
    
    %// Get a concatenated matrix of scaled probabilities
    tagged_probAB_sim_scaled = [tagged_probA_sim_scaled;tagged_probB_sim_scaled];
    
    %// Get a hybrid array of probabilities based on the mean of probabilities
    %// across A and B. Notice that for cases with identical probabilities, the
    %// hybrid values would stay the same.
    hybrid_probAB = mean(tagged_probAB_sim_scaled);
    
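    %// Get the sum of hybrid values. Notice that the sum would result in a
    %// value of 1 when we have identical probabilities for identical events.
    %// nansum needs the Statistics Toolbox; sum(hybrid_probAB,'omitnan') is
    %// an equivalent in newer MATLAB releases.
    SIM = nansum(hybrid_probAB);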
    
    return;
    

    Sample inputs to test out the similarity calculations

    %// Case 1 - First example from the question with D replacing B.
    %// The SIM value must be 1 as mentioned in the question
    disp('------------- Case 1 -----------------')
    A= [1, 0.6; 5, 0.3; 4, 0.1]
    B = [1,0.6; 5,0.3; 4,0.1]
    SIM = SIMcalc(A,B)
    
    %// Case 2 - Slight change to the first example with event 5 being
    %// replaced by event 2 in B
    %// The SIM value must be less than 1 as mentioned in the question
    disp('------------- Case 2 -----------------')
    A= [1, 0.6; 5, 0.3; 4, 0.1]
    B = [1,0.6; 2,0.3; 4,0.1]
    SIM = SIMcalc(A,B)
    
    %// Case 3 - As presented in the comments by OP, that the SIM value must be 0
    disp('------------- Case 3 -----------------')
    A =[3,1;2,0;1,0]
    B =[2,1;1,0;4,0]
    SIM = SIMcalc(A,B)
    
    %// Case 4 - As asked by me and replied by OP that SIM must be 1
    disp('------------- Case 4 -----------------')
    A =[3,1;2,0;1,0]
    B =[3,1;2,0;1,0]
    SIM = SIMcalc(A,B)
    
    %// Case 5 - Random case added on my own.
    %// As can be seen, event 3 is common between A and B. Apart from event 3,
    %// only event 2 is common, but the probabilities are far apart, so the
    %// net SIM value must be slightly more than the identical probability of
    %// event 3, i.e. slightly more than 0.55
    disp('------------- Case 5 -----------------')
    A =[3,0.55;2,0.95;1,0]
    B =[3,0.55;2,0.05;4,0.4]
    SIM = SIMcalc(A,B)
    

    Results

    ------------- Case 1 -----------------
    A =
        1.0000    0.6000
        5.0000    0.3000
        4.0000    0.1000
    B =
        1.0000    0.6000
        5.0000    0.3000
        4.0000    0.1000
    SIM =
         1
    ------------- Case 2 -----------------
    A =
        1.0000    0.6000
        5.0000    0.3000
        4.0000    0.1000
    B =
        1.0000    0.6000
        2.0000    0.3000
        4.0000    0.1000
    SIM =
        0.7000
    ------------- Case 3 -----------------
    A =
         3     1
         2     0
         1     0
    B =
         2     1
         1     0
         4     0
    SIM =
         0
    ------------- Case 4 -----------------
    A =
         3     1
         2     0
         1     0
    B =
         3     1
         2     0
         1     0
    SIM =
         1
    ------------- Case 5 -----------------
    A =
        3.0000    0.5500
        2.0000    0.9500
        1.0000         0
    B =
        3.0000    0.5500
        2.0000    0.0500
        4.0000    0.4000
    SIM =
        0.6000
    

    Explanation

    Let's take case 5 to explain in detail the underlying principle behind the final scalar value that measures the similarity between A and B. It helps to run the code for this case and watch the values of the variables.

    Inputs

    A =
        3.0000    0.5500
        2.0000    0.9500
        1.0000         0
    B =
        3.0000    0.5500
        2.0000    0.0500
        4.0000    0.4000
    

    Step 1

    Tag the probabilities of A and B with their corresponding events, so that events missing from a matrix are set to NaN for that matrix. This gives tagged_probA and tagged_probB, with the values shown below -

    Event:          1       2       3       4
    tagged_probA    0      0.95    0.55    NaN
    tagged_probB   NaN     0.05    0.55    0.4
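
As a rough illustration of Step 1, the tagging can be reproduced for case 5 in a few lines. This sketch assumes MATLAB R2016b or later, so that implicit expansion can stand in for the bsxfun calls in SIMcalc. The names tagA and tagB are shorthand introduced here, not names from the function.

```matlab
% Case 5 inputs: column 1 holds the event, column 2 its probability
A = [3,0.55; 2,0.95; 1,0];
B = [3,0.55; 2,0.05; 4,0.4];

% Joint unique events as a row vector: 1 2 3 4
unq_events = unique([A(:,1); B(:,1)]).';

% Logical masks (rows = entries of A or B, columns = joint events).
% Implicit expansion (assumed R2016b+) replaces bsxfun(@eq,...)
tagA = A(:,1) == unq_events;
tagB = B(:,1) == unq_events;

% Probability per joint event; events missing from a matrix become NaN
tagged_probA = sum(A(:,2) .* tagA, 1);
tagged_probB = sum(B(:,2) .* tagB, 1);
tagged_probA(~any(tagA, 1)) = NaN;
tagged_probB(~any(tagB, 1)) = NaN;

% Rows: events, tagged_probA, tagged_probB
disp([unq_events; tagged_probA; tagged_probB])
```

An event that is present with probability 0 (event 1 in A) stays 0, while an absent event (event 4 in A) becomes NaN, which is what keeps the two situations apart in the later steps.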
    

    Step 2

    Calculate the absolute difference between the probabilities and subtract the result from 1. A number closer to 1 means a higher degree of similarity. In this example, event 3 gives 1 because the probabilities are identical, while event 2 gives 1 - |0.95 - 0.05| = 0.1. This forms the basis of the similarity criterion between A and B: we get 1 for identical probabilities and smaller values as the probabilities move farther apart on the scale of [0 1]. Events that are not shared stay NaN. This is stored in sim_factor -

    sim_factor =
           NaN    0.1000    1.0000       NaN
    

    Step 3

    Scale the tagged probabilities for A and B using sim_factor. Thus, we have the tagged probabilities scaled according to the similarities between A and B. These are –

    tagged_probA_sim_scaled =
           NaN    0.0950    0.5500       NaN
    tagged_probB_sim_scaled =
           NaN    0.0050    0.5500       NaN
    

    Step 4

    Since the final value is supposed to be a scalar, we first average the tagged and scaled probabilities of A and B for each event. For identical probabilities, as for event 3 in this example, the result is the same as the individual probability. For non-identical probabilities, the average is scaled down according to the dissimilarity between the A and B probabilities; for event 2 it is (0.095 + 0.005)/2 = 0.05. This is hybrid_probAB as shown below -

    hybrid_probAB =
           NaN    0.0500    0.5500       NaN
    

    Step 5

    Sum the non-NaN elements of hybrid_probAB to get the final scalar similarity value: 0.05 + 0.55 = 0.6, which is less than 1 for this specific case. It would have given us a perfect 1 for a case with identical events and identical probabilities.
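
Steps 2 to 5 collapse into a single expression per shared event: the mean of the two probabilities times one minus their absolute difference. The sketch below checks that against case 5. It uses intersect rather than the tagging in SIMcalc, and it assumes each event appears at most once per matrix (SIMcalc would sum duplicate rows instead). The names pA, pB and per_event are introduced here for illustration.

```matlab
% Case 5 inputs: column 1 holds the event, column 2 its probability
A = [3,0.55; 2,0.95; 1,0];
B = [3,0.55; 2,0.05; 4,0.4];

% Row indices of the events present in both A and B
[~, ia, ib] = intersect(A(:,1), B(:,1));
pA = A(ia,2);
pB = B(ib,2);

% Mean probability scaled by the similarity factor, per shared event
per_event = (pA + pB)/2 .* (1 - abs(pA - pB));

% Events that are not shared contribute nothing, like the NaNs skipped by nansum
SIM = sum(per_event);
fprintf('SIM = %.4f\n', SIM)
```

For case 5 this should print SIM = 0.6000, matching the result listed for that case.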

    Concluding remarks

    Looking at the SIM values, they do follow the expected trends. So, hopefully it would work out for your other cases. To calculate similarity values between A and other arrays, please run the function with them as inputs.
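
To get the vector asked for in the question, one option is to loop over the other matrices with cellfun. This is only a sketch: it assumes SIMcalc.m from above is saved on the MATLAB path, and the cell array name others is made up here.

```matlab
% Matrices from the question: column 1 = event, column 2 = probability
A = [1,0.6; 5,0.3; 4,0.1];
B = [1,0.5; 3,0.4; 2,0.1];
C = [2,0.9; 4,0.1; 3,0];
D = [1,0.6; 5,0.3; 4,0.1];

% Collect the matrices to compare against A
others = {B, C, D};

% One SIMcalc call per matrix (assumes SIMcalc.m is on the path);
% cellfun gathers the scalar results into a row vector
SIM = cellfun(@(X) SIMcalc(A, X), others)
```

With this measure, SIM should come out as roughly [0.495, 0.1, 1]: A and B share only event 1 (0.55 * 0.9 = 0.495), A and C share only event 4 with probability 0.1 in both, and D is identical to A.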