How can we measure the similarity (distance) between categorical data?


Example:
Gender: Male, Female
Numerical values: [0 - 100], [200 - 300]
Strings: Professionals, beginners, etc.

Thanks in advance.


Solution

  • There are different ways to do this. One of the simplest would be as follows.

    1) Assign a numeric value to each property value so that, where possible, the numeric order matches the meaning of the property. If a property can be measured or ranked, order its values from lower to higher. If it cannot and the property is purely categorical (like gender or profession), just assign a distinct number to each possible value.

    P1 - Gender
    -------------------
    0 - Male
    1 - Female
    
    P2 - Experience
    -----------
    0 - Beginner
    5 - Average
    10 - Professional
    
    P3 - Age
    -----------
    [0 - 100]
    
    P4 - Body height, cm
    -----------
    [50 - 250]
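
As a rough illustration of step 1, the mapping from labels to numbers could be sketched in Python as follows. The dictionary names and the `encode` helper are hypothetical and chosen only for this example; the numeric codes are the ones listed above.

```python
# Hypothetical lookup tables using the codes from the lists above.
GENDER = {"male": 0, "female": 1}
EXPERIENCE = {"beginner": 0, "average": 5, "professional": 10}

def encode(gender, experience, age, height_cm):
    """Turn one person's raw properties into the tuple (P1, P2, P3, P4)."""
    return (
        GENDER[gender.lower()],          # P1: categorical, arbitrary codes
        EXPERIENCE[experience.lower()],  # P2: ordered from lower to higher
        age,                             # P3: already numeric
        height_cm,                       # P4: already numeric
    )

print(encode("Male", "Beginner", 20, 190))
```

An unknown label raises a `KeyError` here, which is one simple way to catch values that have no assigned number yet.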
    

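    2) For each property, find a scale factor and an offset so that every property spans a range of the same width, say 100. Note that when the offset below is applied as S * P + O (step 3), it only shifts the values and cancels out when two vectors are subtracted, so it does not affect the distance. To map each property exactly onto [0 - 100], use Ox = -Sx * Px min instead.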

    Sx = 100 / (Px max - Px min)
    Ox = -Px min
    

    For the sample above you would get:

    S1 = 100
    O1 = 0
    
    S2 = 10
    O2 = 0
    
    S3 = 1
    O3 = 0
    
    S4 = 0.5
    O4 = -50
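
The scale factors and offsets above can be reproduced with a few lines of Python. This is only a sketch that follows the two formulas as written; `scale_and_offset` and `RANGES` are illustrative names, not part of any library.

```python
# Hypothetical helper: names and structure are illustrative only.
def scale_and_offset(p_min, p_max, width=100):
    """Return (S, O) as defined above: S = width / (max - min), O = -min."""
    if p_max <= p_min:
        raise ValueError("p_max must be greater than p_min")
    return width / (p_max - p_min), -p_min

# (min, max) for P1..P4 from step 1.
RANGES = {"P1": (0, 1), "P2": (0, 10), "P3": (0, 100), "P4": (50, 250)}

factors = {name: scale_and_offset(lo, hi) for name, (lo, hi) in RANGES.items()}
for name, (s, o) in factors.items():
    print(f"{name}: S = {s:g}, O = {o}")
```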
    

    3) Now you can create a vector containing all the property values.

    V = (S1 * P1 + O1, S2 * P2 + O2, S3 * P3 + O3, S4 * P4 + O4)
    

    For the sample above it would be:

    V = (100 * P1, 10 * P2, P3, 0.5 * P4 - 50)
    

    4) Now you can compare two vectors V1 and V2 by subtracting one from the other. The length of the resulting vector tells you how different they are.

    delta = |V1 - V2|
    

    Vectors are subtracted dimension by dimension. The length of a vector is the square root of the sum of its squared dimensions.
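
A minimal Python sketch of this distance, assuming both vectors are plain sequences of numbers of equal length (the function name `delta` simply mirrors the notation above):

```python
import math

def delta(v1, v2):
    """Length of the difference vector v1 - v2 (Euclidean distance)."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same number of dimensions")
    # Subtract dimension by dimension, then take the vector length.
    diff = [a - b for a, b in zip(v1, v2)]
    return math.sqrt(sum(d * d for d in diff))

print(delta((0, 0), (3, 4)))  # 5.0
```

On Python 3.8 or newer, the standard library function `math.dist(v1, v2)` computes the same value.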

    Imagine we have three people:

    John
    P1 = 0 (male)
    P2 = 0 (beginner)
    P3 = 20 (20 years old)
    P4 = 190 (body height is 190 cm)
    
    Kevin
    P1 = 0 (male)
    P2 = 10 (professional)
    P3 = 25 (25 years old)
    P4 = 186 (body height is 186 cm)
    
    Lea
    P1 = 1 (female)
    P2 = 10 (professional)
    P3 = 40 (40 years old)
    P4 = 178 (body height is 178 cm)
    

    Vectors would be:

    J = (100 * 0, 10 * 0, 20, 0.5 * 190 - 50) = (0, 0, 20, 45)
    K = (100 * 0, 10 * 10, 25, 0.5 * 186 - 50) = (0, 100, 25, 43)
    L = (100 * 1, 10 * 10, 40, 0.5 * 178 - 50) = (100, 100, 40, 39)
    

    To determine how different they are, we need to subtract the vectors:

    delta JK = |J - K| =
    = |(0 - 0, 0 - 100, 20 - 25, 45 - 43)| = 
    = |(0, -100, -5, 2)| =
    = SQRT(0 ^ 2 + (-100) ^ 2 + (-5) ^ 2 + 2 ^ 2) = 
    = SQRT(10000 + 25 + 4) = 
    = 100.14
    
    delta KL = |K - L| = 
    = |(0 - 100, 100 - 100, 25 - 40, 43 - 39)| = 
    = |(-100, 0, -15, 4)| =
    = SQRT((-100) ^ 2 + 0 ^ 2 + (-15) ^ 2 + 4 ^ 2) =
    = SQRT(10000 + 225 + 16) =
    = 101.20
    
    delta LJ = |L - J| = 
    = |(100 - 0, 100 - 0, 40 - 20, 39 - 45)| = 
    = |(100, 100, 20, -6)| =
    = SQRT(100 ^ 2 + 100 ^ 2 + (20) ^ 2 + (-6) ^ 2) =
    = SQRT(10000 + 10000 + 400 + 36) =
    = 142.95
    

    From this you can see that John and Kevin are the most similar pair, since their delta is the smallest.
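
To check the arithmetic, the whole example can be run end to end. The sketch below applies the formulas exactly as written in steps 2 to 4; `RANGES`, `to_vector` and `delta` are names made up for this illustration.

```python
import math
from itertools import combinations

# (min, max) of each property, in the order P1..P4 from step 1.
RANGES = [(0, 1), (0, 10), (0, 100), (50, 250)]

def to_vector(props, ranges=RANGES, width=100):
    """Apply Vx = Sx * Px + Ox with Sx = width / (max - min) and Ox = -min."""
    return tuple(width / (hi - lo) * p - lo for p, (lo, hi) in zip(props, ranges))

def delta(v1, v2):
    """Length of the difference vector v1 - v2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

people = {
    "John": (0, 0, 20, 190),
    "Kevin": (0, 10, 25, 186),
    "Lea": (1, 10, 40, 178),
}

vectors = {name: to_vector(p) for name, p in people.items()}
distances = {
    (a, b): delta(vectors[a], vectors[b]) for a, b in combinations(vectors, 2)
}

# Print the pairs from most similar to least similar.
for (a, b), d in sorted(distances.items(), key=lambda kv: kv[1]):
    print(f"{a} - {b}: {d:.2f}")

closest = min(distances, key=distances.get)
```

Sorting the pairs by distance makes the comparison explicit, and the same loop works unchanged if more people or properties are added.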