
Selective sorting


I'm a Python newbie and I would like to implement a contingency table for binary or categorical lists (each list models one feature of a dataset). For those who don't know, a contingency table is a matrix whose generic element m_ij counts how many times value i of the first feature occurs in the same observation as value j of the second feature. Each distinct value of each feature therefore becomes a row or column header. My problem is with binary features: in that case, the contingency table must have the pair (1,0) as headers, in exactly this order.

_|1|0|
1| | |
0| | |

With the code I've written, however, this order is not guaranteed: if a binary feature has 0 as its first element, the corresponding header will not start with 1.

See my code:

def compute_contingency_table(first_f, second_f):
    '''
    This method computes the contingency table of two features
    :param first_f: first feature
    :param second_f: second feature
    :return: the contingency table
    '''

    first_values = get_values(first_f)
    second_values = get_values(second_f)
    contingency_table = np.zeros([len(first_values), len(second_values)])
    corresponding_values = []

    # for each value of the first feature
    for h in range(len(first_values)):

        # find all the indices at which it occurs
        f_indices = [i for i, x in enumerate(first_f) if x == second_f[h]]

        # save the corresponding values in the second feature
        for ind in f_indices:
            corresponding_values.append(second_f[ind])

        # fill in the contingency table
        # for each value in the corresponding values of the second feature
        for val in corresponding_values:
            # take its index in the values list (i.e. the column of the contingency table)
            k = second_values.index(val)

            # increment the value of the corresponding contingency table element
            contingency_table[h, k] += 1

        del corresponding_values[:]

    return contingency_table

Use case:

first_f=[1,0,0,0,0,0,0]
second_f=[0,1,0,0,0,1,0]

Contingency table output by my code:

[[ 4.  2.]
 [ 1.  0.]]

While it should be:

 [[ 0.  1.]
 [ 2.  4.]]

As you can see, this happens because the output table is of type

_|0|1|
0| | |
1| | |

It should sort the headers in (1,0) order when the feature is binary, and not sort them at all when it is categorical. That is what I mean by selective sorting.
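The rule described above can be written as a small helper. This is only a minimal sketch: `order_headers` is a hypothetical name, and it assumes it receives the distinct values of one feature as a list.

```python
def order_headers(values):
    """Order the distinct values of one feature for use as table headers."""
    # Binary feature: force the fixed (1, 0) order regardless of input order.
    if set(values) == {0, 1}:
        return [1, 0]
    # Categorical feature: leave the order untouched (no sorting).
    return list(values)


print(order_headers([0, 1]))                       # [1, 0]
print(order_headers(['usa', 'china', 'germany']))  # ['usa', 'china', 'germany']
```

Comparing as a set means the check does not depend on whether 0 or 1 was seen first.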


Solution

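  • It can be done this way: swap the headers to [1, 0] whenever a feature's values come back as [0, 1], and leave categorical headers as they are: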

    import numpy as np

    def compute_contingency_table(first_f, second_f):
        '''
        This method computes the contingency table of two features
        :param first_f: first feature
        :param second_f: second feature
        :return: the contingency table
        '''
        # get_values (not shown) is assumed to return the distinct values of a feature
        first_values = get_values(first_f)
        second_values = get_values(second_f)

        # binary features: force the (1, 0) header order
        if first_values == [0, 1]:
            first_values = [1, 0]
        if second_values == [0, 1]:
            second_values = [1, 0]

        contingency_table = np.zeros([len(first_values), len(second_values)])
        corresponding_values = []
        for i in range(len(first_values)):

            # indices at which the i-th value of the first feature occurs
            f_indices = [k for k, x in enumerate(first_f) if x == first_values[i]]
            # values of the second feature at those indices
            for ind in f_indices:
                corresponding_values.append(second_f[ind])

            # increment the matching cell for each corresponding value
            for s_val in corresponding_values:
                k = second_values.index(s_val)
                contingency_table[i, k] += 1
            del corresponding_values[:]

        return contingency_table
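
The post does not show `get_values`. A minimal sketch of a stand-in, assuming it returns each distinct value once in order of first appearance (an assumption, although it is consistent with the outputs of the two use cases that follow):

```python
def get_values(feature):
    """Hypothetical stand-in for the helper the post does not show."""
    # dict keys preserve insertion order (Python 3.7+), so this drops
    # duplicates while keeping each value at its first position.
    return list(dict.fromkeys(feature))


print(get_values(['black', 'blonde', 'red', 'blonde', 'red', 'red', 'brown']))
# ['black', 'blonde', 'red', 'brown']
```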
    

    Use case 1:

    hair=['black', 'blonde', 'red', 'blonde', 'red', 'red', 'brown']
    country = ['usa', 'china', 'usa', 'germany', 'germany','china', 'usa']
    print(compute_contingency_table(hair,country))
    

    OUTPUT

    [[ 1.  0.  0.]
     [ 0.  1.  1.]
     [ 1.  1.  1.]
     [ 1.  0.  0.]]
    

    Use case 2:

    a = [1, 0, 0, 0, 0, 0, 0]
    b = [0, 0, 0, 1, 1, 0, 0]
    print(compute_contingency_table(a,b))
    

    OUTPUT

    [[ 0.  1.]
     [ 2.  4.]]
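
As a possible variation, the same counts can be collected in a single pass over the paired observations, with the header order passed in explicitly. This is a sketch and not the code from the answer: `contingency_table`, `row_headers` and `col_headers` are hypothetical names, and it returns plain nested lists of integers instead of a NumPy array.

```python
def contingency_table(first_f, second_f, row_headers, col_headers):
    """Count co-occurrences in one pass; the headers fix the row/column order."""
    # Map each header value to its row or column position.
    row_of = {value: i for i, value in enumerate(row_headers)}
    col_of = {value: j for j, value in enumerate(col_headers)}
    table = [[0] * len(col_headers) for _ in row_headers]
    # Each observation is one (first, second) pair; bump its cell.
    for x, y in zip(first_f, second_f):
        table[row_of[x]][col_of[y]] += 1
    return table


a = [1, 0, 0, 0, 0, 0, 0]
b = [0, 0, 0, 1, 1, 0, 0]
print(contingency_table(a, b, [1, 0], [1, 0]))  # [[0, 1], [2, 4]]
```

Looking positions up in a dictionary avoids rescanning the feature list for every header value, which matters only for long lists.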