
Adding a new pandas DataFrame column based on row-wise operations


I have a Dataframe like this:

    Interesting  genre_1      probabilities
1   no           Empty        0.251306
2   yes          Empty        0.042043
3   no           Alternative  5.871099
4   yes          Alternative  5.723896
5   no           Blues        0.027028
6   yes          Blues        0.120248
7   no           Children's   0.207213
8   yes          Children's   0.426679
9   no           Classical    0.306316
10  yes          Classical    1.044135

I would like to compute the Gini index within each genre_1 category, using the probabilities of that category's 'no' and 'yes' rows in the Interesting column as the two classes. After that, I would like to store the value in a new pandas column.

This is the function to get the Gini index:

#Gini Function
#a and b are the quantities of each class
def gini(a,b):
    a1 = (a/(a+b))**2
    b1 = (b/(a+b))**2
    return 1 - (a1 + b1) 

EDIT: Sorry, I had an error in my desired final DataFrame. Whether a row is interesting or not determines which value is passed as prob(A) and which as prob(B), but the Gini score is the same either way, because it measures how much impurity we get when classifying a song as interesting or not. If the two probabilities are close to 50/50, the Gini score reaches its maximum (0.5), because choosing interesting or not is then equally likely to be wrong.
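
Why the order of the arguments does not matter, and why the score tops out at 0.5, can be seen by rewriting the function above in terms of the share p = a/(a+b). The symbols p and G are introduced here only for illustration:

```latex
\[
G(a,b) = 1 - p^2 - (1-p)^2 = 2p(1-p), \qquad p = \frac{a}{a+b}
\]
\[
\frac{dG}{dp} = 2 - 4p = 0 \;\Rightarrow\; p = \tfrac{1}{2}, \qquad G\big|_{p=1/2} = \tfrac{1}{2}
\]
```

Swapping a and b replaces p with 1 - p, which leaves 2p(1-p) unchanged.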

So for the first two rows, the Gini index will be:

Interesting=no,  genre_1=Empty -> gini(0.251306, 0.042043) = 0.245559831601612
Interesting=yes, genre_1=Empty -> gini(0.042043, 0.251306) = 0.245559831601612
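
The two values above can be checked with a short, self-contained sketch. It repeats the gini function from the question; the variable names no_prob, yes_prob, g_no and g_yes are illustrative only:

```python
def gini(a, b):
    # a and b are the quantities of each class
    total = a + b
    return 1 - ((a / total) ** 2 + (b / total) ** 2)

# Probabilities of the two 'Empty' rows from the table
no_prob, yes_prob = 0.251306, 0.042043

g_no = gini(no_prob, yes_prob)   # row 1 (Interesting = no)
g_yes = gini(yes_prob, no_prob)  # row 2 (Interesting = yes), arguments swapped

print(g_no, g_yes)   # both rows get the same score
print(gini(1, 1))    # an even split gives the maximum, 0.5
```

Because the arguments only enter through their shares of the total, swapping them leaves the result unchanged.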

Then I would like to get something like:

    Interesting  genre_1      probabilities  GINI INDEX
1   no           Empty        0.251306       0.245559831601612
2   yes          Empty        0.042043       0.245559831601612
3   no           Alternative  5.871099       0.4999194135183881
4   yes          Alternative  5.723896       0.4999194135183881
5   no           Blues        0.027028       ..
6   yes          Blues        0.120248
7   no           Children's   0.207213
8   yes          Children's   0.426679
9   no           Classical    0.306316       ..
10  yes          Classical    1.044135       ..

Solution

  • OK, I think I know what you mean. The first version of the code below does not check whether the Interesting value is 'yes' or 'no'. What you want, as I understand it, is to calculate the Gini coefficient in one of two ways for each row, depending on that row's Interesting value. If Interesting is 'no', the result is 0.5, because a == b. If Interesting is 'yes', you need to use a = probabilities[i] and b = probabilities[i+1]. So you can skip this first version and go straight to the updated code further down.

    import pandas as pd
    
    
    # sep=r'\s+' splits on whitespace; delim_whitespace is deprecated in recent pandas
    df = pd.read_csv('df.txt', sep=r'\s+')
    
    # A NumPy array gives positional indexing regardless of the DataFrame's index labels
    probs = df['probabilities'].to_numpy()
    
    
    def ROLLING_GINI(probabilities):
    
        a1 = (probabilities[0]/(probabilities[0]+probabilities[0]))**2
        b1 = (probabilities[0]/(probabilities[0]+probabilities[0]))**2
        res = 1 - (a1 + b1)
        yield res
    
        for i in range(len(probabilities)-1):
            a1 = (probabilities[i]/(probabilities[i]+probabilities[i+1]))**2
            b1 = (probabilities[i+1]/(probabilities[i]+probabilities[i+1]))**2
            res = 1 - (a1 + b1)
            yield res
    
    
    df['GINI'] = [val for val in ROLLING_GINI(probs)]
    
    print(df)
    

    This is where the real trouble starts: if I understand your idea correctly, the last Gini value cannot be calculated from your DataFrame. The important bit is that the last Interesting value is 'yes', which means I have to use a = probabilities[i] and b = probabilities[i+1]. But your DataFrame has no row 11. It has 10 rows, so at row 10 you would need a probability from row 11 to calculate the coefficient. For your idea to work with this indexing, the last Interesting value must be 'no'; otherwise you will always get an index error.
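
The missing "next row" can be made visible with a small sketch. It uses only the last two probabilities from the table, and the names next_probs, values and raised are illustrative, not part of the code in this answer:

```python
import pandas as pd

# Last two rows of the example table (Classical: no, yes)
probs = pd.Series([0.306316, 1.044135])

# shift(-1) aligns every row with the value from the row after it;
# the final row has no successor, so it gets NaN instead of a number
next_probs = probs.shift(-1)
print(next_probs.isna().tolist())

# Positional access one past the end raises IndexError
values = probs.to_numpy()
try:
    values[len(values)]
    raised = False
except IndexError:
    raised = True
print(raised)
```

Either way, a row that needs the following row's value has nothing to pair with when it is the last one.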

    Here's the code anyway:

    import pandas as pd
    
    # sep=r'\s+' splits on whitespace; delim_whitespace is deprecated in recent pandas
    df = pd.read_csv('df.txt', sep=r'\s+')
    
    
    def ROLLING_GINI(dataframe):
    
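        # NumPy arrays give positional indexing regardless of the DataFrame's index labels
        probabilities = dataframe['probabilities'].to_numpy()
        how_to_calculate = dataframe['Interesting'].to_numpy()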
    
        for i in range(len(dataframe)-1):
    
            if how_to_calculate[i] == 'yes':
                a1 = (probabilities[i]/(probabilities[i]+probabilities[i+1]))**2
                b1 = (probabilities[i+1]/(probabilities[i]+probabilities[i+1]))**2
                res = 1 - (a1 + b1)
                yield res
    
            elif how_to_calculate[i] == 'no':
                a1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
                b1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
                res = 1 - (a1 + b1)
                yield res
    
    
    GINI = [val for val in ROLLING_GINI(df)]
    
    print('All GINI coefficients: %s'%GINI)
    print('Length of all calculable GINI coefficients: %s'%len(GINI))
    print('Number of rows in the dataframe: %s'%len(df))
    print('The last Interesting value is: %s'%df['Interesting'].iloc[-1])
    

    EDIT NUMBER THREE (Sorry for the late realization):

    So it does work if I apply the indexing correctly. My mistake was pairing each 'yes' row with the next probability instead of the previous one. The correct pairing is a = probabilities[i-1] and b = probabilities[i].

    import pandas as pd
    
    # sep=r'\s+' splits on whitespace; delim_whitespace is deprecated in recent pandas
    df = pd.read_csv('df.txt', sep=r'\s+')
    
    
    def ROLLING_GINI(dataframe):
    
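        # NumPy arrays give positional indexing regardless of the DataFrame's index labels
        probabilities = dataframe['probabilities'].to_numpy()
        how_to_calculate = dataframe['Interesting'].to_numpy()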
    
        for i in range(len(dataframe)):
    
            if how_to_calculate[i] == 'yes':
                a1 = (probabilities[i-1]/(probabilities[i-1]+probabilities[i]))**2
                b1 = (probabilities[i]/(probabilities[i-1]+probabilities[i]))**2
                res = 1 - (a1 + b1)
                yield res
    
            elif how_to_calculate[i] == 'no':
                a1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
                b1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
                res = 1 - (a1 + b1)
                yield res
    
    
    GINI = [val for val in ROLLING_GINI(df)]
    
    print('All GINI coefficients: %s'%GINI)
    print('Length of all calculable GINI coefficients: %s'%len(GINI))
    print('Number of rows in the dataframe: %s'%len(df))
    print('The last Interesting value is: %s'%df['Interesting'].iloc[-1])
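
If the goal is the table from the question, where both rows of a genre share one score, an alternative is to group by genre_1 rather than walk the rows by position. This is a minimal sketch, assuming each genre_1 value has exactly one 'no' row and one 'yes' row. The helper name gini_of_group and the inline sample rows (the first four rows of the table) are illustrative and not part of the original post:

```python
import pandas as pd

# First four rows of the example table, built inline so the sketch is self-contained
df = pd.DataFrame({
    'Interesting': ['no', 'yes', 'no', 'yes'],
    'genre_1': ['Empty', 'Empty', 'Alternative', 'Alternative'],
    'probabilities': [0.251306, 0.042043, 5.871099, 5.723896],
})


def gini_of_group(probs):
    # probs holds the probabilities of one genre_1 group
    # (assumed here: the 'no' row and the 'yes' row)
    shares = probs / probs.sum()
    return 1 - (shares ** 2).sum()


# transform broadcasts the per-group scalar back onto every row of that group
df['GINI'] = df.groupby('genre_1')['probabilities'].transform(gini_of_group)

print(df)
```

With two rows per group this computes the same quantity as gini(a, b) from the question, and it avoids the i-1 / i+1 bookkeeping because no row depends on its position in the frame.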