Optimization of equation parameter values such that largest distance between groups is created


For a particular gene scoring system I would like to set up a rudimentary plot such that new sample values that are entered immediately gravitate, based on multiple gene measurements, towards either a healthy or unhealthy group within the plot. Let's presume we have 5 people, each having 6 genes measured.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


# gene names are quoted and used as the index, so all data columns are numeric
df = pd.DataFrame([['A', 1, 1.2, 1.4, 2, 2],
                   ['B', 1.5, 1, 1.4, 1.3, 1.2],
                   ['C', 1, 1.2, 1.6, 2, 1.4],
                   ['D', 1.7, 1.5, 1.5, 1.5, 1.4],
                   ['E', 1.6, 1.9, 1.8, 3, 2.5],
                   ['F', 2, 2.2, 1.9, 2, 2]],
                  columns=['Gene', 'Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2']).set_index('Gene')

This creates the following table:

Gene  Healthy 1  Healthy 2  Healthy 3  Unhealthy 1  Unhealthy 2
A           1.0        1.2        1.4          2.0          2.0
B           1.5        1.0        1.4          1.3          1.2
C           1.0        1.2        1.6          2.0          1.4
D           1.7        1.5        1.5          1.5          1.4
E           1.6        1.9        1.8          3.0          2.5
F           2.0        2.2        1.9          2.0          2.0

The X and Y coordinates of each sample are then calculated by multiplying each gene's measured value by its parameter/weight and adding the contributions together. The first 4 genes (A-D) contribute to the Y value, whilst genes 5 and 6 (E and F) determine the X value. wA-wF are the parameters/weights associated with genes A-F.

wA = .15 
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60

for n in range(5):
    # genes A-D (rows 0-3) contribute to the Y coordinate of sample n
    y1 = df.iat[0, n]
    y2 = df.iat[1, n]
    y3 = df.iat[2, n]
    y4 = df.iat[3, n]

    TrueY = wA*y1 + wB*y2 + wC*y3 + wD*y4

    # genes E and F (rows 4-5) determine the X coordinate of sample n
    x1 = df.iat[4, n]
    x2 = df.iat[5, n]

    TrueX = wE*x1 + wF*x2

    result = (TrueX, TrueY)

    label = f"({TrueX},{TrueY})"

    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX, TrueY), textcoords="offset points", xytext=(0, 10), ha='center')

We thus calculate the coordinates of all samples and plot them.

Plot
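
The per-sample loop above can also be written in vectorized form. The following is a minimal sketch, assuming the gene names are kept as the index so the table holds only numbers; `weights`, `values`, `true_x`, `true_y` and `coords` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same measurements as in the table above; gene names are used as the index
# (an assumption of this sketch) so that every column is numeric.
df = pd.DataFrame(
    [[1.0, 1.2, 1.4, 2.0, 2.0],
     [1.5, 1.0, 1.4, 1.3, 1.2],
     [1.0, 1.2, 1.6, 2.0, 1.4],
     [1.7, 1.5, 1.5, 1.5, 1.4],
     [1.6, 1.9, 1.8, 3.0, 2.5],
     [2.0, 2.2, 1.9, 2.0, 2.0]],
    index=list("ABCDEF"),
    columns=["Healthy 1", "Healthy 2", "Healthy 3", "Unhealthy 1", "Unhealthy 2"],
)

weights = np.array([0.15, 0.25, 0.35, 0.45, 0.50, 0.60])  # wA .. wF

values = df.to_numpy()               # shape (6 genes, 5 samples)
true_y = weights[:4] @ values[:4]    # genes A-D -> one Y coordinate per sample
true_x = weights[4:] @ values[4:]    # genes E-F -> one X coordinate per sample

# One row per sample, holding its plot coordinates
coords = pd.DataFrame({"TrueX": true_x, "TrueY": true_y}, index=df.columns)
print(coords)
```

Each matrix product replaces one of the weighted sums in the loop, so `coords` holds the same (TrueX, TrueY) pairs that the loop plots one at a time.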

What I would now like to do is find out how I can optimize the wA-wF parameters/weights such that the healthy samples are pushed towards the origin of the plot, let's say (0,0), whilst the unhealthy samples are pushed towards a reasonable opposite point, let's say (1,1). I've looked into K-means and SVM, but as a novice coder and biochemist I was thoroughly overwhelmed and would appreciate any help available.


Solution

  • Here's an example using scipy.optimize combined with your code. (I made small adjustments along the way: the gene names are now the index, so the DataFrame holds only numbers.)

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    
    df = pd.DataFrame(np.array([[1, 1.2, 1.4, 2, 2],
                                [1.5, 1, 1.4, 1.3, 1.2],
                                [1, 1.2, 1.6, 2, 1.4],
                                [1.7, 1.5, 1.5, 1.5, 1.4],
                                [1.6, 1.9, 1.8, 3, 2.5],
                                [2, 2.2, 1.9, 2, 2]]),
                      columns=['Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'],
                      index=['A', 'B', 'C', 'D', 'E', 'F'])
    
    wA = .15
    wB = .25
    wC = .35
    wD = .45
    wE = .50
    wF = .60
    
    from scipy.optimize import minimize
    
    # use your given weights as the initial guess
    w0 = np.array([wA, wB, wC, wD, wE, wF])
    
    # the objective function to be minimized:
    # the sum of squared distances of the healthy samples to (0,0)
    # and of the unhealthy samples to (1,1)
    def fun(w):
        weighted = df.values * w[:, None]  # multiply each gene's row by its weight
        y = weighted[:4].sum(axis=0)       # all 5 "TrueY" coordinates (genes A-D)
        x = weighted[4:].sum(axis=0)       # all 5 "TrueX" coordinates (genes E-F)
        y[3:] -= 1                         # unhealthy samples: y offset from target 1
        x[3:] -= 1                         # unhealthy samples: x offset from target 1
        return np.sum(x**2 + y**2)         # sum of squared distances to the targets
    
    res = minimize(fun, w0)
    print(res)
    
    # assign the optimized weights back to your parameters
    wA, wB, wC, wD, wE, wF = res.x
    
    # this is mostly your unchanged code
    for n in range(5):
    
        y1 = df.iat[0,n]
        y2 = df.iat[1,n]
        y3 = df.iat[2,n]
        y4 = df.iat[3,n]
    
        TrueY = wA*y1+wB*y2+wC*y3+wD*y4
    
        x1 = df.iat[4,n]
        x2 = df.iat[5,n]
    
        TrueX = (wE*x1+wF*x2)
    
        result = (TrueX, TrueY)
    
        label = f"({TrueX:.3f},{TrueY:.3f})"
    
        plt.scatter(TrueX, TrueY, alpha=0.5)
        plt.annotate(label, (TrueX,TrueY), textcoords="offset points", xytext=(0,10), ha='center')
    
    plt.savefig("mygraph.png")
    

    This yields the parameters [ 1.21773653, 0.22185886, -0.39377451, -0.76513658, 0.86984207, -0.73166533] as the solution array. With these weights, the healthy samples cluster around (0,0) and the unhealthy samples around (1,1).

    You may want to experiment with other optimization methods; see the documentation of scipy.optimize.minimize.
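
As one way to try this, the sketch below reruns the fit with a few `method` choices and compares each result with a direct least-squares solution. The comparison works because the objective is a sum of squared residuals that are linear in the weights, so `np.linalg.lstsq` can serve as a reference optimum. Using it here is an addition for illustration, not part of the approach above, and the names `values`, `target`, `objective` and `w_ref` are introduced for this example only.

```python
import numpy as np
from scipy.optimize import minimize

# Rows are genes A-F, columns are Healthy 1-3 and Unhealthy 1-2
values = np.array([[1.0, 1.2, 1.4, 2.0, 2.0],
                   [1.5, 1.0, 1.4, 1.3, 1.2],
                   [1.0, 1.2, 1.6, 2.0, 1.4],
                   [1.7, 1.5, 1.5, 1.5, 1.4],
                   [1.6, 1.9, 1.8, 3.0, 2.5],
                   [2.0, 2.2, 1.9, 2.0, 2.0]])

# Target coordinate on both axes: 0 for healthy samples, 1 for unhealthy ones
target = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

def objective(w):
    """Sum of squared distances of all samples to their target points."""
    y = w[:4] @ values[:4]   # genes A-D -> Y coordinates
    x = w[4:] @ values[4:]   # genes E-F -> X coordinates
    return np.sum((x - target) ** 2 + (y - target) ** 2)

w0 = np.array([0.15, 0.25, 0.35, 0.45, 0.50, 0.60])  # initial guess wA .. wF

# Reference solution: the Y part and the X part are two independent
# linear least-squares problems.
w_y, *_ = np.linalg.lstsq(values[:4].T, target, rcond=None)
w_x, *_ = np.linalg.lstsq(values[4:].T, target, rcond=None)
w_ref = np.concatenate([w_y, w_x])
print("lstsq", objective(w_ref), w_ref)

# Iterative optimizers started from the same initial guess
for method in ("BFGS", "L-BFGS-B", "Nelder-Mead"):
    res = minimize(objective, w0, method=method)
    print(method, res.fun, res.x)
```

On this data the least-squares weights for genes E and F come out at about 0.870 and -0.732, in line with the last two entries of the solution array above. A method whose final objective value stays clearly above the reference has probably stopped before converging.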