python machine-learning optimization plot mathematical-optimization

Optimization of equation parameter values such that largest distance between groups is created

For a particular gene scoring system I would like to set up a rudimentary plot such that new sample values that are entered immediately gravitate, based on multiple gene measurements, towards either a healthy or unhealthy group within the plot. Let's presume we have 5 people, each having 6 genes measured.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


df = pd.DataFrame([['A', 1, 1.2, 1.4, 2, 2],
                   ['B', 1.5, 1, 1.4, 1.3, 1.2],
                   ['C', 1, 1.2, 1.6, 2, 1.4],
                   ['D', 1.7, 1.5, 1.5, 1.5, 1.4],
                   ['E', 1.6, 1.9, 1.8, 3, 2.5],
                   ['F', 2, 2.2, 1.9, 2, 2]],
                  columns=['Gene', 'Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'])

This creates the following table:

Gene	Healthy 1	Healthy 2	Healthy 3	Unhealthy 1	Unhealthy 2
A	1.0	1.2	1.4	2.0	2.0
B	1.5	1.0	1.4	1.3	1.2
C	1.0	1.2	1.6	2.0	1.4
D	1.7	1.5	1.5	1.5	1.4
E	1.6	1.9	1.8	3.0	2.5
F	2.0	2.2	1.9	2.0	2.0

The X and Y coordinates of each sample are then calculated by summing the contributions of the genes, where each contribution is the gene's parameter/weight multiplied by its measured value. The first 4 genes (A-D) contribute towards the Y value, whilst genes 5 and 6 (E and F) determine the X value. wA-wF are the parameters/weights associated with their gene A-F counterparts.

wA = .15 
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60

for n in range(1, 6):  # column 0 holds the gene labels, the samples are in columns 1-5

    y1 = df.iat[0,n]
    y2 = df.iat[1,n]
    y3 = df.iat[2,n]
    y4 = df.iat[3,n]

    TrueY = wA*y1+wB*y2+wC*y3+wD*y4

    x1 = df.iat[4,n]
    x2 = df.iat[5,n]

    TrueX = (wE*x1+wF*x2)

    result = (TrueX, TrueY)

    label = f"({TrueX},{TrueY})"

    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX,TrueY), textcoords="offset points", xytext=(0,10), ha='center')

We thus calculate all the coordinates and plot them.

Plot

What I would now like to do is find out how I can optimize the wA-wF parameters/weights such that the healthy samples are pushed towards the origin of the plot, let's say (0,0), whilst the unhealthy samples are pushed towards a reasonable opposite point, let's say (1,1). I've looked into K-means/SVM, but as a novice coder/biochemist I was thoroughly overwhelmed and would appreciate any help available.

Solution

Here's an example using scipy.optimize combined with your code. (I made small corrections to your code; in particular, the gene labels now form the DataFrame index, so that the values are purely numeric.)

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.DataFrame(np.array([[1, 1.2, 1.4, 2, 2],
                            [1.5, 1, 1.4, 1.3, 1.2],
                            [1, 1.2, 1.6, 2, 1.4],
                            [1.7, 1.5, 1.5, 1.5, 1.4],
                            [1.6, 1.9, 1.8, 3, 2.5],
                            [2, 2.2, 1.9, 2, 2]]),
                  columns=['Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'],
                  index=['A', 'B', 'C', 'D', 'E', 'F'])  # a flat list gives a plain (non-Multi) index

wA = .15
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60

from scipy.optimize import minimize

# use your given weights as the initial guess
w0 = np.array([wA, wB, wC, wD, wE, wF])

# the objective function to be minimized
# - it computes the sum of the samples' squared distances to their target:
#   (0,0) for the healthy samples, (1,1) for the unhealthy ones
def fun(w):
    weighted = df.values*w[:, None] # multiply all sample values by their weight
    y = weighted[:4].sum(axis=0)    # compute all 5 "TrueY" coordinates (genes A-D)
    x = weighted[4:].sum(axis=0)    # compute all 5 "TrueX" coordinates (genes E-F)
    y[3:] -= 1                      # shift the "Unhealthy" y so that target y=1 maps to 0
    x[3:] -= 1                      # shift the "Unhealthy" x so that target x=1 maps to 0
    return np.sum(x**2+y**2)        # return the sum of (squared) distances

res = minimize(fun, w0)
print(res)

# assign the optimized weights back to your parameters
wA, wB, wC, wD, wE, wF = res.x

# this is mostly your unchanged code
for n in range(5):

    y1 = df.iat[0,n]
    y2 = df.iat[1,n]
    y3 = df.iat[2,n]
    y4 = df.iat[3,n]

    TrueY = wA*y1+wB*y2+wC*y3+wD*y4

    x1 = df.iat[4,n]
    x2 = df.iat[5,n]

    TrueX = (wE*x1+wF*x2)

    result = (TrueX, TrueY)

    label = f"({TrueX:.3f},{TrueY:.3f})"

    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX,TrueY), textcoords="offset points", xytext=(0,10), ha='center')

plt.savefig("mygraph.png")

This yields the parameters [ 1.21773653, 0.22185886, -0.39377451, -0.76513658, 0.86984207, -0.73166533] as the solution array. With these weights, the healthy samples cluster around (0,0) and the unhealthy samples around (1,1) in the resulting plot.
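To place a new sample with these weights, the same weighted sums can be applied to its six measurements and the resulting point compared with the two targets. The following is only a minimal sketch: the helper names `project` and `nearest_group` are made up for illustration, the nearest-target rule is an assumption rather than part of the optimization, and the "new" sample simply reuses the Unhealthy 1 column from the table.

```python
import numpy as np

# Optimized weights wA..wF as reported above
w = np.array([1.21773653, 0.22185886, -0.39377451,
              -0.76513658, 0.86984207, -0.73166533])

def project(sample, w):
    """Map six gene values (A-F) to plot coordinates (x, y). Hypothetical helper."""
    sample = np.asarray(sample, dtype=float)
    y = np.dot(w[:4], sample[:4])   # genes A-D determine y
    x = np.dot(w[4:], sample[4:])   # genes E-F determine x
    return x, y

def nearest_group(point):
    """Return the label of the closer target point. Hypothetical helper."""
    targets = {"healthy": (0.0, 0.0), "unhealthy": (1.0, 1.0)}
    return min(targets,
               key=lambda name: np.hypot(point[0] - targets[name][0],
                                         point[1] - targets[name][1]))

# Stand-in for a new measurement: the values of "Unhealthy 1"
new_sample = [2.0, 1.3, 2.0, 1.5, 3.0, 2.0]
point = project(new_sample, w)
print(point, nearest_group(point))
```

With only five training samples, such an assignment should be read as a rough indication rather than a validated classification.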

You may want to experiment with other optimization methods - see scipy.optimize.minimize.
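As one such alternative: because every coordinate is a linear combination of the weights and the objective is a sum of squared deviations from the targets, this particular problem is an ordinary linear least-squares fit, and the Y weights (genes A-D) and X weights (genes E-F) can be fitted independently. The sketch below uses `np.linalg.lstsq` in place of `minimize`; the variable names `data`, `target`, `w_y` and `w_x` are introduced here for illustration only. It should agree with the weights reported above up to solver tolerance.

```python
import numpy as np

# Gene measurements from the question: rows are genes A-F, columns are the 5 samples
data = np.array([[1, 1.2, 1.4, 2, 2],
                 [1.5, 1, 1.4, 1.3, 1.2],
                 [1, 1.2, 1.6, 2, 1.4],
                 [1.7, 1.5, 1.5, 1.5, 1.4],
                 [1.6, 1.9, 1.8, 3, 2.5],
                 [2, 2.2, 1.9, 2, 2]])

# Target coordinate per sample: 0 for the three healthy, 1 for the two unhealthy
# (the same target applies to x and y, since the target points are (0,0) and (1,1))
target = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

# Solve min ||data[:4].T @ w_y - target||^2 and min ||data[4:].T @ w_x - target||^2
w_y, *_ = np.linalg.lstsq(data[:4].T, target, rcond=None)
w_x, *_ = np.linalg.lstsq(data[4:].T, target, rcond=None)

w = np.concatenate([w_y, w_x])  # same order as wA..wF
print(w)
```

This avoids the need for an initial guess, but it only applies as long as the objective stays a plain sum of squares; adding constraints or a different distance measure would call for `minimize` again.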
