python, numpy, geometry, cluster-analysis, computational-geometry

Divide one-to-two x-y data into top and bottom sets


I have a data set with two y values associated with each x value. How can I divide the data into "upper" and "lower" values?

Below is an example of such a data set, along with a plot of the desired "top" and "bottom" groupings (red is the top, purple is the bottom). My best idea so far is to find a line dividing the top and bottom data using an iterative approach. That solution is complicated and does not work very well, so I did not include it.

import matplotlib.pyplot as plt
import numpy as np

# construct data using piecewise functions
x1 = np.linspace(0, 0.7, 70)
x2 = np.linspace(0.7, 1, 30)
x3 = np.linspace(0.01, 0.999, 100)
y1 = 4.164 * x1 ** 3
y2 = 1 / x2
y3 = x3 ** 4 - 0.1

# concatenate data
x = np.concatenate([x1, x2, x3])
y = np.concatenate([y1, y2, y3])

# I want to be able to divide the data into top and bottom,
#  as shown in the chart. The black is the unlabeled data
#  and the red and purple show the top and bottom
plt.scatter(x, y, marker='^', s=10, c='k')
plt.scatter(x1, y1, marker='x', s=0.8, c='r')
plt.scatter(x2, y2, marker='x', s=0.8, c='r')
plt.scatter(x3, y3, marker='x', s=0.8, c='purple')
plt.show()



Solution

  • You can create a dividing line by reordering your data. Sort everything by x, then apply a Gaussian filter to the sorted y values. Because the top and bottom points alternate along x, the smoothed curve runs between them, and for this data every point lies strictly above or below it:

    import matplotlib.pyplot as plt
    from scipy.ndimage import gaussian_filter1d  # scipy.ndimage.filters is deprecated
    import numpy as np
    
    # construct data using piecewise functions
    x1 = np.linspace(0, 0.7, 70)
    x2 = np.linspace(0.7, 1, 30)
    x3 = np.linspace(0.01, 0.999, 100)
    y1 = 4.164 * x1 ** 3
    y2 = 1 / x2
    y3 = x3 ** 4 - 0.1
    
    # concatenate data
    x = np.concatenate([x1, x2, x3])
    y = np.concatenate([y1, y2, y3])
    
    # sort by x so the top and bottom points interleave, then smooth y;
    #  the smoothed curve (orange in the chart) runs between the two sets
    idx = np.argsort(x)
    newy = y[idx]
    newx = x[idx]
    gf = gaussian_filter1d(newy, 5)  # sigma = 5 samples
    plt.scatter(x, y, marker='^', s=10, c='k')
    plt.scatter(x1, y1, marker='x', s=0.8, c='r')
    plt.scatter(x2, y2, marker='x', s=0.8, c='r')
    plt.scatter(x3, y3, marker='x', s=0.8, c='purple')
    plt.scatter(newx, gf, c='orange')
    plt.show()
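
The code above only draws the dividing curve. To get the two groups as arrays, one option is to compare each sorted y value with the filtered value at the same position. This is a minimal sketch under the same assumptions as the answer (same data, `sigma=5`); the names `is_top`, `x_top`, `y_top`, `x_bottom`, and `y_bottom` are illustrative and not part of the original answer.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Same synthetic data as in the question.
x1 = np.linspace(0, 0.7, 70)
x2 = np.linspace(0.7, 1, 30)
x3 = np.linspace(0.01, 0.999, 100)
x = np.concatenate([x1, x2, x3])
y = np.concatenate([4.164 * x1 ** 3, 1 / x2, x3 ** 4 - 0.1])

# Sort by x and smooth y; sigma=5 is the value used in the answer.
idx = np.argsort(x)
newx, newy = x[idx], y[idx]
gf = gaussian_filter1d(newy, 5)

# Points above the smoothed curve are labeled "top", the rest "bottom".
is_top_sorted = newy > gf

# Map the labels back to the original (unsorted) order of x and y.
is_top = np.empty_like(is_top_sorted)
is_top[idx] = is_top_sorted

x_top, y_top = x[is_top], y[is_top]
x_bottom, y_bottom = x[~is_top], y[~is_top]
```

The mask relies on the two sets having similar sampling density along x and a gap that is large compared with how much either curve changes within the filter window. For other data, `sigma` would likely need tuning.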
    
