python pandas machine-learning statistics dbscan

How to get the second derivative/dip from the graph or generate the best eps value


Dataset is below

 ,id,revenue ,profit
0,101,779183,281257
1,101,144829,838451
2,101,766465,757565
3,101,353297,261071
4,101,1615461,275760
5,101,246731,949229
6,101,951518,301016
7,101,444669,430583

Code is below

import pandas as pd
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
import seaborn as sns
from sklearn.neighbors import NearestNeighbors
df = pd.read_csv('1.csv',index_col=None)
df1 = StandardScaler().fit_transform(df)
dbsc = DBSCAN(eps = 2.5, min_samples = 20).fit(df1)
labels = dbsc.labels_

My df has 1999 rows.

I got the eps value at the dip using the method below; from the graph it is clear that eps=2.5


Below is the method to find the best eps value

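ns = 5
# Fit on the scaled data (df1 above) and query the ns nearest neighbors of each point
nbrs = NearestNeighbors(n_neighbors=ns).fit(df1)
distances, indices = nbrs.kneighbors(df1)
# Sort the distances to the ns-th neighbor in descending order
distanceDec = sorted(distances[:,ns-1], reverse=True)
# Plot the sorted distances against their rank
plt.plot(range(len(distanceDec)), distanceDec)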
  • How can the system find the dip in the graph automatically, so that it reports the best eps without my having to look at the graph?

Solution

  • If I understand correctly, you are looking for the precise y value of the inflection point appearing in your ε(x) plot (it should be around 2.0), right?

    If so, with ε(x) being your curve, the problem reduces to:

    1. Compute the second derivative of your curve: ε''(x).
    2. Find the zero (or zeroes) of such second derivative: x0.
    3. Recover the optimized ε value, just by plugging the zero into your curve: ε(x0).

    Here is my answer, based on these two other Stack Overflow answers: https://stackoverflow.com/a/26042315/10489040 (Compute derivative of an array) https://stackoverflow.com/a/3843124/10489040 (Find zero in array)

    import numpy as np
    import matplotlib.pyplot as plt
    
    # Generating x data range from -1 to 4 with a step of 0.01
    x = np.arange(-1, 4, 0.01)
    
    # Simulating y data with an inflection point as y(x) = x³ - 5x² + 2x
    y = x**3 - 5*x**2 + 2*x
    
    # Plotting your curve
    plt.plot(x, y, label="y(x)")
    
    # Computing y 1st derivative of your curve with a step of 0.01 and plotting it
    y_1prime = np.gradient(y, 0.01)
    plt.plot(x, y_1prime, label="y'(x)")
    
    # Computing y 2nd derivative of your curve with a step of 0.01 and plotting it
    y_2prime = np.gradient(y_1prime, 0.01)
    plt.plot(x, y_2prime, label="y''(x)")
    
    # Finding the index of the zero (or zeroes) of the second derivative
    x_zero_index = np.where(np.diff(np.sign(y_2prime)))[0]
    
    # Finding the x value of the first zero of the second derivative
    x_zero_value = x[x_zero_index][0]
    
    # Finding the y value of the curve at that x value
    y_zero_value = y[x_zero_index][0]
    
    # Reporting
    print(f'The inflection point of your curve is {y_zero_value:.3f}.')
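
The same three steps can be wrapped in a small helper that accepts any sampled curve, for example the sorted distance array from the question. This is only a sketch: the name `inflection_value` and the `spacing` parameter are hypothetical. On real, noisy k-distance data the second derivative may change sign many times, so the first crossing is not guaranteed to be a useful eps.

```python
import numpy as np

def inflection_value(curve, spacing=1.0):
    """Return the curve value at the first sign change of its second derivative.

    Hypothetical helper: returns None when no sign change is found.
    """
    y = np.asarray(curve, dtype=float)
    # Step 1: second derivative, obtained by applying np.gradient twice
    second = np.gradient(np.gradient(y, spacing), spacing)
    # Step 2: indices where the sign of the second derivative changes
    crossings = np.where(np.diff(np.sign(second)))[0]
    if crossings.size == 0:
        return None
    # Step 3: read the curve at the first crossing
    return float(y[crossings[0]])

# Same synthetic cubic as above; its inflection point is at x = 5/3
x = np.arange(-1, 4, 0.01)
cubic = x**3 - 5*x**2 + 2*x
value = inflection_value(cubic, spacing=0.01)

# A purely convex curve has no inflection point, so the helper returns None
convex = inflection_value(np.arange(0, 5, 0.1)**2, spacing=0.1)
print(value, convex)
```

For the k-distance curve you would pass `distanceDec` and keep the default `spacing=1.0`, since the x axis is just the rank of each point.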
    


    In any case, keep in mind that the inflection point (around 2.0) does not match with the "dip" point appearing around 2.5.
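
Because the inflection point and the visible "dip" can differ, a knee-detection heuristic may be closer to what you see in the graph. The sketch below is a different technique from the second-derivative method above: it picks the point of the sorted distance curve that lies farthest from the straight line joining its first and last points, after rescaling both axes to [0, 1]. The function name `knee_by_chord_distance` and the synthetic curve are assumptions for illustration only.

```python
import numpy as np

def knee_by_chord_distance(curve):
    """Return (index, value) of the point farthest from the first-to-last chord.

    Hypothetical helper; assumes at least 3 samples and a non-constant curve.
    """
    y = np.asarray(curve, dtype=float)
    if y.size < 3 or y.max() == y.min():
        raise ValueError("need at least 3 samples and a non-constant curve")
    x = np.arange(y.size, dtype=float)
    # Rescale both axes to [0, 1] so neither dominates the distance
    x_n = x / x[-1]
    y_n = (y - y.min()) / (y.max() - y.min())
    # Direction of the chord from the first to the last point
    dx = x_n[-1] - x_n[0]
    dy = y_n[-1] - y_n[0]
    # Perpendicular distance of every point to that chord
    dist = np.abs(dy * (x_n - x_n[0]) - dx * (y_n - y_n[0])) / np.hypot(dx, dy)
    idx = int(np.argmax(dist))
    return idx, float(y[idx])

# Synthetic descending curve: a steep drop from 10 to 2.5, then a slow decay to 1
synthetic = np.concatenate([
    np.linspace(10.0, 2.5, 20, endpoint=False),
    np.linspace(2.5, 1.0, 80),
])
knee_index, knee_eps = knee_by_chord_distance(synthetic)
print(knee_index, knee_eps)
```

With the data from the question you would call `knee_by_chord_distance(distanceDec)` and try the returned value as `eps`. Treat the result as a starting point to check against the plot rather than a guaranteed optimum.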