python numpy matplotlib scipy logistic-regression

Parameters of a sigmoid regression in Python + scipy

I have a Python array of dates recording the occurrences of a phenomenon: each date appears once for every occurrence in that year. The array contains 200 distinct dates, each repeated as many times as the phenomenon occurred. I managed to compute and plot the cumulative sum with matplotlib using the following snippet:

import numpy as np
import matplotlib.pyplot as plt

counts = np.arange(len(list_of_dates))
# Add the cumulative sum to the plot (list_of_dates contains repetitions)
plt.plot(list_of_dates, counts, linewidth=3.0)

Cumulative sum (in blue) per date
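For reference, the same cumulative curve can also be built from one point per distinct date instead of one point per repetition. This is only a sketch on made-up data: the small `list_of_dates` array below is a hypothetical stand-in for the real one, and `years`, `occurrences` and `cumulative` are names introduced here for illustration.

```python
import numpy as np

# Hypothetical stand-in for list_of_dates: each year repeated once per occurrence.
list_of_dates = np.array([1990, 1990, 1991, 1993, 1993, 1993, 1994])

# Distinct years (sorted) and how many times each one appears.
years, occurrences = np.unique(list_of_dates, return_counts=True)

# Running total of occurrences up to and including each year.
cumulative = np.cumsum(occurrences)

print(years)       # distinct years
print(cumulative)  # cumulative number of occurrences per year
```

Plotting `cumulative` against `years` should trace the same shape as plotting a running index against the repeated dates, with far fewer points.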

The blue curve is the cumulative sum; the other colors mark the parameters I would like to obtain. To get those parameters I need a mathematical representation of the blue curve. I know that curves of this type can be fitted using logistic regression, but I do not understand how to do this in Python.

First I tried LogisticRegression from scikit-learn, but then I realized it is meant for machine learning classification (and similar tasks), which is not what I want.
Then I thought I could go directly to the definition of the logistic function and build it myself. I found this thread, which recommends scipy.special.expit for calculating the curve. Since the function is already implemented, I decided to use it:

target_vector = list(dictionary.values())
y = expit(target_vector)
plt.plot(list_of_dates, y, linewidth=3.0)

I got back a vector with 209 elements (the same length as target_vector) that looks like this: [ 1. 0.98201379 0.95257413 0.73105858 ... 0.98201379 1. ]. However, the plot looks like a child's scribble, not a nice sigmoid curve like the one in the picture.
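A note on why this output is not a fitted curve: expit does no fitting at all. It evaluates 1 / (1 + exp(-v)) for each element independently, so the result follows the ups and downs of the input values. The numbers above are consistent with that, since 0.98201379, 0.95257413 and 0.73105858 are what expit returns for 4, 3 and 1. A small sketch, where the `target_vector` values are hypothetical:

```python
import numpy as np
from scipy.special import expit

# Hypothetical per-date occurrence counts standing in for target_vector.
target_vector = np.array([4.0, 3.0, 1.0, 0.0, 2.0])

# expit applies the logistic function to each element on its own.
transformed = expit(target_vector)

# The same thing written out by hand.
by_hand = 1.0 / (1.0 + np.exp(-target_vector))

print(transformed)
```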

I also checked other Stack Overflow threads (this, this), but I guess the thing I need to do is just a toy example compared to them. I only need the math formula to calculate some quick and simple parameters.

Is there a way of doing this and getting the mathematical representation of the sigmoidal function?

Thank you very much!

Solution

Using this post and the comments posted yesterday, I came up with the following code:

from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import normalize # Added this new line

# This is how I normalized the vector. "ydata" looked like this:
# original_ydata = [ 1, 3, 8, 14, 12, 27, 33, 36, 87, 136, 77, 57, 32, 31, 28, 24, 12, 2 ]
# The curve was NOT fitting with these values, so I found a function in
# scikit-learn that normalizes (multidimensional) arrays: normalize

# m = []
# m.append(original_ydata)
# ydata = normalize(m, norm='l2') * 10

# Why 10? normalize maps my original values into a range of roughly
# [0.00014, ..., 0.002]. With those values "curve_fit" couldn't find
# anything but a horizontal line at y = 1.
# I tried multiplying by 5, 6, ..., 12, and found that 10 is the largest
# factor that keeps the maximum of my array below 1.00 (it becomes 0.97599).

# Length of both arrays is 209
# Y-axis data has been normalized BUT then multiplied by 10
ydata = np.array([  5.09124776e-04,   1.01824955e-03, ... , 9.75992196e-01])
xdata = np.arange(len(ydata))

def sigmoid(x, x0, k):
    y = 1 / (1+ np.exp(-k*(x-x0)))
    return y

popt, pcov = curve_fit(sigmoid, xdata, ydata)

x = np.linspace(0, 250, 250)
y = sigmoid(x, *popt)

plt.plot(xdata, ydata, 'o', label='data')
plt.plot(x,y, linewidth=3.0, label='fit')
plt.ylim(0, 1.25)
plt.legend(loc='best')

# Not sure where the (m, b, C) parameters are... in popt? In pcov?
# y = C * sigmoid(m*x + b)

This program creates the plot you can see below. It is a fair fit, but I guess that if I change the definition of y in the sigmoid function by adding a factor C in place of the leading 1, I would probably get a better one. Still working on that.
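The idea of adding a factor C can be sketched as follows. This is an illustration on synthetic, noise-free data, not a fit to the real series: `scaled_sigmoid`, the generating values 620, 85 and 0.065, and the way the starting guess `p0` is chosen are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaled_sigmoid(x, C, x0, k):
    # C is the upper plateau, x0 the midpoint, k the steepness.
    return C / (1.0 + np.exp(-k * (x - x0)))

# Synthetic data generated from known parameters (hypothetical values).
xdata = np.arange(209, dtype=float)
ydata = scaled_sigmoid(xdata, 620.0, 85.0, 0.065)

# Rough starting values read off the data:
# plateau ~ largest y, midpoint ~ x where y is closest to half the plateau.
x_half = xdata[np.argmin(np.abs(ydata - ydata.max() / 2))]
p0 = [ydata.max(), x_half, 0.1]

popt, pcov = curve_fit(scaled_sigmoid, xdata, ydata, p0=p0)
C_fit, x0_fit, k_fit = popt
print(C_fit, x0_fit, k_fit)
```

Because the plateau is a free parameter here, the data do not have to be rescaled into the 0 to 1 range first. A starting guess on the same scale as the data usually matters more for this function than it does for the two-parameter version.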

Sigmoid curve fitting

It seems that normalizing the data (as suggested by Ben Kuhn in the comments) is a required step; otherwise no curve is produced. However, if the values are normalized to very small numbers (close to zero), the curve is not drawn either. So I multiplied the normalized vector by 10 to scale it up, and then the program simply found the curve. I can't explain why, as I am a total novice at this. Please note that this is only my personal experience; I am not saying it is a rule.
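Two facts may help explain the scaling behaviour. First, the two-parameter sigmoid used above always levels off at 1, so it can only match data whose plateau is also close to 1. Second, when no `p0` is given, curve_fit starts every parameter at 1. A simpler alternative to normalize-then-multiply is to divide the series by its largest value and pass an explicit starting guess. The sketch below uses synthetic data, and the generating values and the choice of `p0` are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, x0, k):
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

# Synthetic cumulative series on an arbitrary scale (hypothetical values).
xdata = np.arange(209, dtype=float)
raw = 620.0 * sigmoid(xdata, 85.0, 0.065)

# Rescale so the largest value is exactly 1, matching the sigmoid's plateau.
ydata = raw / raw.max()

# Explicit starting guess: midpoint near the middle of x, gentle slope.
p0 = [xdata.mean(), 0.1]
popt, pcov = curve_fit(sigmoid, xdata, ydata, p0=p0)
x0_fit, k_fit = popt
print(x0_fit, k_fit)
```

To go back to the original units, multiply the fitted curve by the same value the data were divided by (`raw.max()` in this sketch).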

If I print popt and pcov I obtain:

#> print popt
[  8.56332788e+01   6.53678132e-02]

#> print pcov
[[  1.65450283e-01   1.27146184e-07]
 [  1.27146184e-07   2.34426866e-06]]

The documentation for curve_fit says that popt contains the "Optimal values for the parameters so that the sum of the squared error is minimized" and that pcov is the estimated covariance of those parameters.

Are any of those 6 values the parameters that characterize the sigmoid curve? If so, the question is very close to being solved! :-)
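On how to read those values: curve_fit returns popt in the same order as the parameters of the fitted function after x, so for sigmoid(x, x0, k) the two entries of popt are x0 (the midpoint) and k (the steepness). The four entries of pcov are not curve parameters. They describe the uncertainty of the estimates, and the square roots of the diagonal give one-standard-deviation errors. A short sketch using the numbers printed above, where `x0_err`, `k_err` and `max_slope` are names introduced for illustration:

```python
import numpy as np

# Values printed above.
popt = np.array([8.56332788e+01, 6.53678132e-02])
pcov = np.array([[1.65450283e-01, 1.27146184e-07],
                 [1.27146184e-07, 2.34426866e-06]])

# popt follows the argument order of sigmoid(x, x0, k).
x0, k = popt

# One-standard-deviation uncertainties from the diagonal of pcov.
x0_err, k_err = np.sqrt(np.diag(pcov))

# For y = 1 / (1 + exp(-k * (x - x0))), the slope at the midpoint x = x0 is k / 4.
max_slope = k / 4.0

print("x0 = %.2f +/- %.2f" % (x0, x0_err))
print("k  = %.4f +/- %.4f" % (k, k_err))
```

So the fitted curve here is y = 1 / (1 + exp(-0.0654 * (x - 85.63))), in the scaled units of ydata.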

Thanks a lot!