Search code examples
pythonscikit-learnscipystatisticspoisson

How to implement Poisson regression with scikit-learn for count data prediction


I'm looking for a way to couple Linear Regression with Poisson distribution. After a simple Linear Regression, its result is a numerical value that i would like to use in a Poisson Distribution, because i need the probability and i need the probability discrete for the number of events that occur successively and independently in a given time interval.

MY CASE IS: Having an independent variable Team_A__goal_scored = [1, 2, 2, 3, 1] and a dependent variable Team_B__goal_conceded = [2, 1, 3, 0, 1], i would like to calculate the probability that Team A can score 2 goals exact to Team B (so relate Team A's attack and Team B's defense). Running the regression and setting 2 goals as output with test = myfunc(2), i get 1.28:

import numpy
from scipy import stats

Team_B__goal_conceded = [1, 2, 2, 3, 1] #Var indipendent (x)
Team_A__goal_scored = [2, 1, 3, 0, 1] ##Var dipendent (y)

slope, intercept, r, p, std_err = stats.linregress(Team_B__goal_conceded, Team_A__goal_scored)

def myfunc(Team_B__goal_conceded):
   return slope * Team_B__goal_conceded + intercept

regression_scored_2 = myfunc(2) #result: 1.28

PROBLEM: The specific problem is that i need a probability, instead by performing a simple Linear Regression i obtain for example 1.28 which is not a probability.

To recap, I know the initial starting point with Team_B__goal_conceded and Team_A__goal_scored, i know what i want to achieve (probability of Team A scoring exactly 2 goals to Team B), but i don't know how i can achieve it. Having reflected that Linear Regression is not for me because i don't get a percentage, I thought that "maybe" a solution could be the Poisson Regression (also called Log-Linear Regression) could be useful to me, but I'm having trouble using it. I'm not sure if it is the solution to what I'm looking for, so I'm looking for someone who knows how to use it and tell me if it is useful for what I would like to achieve and who tells me how to use it. If this is not the case, then surely there are other ways to use the result of Linear Regression in a Poisson distribution.

EXPECTATIONS: The final output i want to get are the individual probabilities of a soccer team scoring 0 goals, 1 goal, 2 goals, 3 goals, etc, after running the regression. To be precise, i only need exactly 2 goals, so I would like this final output:

The probability that Team A scores exactly 2 goals against Team B is: x %

MY CODE: I tried to use sklearn.linear_model.PoissonRegressor, but it doesn't work properly. As I said above, I don't even know if this is the most suitable solution for my case or if there are better and more appropriate methods. If this is not the case, then surely there are other ways to use the result of Linear Regression in a Poisson distribution.

from sklearn import linear_model
clf = linear_model.PoissonRegressor()
Team_A__goal_scored = [1, 2, 2, 3, 1] #Var indipendent (x)
Team_B__goal_conceded = [2, 1, 3, 0, 1] #Var dipendent (y)

a=clf.fit(Team_A__goal_scored, Team_B__goal_conceded)
print("Fit a Generalized Linear Model: ", a, "\n")

b = clf.score(Team_A__goal_scored, Team_B__goal_conceded)
print("Compute D^2, the percentage of deviance explained: ", b, "\n")

c = clf.coef_
print("Estimated coefficients for the linear predictor: ", c, "\n")

d= clf.intercept_
print("Intercept (a.k.a. bias) added to linear predictor: ", d, "\n")

e=clf.predict([[1, 1], [3, 4]])
print("Predict using GLM with feature matrix X: ", e, "\n")

I get this error:

ValueError: Expected 2D array, got 1D array instead:
array=[1. 2. 2. 3. 1.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

because should the lists be like (as the scikit-learn link shows) for example: X = [[1, 2], [2, 3], [3, 4], [4, 3]] and y = [12, 17, 22, 21], but i didn't understand how i could use mine in this way.

I also don't know how to set that the result must be exactly 2 goals (probability of Team A scoring exactly 2 goals to Team B)

Thank you


Solution

  • Your approach with PoissonRegressor is suitable for your problem, but it expects a 2D array for the features (X). You can reshape your 1D array using numpy.reshape:

    import numpy as np
    from sklearn.linear_model import PoissonRegressor
    
    Team_A__goal_scored = np.array([1, 2, 2, 3, 1]).reshape(-1, 1)
    Team_B__goal_conceded = np.array([2, 1, 3, 0, 1])
    

    Then you can fit the model:

    clf = PoissonRegressor()
    clf.fit(Team_A__goal_scored, Team_B__goal_conceded)
    

    Then you can make the prediction (e.g. Team_A__goal_scored = 2):

    lambda_pred = clf.predict(np.array([[2]]))[0]
    

    Then you can use the Poisson probability mass function to find the probability of scoring exactly 2 goals:

    from scipy.stats import poisson
    
    probability_two_goals = poisson.pmf(2, lambda_pred)
    

    This gives you the probability that Team A scores exactly 2 goals against Team B.