Tags: python, machine-learning, logistic-regression

Including features when implementing a logistic regression model


For some context, I am trying to determine whether one company record matches another. I've engineered features on the data such as name match, address match, domain match, etc.

From there, I've also created another feature based on a methodology that combines name match, address match, and domain match, with weights set by intuition about what I deem more important in determining a match. Let's call this feature 'final score'. This score gives me a rough estimate of whether two records are a match.

Now comes the part where I implemented a logistic regression. I built the model with my engineered features both WITHOUT 'final score' and WITH 'final score', and the results were quite similar.

Note: I did check the feature importances, and 'final score' was highly important.

My question is when training a model, is it good practice to include 'final score' as a feature for the logistic regression model?


Solution

  • In general, you do not want highly correlated features in linear and logistic regression type models. It often has little effect on predictive performance, but it affects the interpretation of your model.

    This problem is known as multicollinearity: it causes unstable (high-variance) estimates of the parameters (coefficients).

    You can look at this answer for an explanation of its cause.

    I can provide an intuitive example, where it can cause trouble:

    Y = P(football player scores a goal in a match)
    Feature vector = [weight, height]  # height and weight are highly correlated
    

    Then the model learned could be:

    log(P(goal) / (1 - P(goal))) = 0.55*weight - 0.12*height + bias
    
    # how would you interpret the negative coefficient of height now?
    
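To make the instability concrete, here is a small sketch (not from the original answer; it uses ordinary least squares as a simple stand-in for the logistic model, since the effect of collinearity on the coefficients is the same in kind). With two nearly collinear features, the individual coefficients swing from resample to resample, while the combination the data actually pins down (their sum) stays stable:

```python
import numpy as np

# Sketch: refit on 50 resampled datasets and compare coefficient
# variability. The data are synthetic; least squares stands in for
# the logistic fit to keep the demo short.
rng = np.random.default_rng(0)

coef_weight, coef_sum = [], []
for _ in range(50):
    height = rng.normal(0.0, 1.0, 200)
    weight = height + 0.05 * rng.normal(0.0, 1.0, 200)  # nearly collinear
    y = height + weight + rng.normal(0.0, 1.0, 200)     # true effect is on the sum
    X = np.column_stack([weight, height])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    coef_weight.append(beta[0])
    coef_sum.append(beta[0] + beta[1])

print(np.std(coef_weight))  # large: the individual coefficient is unstable
print(np.std(coef_sum))     # small: the sum is well determined by the data
```

This is why the sign of an individual coefficient (like the negative height coefficient above) can be meaningless under collinearity, even when the model's predictions are fine.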

    There are ways (e.g. regularization) to deal with this, and there are also situations where such correlated features can be safely used.
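As an illustration of the regularization route (again a sketch, not part of the original answer): an L2 (ridge) penalty adds a constant to the eigenvalues of XᵀX, which lifts exactly the near-zero eigenvalue that collinearity creates, so the coefficients stop swinging between resamples. The penalty strength here is an arbitrary illustrative choice:

```python
import numpy as np

# Sketch: ridge regression solves (X'X + lam*I) beta = X'y. The lam*I
# term stabilizes the near-singular direction caused by collinearity.
# lam = 10.0 is an arbitrary value for illustration, not a recommendation.
rng = np.random.default_rng(1)

def ridge_fit(X, y, lam=0.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

coef_plain, coef_ridge = [], []
for _ in range(50):
    h = rng.normal(0.0, 1.0, 200)
    w = h + 0.05 * rng.normal(0.0, 1.0, 200)    # nearly collinear pair
    y = h + w + rng.normal(0.0, 1.0, 200)
    X = np.column_stack([w, h])
    coef_plain.append(ridge_fit(X, y)[0])        # lam=0: ordinary least squares
    coef_ridge.append(ridge_fit(X, y, 10.0)[0])  # penalized fit

print(np.std(coef_plain))  # unstable across resamples
print(np.std(coef_ridge))  # much more stable
```

Note that scikit-learn's LogisticRegression applies an L2 penalty by default (its C parameter is the inverse penalty strength), so if you fit your matching model with its defaults you are already getting some of this stabilization.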