machine-learning · statistics · regression · normal-distribution

Why is normality considered an important assumption for dependent and independent variables?


While going through a kernel on Kaggle about regression, I saw it mentioned that the data should look like a normal distribution, but I don't understand why. I know this question might be very basic, but please help me understand this concept.

Thanks in advance!


Solution

  • Regression models make a number of assumptions, one of which is normality. When this assumption is violated, the p-values and confidence intervals around your coefficient estimates can be wrong, leading to incorrect conclusions about the statistical significance of your predictors.

    However, a common misconception is that the data (i.e. the variables/predictors) need to be normally distributed, but this is not true: these models make no assumptions about the distribution of the predictors.

    For example, imagine a regression with a binary predictor (Male/Female, Slow/Fast, etc.): it is impossible for such a variable to be normally distributed, yet it is still perfectly valid to use in a regression model. The normality assumption actually refers to the distribution of the residuals, not the predictors themselves, as the sketch below illustrates.
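
As a minimal sketch of this idea (assuming NumPy, SciPy, and statsmodels are available; the data here is synthetic and the variable names are illustrative), the snippet below fits an OLS model with a decidedly non-normal binary predictor and then checks the residuals, which is where the normality assumption actually applies:

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(0)

    # Binary predictor (e.g. Male/Female coded 0/1) -- clearly not normal
    group = rng.integers(0, 2, size=200)
    # Continuous predictor drawn from a uniform -- also not normal
    x = rng.uniform(0, 10, size=200)

    # Response with normally distributed errors
    y = 2.0 + 1.5 * x + 3.0 * group + rng.normal(0, 1, size=200)

    # Fit OLS; the non-normal predictors are perfectly fine here
    X = sm.add_constant(np.column_stack([x, group]))
    model = sm.OLS(y, X).fit()

    # The normality assumption concerns these residuals
    resid = model.resid
    stat, p = stats.shapiro(resid)
    print(f"Shapiro-Wilk on residuals: W={stat:.3f}, p={p:.3f}")

    # A Q-Q plot of the residuals is the usual visual check:
    # sm.qqplot(resid, line="45", fit=True)

With errors generated from a normal distribution as above, the Shapiro-Wilk test should fail to reject normality of the residuals, even though neither predictor is remotely normal; it is the residual diagnostics, not histograms of the raw variables, that tell you whether this assumption holds.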