I am currently trying to run a linear model on a large data set, but am running into issues with some specific variables.
pv_model <- lm(SalePrice ~ MSSubClass + LotConfig + GarageArea + LotFrontage, data = train)
summary(pv_model)
Here is code for my regression. SalePrice, MSSubClass, GarageArea, and LotFrontage are all numeric fields, while LotConfig is a factored variable.
Here is the output of my pv_model:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 98154.64 17235.51 5.695 1.75e-08 ***
MSSubClass 50.05 58.38 0.857 0.391539
LotConfigCulDSac 69949.50 12740.62 5.490 5.42e-08 ***
LotConfigFR2 19998.34 14592.31 1.370 0.170932
LotConfigFR3 21390.99 34126.44 0.627 0.530962
LotConfigInside 21666.04 5597.33 3.871 0.000118 ***
GarageArea 175.67 10.96 16.035 < 2e-16 ***
LotFrontage101 42571.20 42664.89 0.998 0.318682
LotFrontage102 26051.49 35876.54 0.726 0.467968
LotFrontage103 36528.81 35967.56 1.016 0.310131
LotFrontage104 218129.42 58129.56 3.752 0.000188 ***
LotFrontage105 61737.12 27618.21 2.235 0.025673 *
LotFrontage106 40806.22 58159.42 0.702 0.483120
LotFrontage107 36744.69 29494.94 1.246 0.213211
LotFrontage108 71537.30 42565.91 1.681 0.093234 .
LotFrontage109 -29193.02 42528.98 -0.686 0.492647
LotFrontage110 73589.28 27706.92 2.656 0.008068 **
As you can see, the first variables operate correctly. Both the factored and numeric fields respond appropriately. That is, until it gets to LotFrontage. For whatever reason, the model runs the regression on every single level of LotFrontage.
For reference, LotFrontage describes the square footage of the subject's front yard. I have properly cleaned the data and replaced NA values. I really am at a loss for why this particular column is acting so unusually.
Any help is greatly appreciated.
If I download the data from the kaggle link or use a github link and do:
train = read.csv("train.csv")
class(x$LotFrontage)
[1] "integer"
pv_model <- lm(SalePrice ~ MSSubClass + LotConfig + GarageArea + LotFrontage,
data = train)
summary(pv_model)
Call:
lm(formula = SalePrice ~ MSSubClass + LotConfig + GarageArea +
LotFrontage, data = train)
Residuals:
Min 1Q Median 3Q Max
-380310 -33812 -4418 24345 487970
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11915.866 9455.677 1.260 0.20785
MSSubClass 105.699 45.345 2.331 0.01992 *
LotConfigCulDSac 81789.113 10547.120 7.755 1.89e-14 ***
LotConfigFR2 17736.355 11787.227 1.505 0.13266
LotConfigFR3 17649.409 31418.281 0.562 0.57439
LotConfigInside 13073.201 5002.092 2.614 0.00907 **
GarageArea 208.708 8.725 23.920 < 2e-16 ***
LotFrontage 722.380 88.294 8.182 7.12e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Suggest that you read in the csv again like above.