I am running a regression analysis to find the best-fitting model for the diamonds dataset (diamonds.csv) from ggplot2. I model price (the response variable) against carat, and I have tried linear, quadratic, and cubic regression, but none of them fits well. In Excel, the logarithmic trendline gives the best fit. However, I couldn't figure out how to fit that logarithmic line in R. Can anyone help?
model<-lm(price~carat, data = diamonds)
model2<-lm(price~carat + I(carat^2), data = diamonds)
model3 <- lm(price~carat + I(carat^2) + I(carat^3), data = diamonds)
How can I code the log in R to get the same result as Excel?
y = 0.4299ln(x) - 2.5495 R² = 0.8468
Thanks!
The result you report from Excel, y = 0.4299ln(x) - 2.5495, does not contain any quadratic or cubic terms. What are you trying to do? price is very skewed and, as with, say, income, it is common practice to take the log of it. Doing so also gives the R² you are referring to, but very different coefficients for the intercept and the carat parameter.
m1 <- lm(log(price) ~ carat, data = diamonds)
summary(m1)
Call:
lm(formula = log(price) ~ carat, data = diamonds)

Residuals:
    Min      1Q  Median      3Q     Max
-6.2844 -0.2449  0.0335  0.2578  1.5642

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.215021   0.003348    1856   <2e-16 ***
carat       1.969757   0.003608     546   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3972 on 53938 degrees of freedom
Multiple R-squared:  0.8468,  Adjusted R-squared:  0.8468
F-statistic: 2.981e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
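A note on the different coefficients: in a simple regression, the slope of y on x multiplied by the slope of x on y equals R². Here 1.9698 × 0.4299 ≈ 0.8468, which suggests (I can't confirm it without seeing the spreadsheet) that Excel had price on the x-axis and carat on the y-axis, so its trendline is carat = a·ln(price) + b. A minimal sketch under that assumption follows; the names m_excel and m_logx are just placeholders I picked:

```r
# Assumption: in Excel, x was price and y was carat, so the logarithmic
# trendline corresponds to carat = a * ln(price) + b.
data(diamonds, package = "ggplot2")

# Regress carat on log(price); log() is the natural log in R, like LN() in Excel
m_excel <- lm(carat ~ log(price), data = diamonds)
coef(m_excel)                 # compare with Excel's -2.5495 and 0.4299
summary(m_excel)$r.squared    # compare with Excel's 0.8468

# If price is meant to stay the response, a logarithmic term in carat
# is written the same way, directly inside the formula:
m_logx <- lm(price ~ log(carat), data = diamonds)
summary(m_logx)$r.squared
```

If the first model matches Excel, the axes were simply swapped there. Which of these models is appropriate depends on whether you want to predict price from carat or the other way round.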