First, the code I used:
ggplot(correlation, aes(x = area_ha, y = extent_2000_ha)) +
  geom_point(color = "green") +
  theme_ipsum() +
  theme(text = element_text(family = "Times New Roman", size = 14)) +
  scale_y_continuous(labels = function(n) format(n, scientific = FALSE)) +
  scale_y_continuous(labels = scales::comma) +
  geom_smooth(method = lm, color = "red", se = FALSE)
When I include the linear trend and its confidence interval (attached graphs), negative values appear on the y-axis (down to -200,000). Note that all values in the data are positive; there are no negative values.
If it only makes sense for your regression line to be strictly positive, then a standard linear regression is simply not the right model for your data. Ordinary least squares finds the line that minimizes the sum of squared vertical distances between your data points and the line. It does not care whether this makes the line (or its confidence band) negative where you think it shouldn't be. That is an extra constraint you need to build into your model, and it depends on the context and physical interpretation of your data (which we can only guess at from the information in your question).
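To see this numerically, here is a minimal sketch on toy data (the vectors `x` and `y` are made up for illustration and are not the asker's data). Every observed value is positive, yet the least-squares line dips below zero at the low end:

```r
# Toy data (hypothetical): every y value is strictly positive.
x <- 1:10
y <- x^2

# Ordinary least-squares fit of a straight line.
fit <- lm(y ~ x)

coef(fit)         # intercept is -22, slope is 11
min(y)            # smallest observed value: 1
min(fitted(fit))  # smallest fitted value: -11, below zero
```

Nothing in the fitting procedure knows that negative values are impossible, so the line is free to cross zero.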
For example, you could consider a linear regression with a fixed intercept of 0:
ggplot(correlation, aes(x = area_ha, y = extent_2000_ha)) +
  geom_point(color = "green4", alpha = 0.2) +
  theme_ipsum() +
  theme(text = element_text(family = "Times New Roman", size = 14)) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(labels = scales::comma, limits = c(0, 8e5)) +
  geom_smooth(method = "lm", color = "red3", formula = y ~ x + 0,
              fullrange = TRUE, alpha = 0.2)
Or perhaps a generalised linear model with a log-link function:
ggplot(correlation, aes(x = area_ha, y = extent_2000_ha)) +
  geom_point(color = "green4", alpha = 0.2) +
  theme_ipsum() +
  theme(text = element_text(family = "Times New Roman", size = 14)) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(labels = scales::comma, limits = c(0, 8e5)) +
  geom_smooth(method = "glm", color = "red", method.args = list(family = poisson),
              fullrange = TRUE)
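If you want the fitted models themselves rather than just the plotted lines, the same two fits can be run outside ggplot. The following is only a sketch on simulated counts (the data frame `toy` and its coefficients are invented for illustration); it shows that the through-origin fit predicts exactly 0 at x = 0 and that the log-link fit always predicts positive values on the response scale:

```r
# Simulated toy data (hypothetical): Poisson counts with a log-linear mean.
set.seed(1)
toy <- data.frame(x = seq(1, 100, length.out = 50))
toy$y <- rpois(50, lambda = exp(1 + 0.03 * toy$x))

# Linear regression forced through the origin (no intercept term).
fit_origin <- lm(y ~ x + 0, data = toy)

# Poisson GLM; the default link for this family is the log link.
fit_log <- glm(y ~ x, family = poisson, data = toy)

new_x <- data.frame(x = c(0, 50, 100))

# Through-origin line: the prediction at x = 0 is exactly 0.
pred_origin <- predict(fit_origin, newdata = new_x)

# Log link: predictions on the response scale are exp(linear predictor),
# so they are always greater than 0.
pred_log <- predict(fit_log, newdata = new_x, type = "response")

pred_origin
pred_log
```

Note that the through-origin line only stays non-negative for positive x when its slope is positive, whereas the log link guarantees positive predictions by construction.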
As for which model is best, that is a statistics question rather than a programming question. It is therefore off-topic here, but may be on-topic at Cross Validated.
Data used
The question did not include any data, so I created a similar dataset with the following code to produce the examples above:
library(ggplot2)
library(hrbrthemes)
set.seed(2)
correlation <- data.frame(area_ha = runif(100, 1, 8e5))
correlation$extent_2000_ha <- (correlation$area_ha +
                                 rnorm(100, 0, correlation$area_ha))^2 / 9e6
correlation <- correlation[correlation$extent_2000_ha > 1e4, ]
correlation <- correlation[correlation$extent_2000_ha < 1e6, ]
correlation <- round(correlation)
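As a rough check against the behaviour described in the question, the sketch below rebuilds the same simulated dataset (without the plotting packages), fits the plain linear model, and compares the observed range with the lower confidence limit over the plotted x range. The names `fit`, `grid` and `band` are introduced here for illustration only:

```r
# Rebuild the simulated dataset from above (base R only).
set.seed(2)
correlation <- data.frame(area_ha = runif(100, 1, 8e5))
correlation$extent_2000_ha <- (correlation$area_ha +
                                 rnorm(100, 0, correlation$area_ha))^2 / 9e6
correlation <- correlation[correlation$extent_2000_ha > 1e4, ]
correlation <- correlation[correlation$extent_2000_ha < 1e6, ]
correlation <- round(correlation)

# Plain linear regression, as used by geom_smooth(method = "lm").
fit <- lm(extent_2000_ha ~ area_ha, data = correlation)

# Confidence band evaluated on a grid covering the plotted x range.
grid <- data.frame(area_ha = seq(0, 8e5, length.out = 101))
band <- predict(fit, newdata = grid, interval = "confidence")

range(correlation$extent_2000_ha)  # observed values, all positive
range(band[, "lwr"])               # lower confidence limit of the fitted line
```

If the second range extends below zero while the first does not, that is the same effect the asker saw on the y-axis.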