Add a regression line to ggscatter plot but ignore grouping

I am using ggscatter() in R to plot the Pearson correlation between two variables. However, when I color the points, a separate regression line is computed for each color. I want to color the points according to the column named 'mycolor', but have the regression line computed on the whole data set, regardless of color.

Here is the call I use, with and without color:

library(ggpubr)    # ggscatter()
library(magrittr)  # %>%

df <- structure(list(my_x = c(131L, 100L, NA, 125L, 50L, 50L, 16L, 
3L, 27L, 96L, 176L, 121L, 129L, 84L, 67L, 35L, 36L, 18L, 29L, 
29L, 26L, 25L, 24L, 20L, 28L, 22L, 25L, 15L, 0L, 18L, 13L, 17L, 
14L, 23L, 27L, NA, 6L, 1L, 7L, 1L, 20L, 30L, 16L, 22L, 23L, 22L, 
17L, 12L, 14L, 28L, 16L, 20L, 44L, 27L, 16L, 6L, 10L, 9L, 16L, 
2L, 43L, 6L, 2L, 0L, 1L, 1L, 1L, 1L, 2L, 1L, 47L, 22L, 7L, 3L, 
4L, 3L, 1L, 1L, 1L, 4L, 4L, 1L, 25L, 3L, 3L, 3L, 6L, 6L, 4L, 
1L, 2L, 2L, 5L, 8L, 3L, 5L, 1L, 1L, 1L, 2L, 3L, 6L, 6L, 4L, 8L, 
1L, 4L, 1L, 5L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 0L, 0L, 
2L, 0L, 1L, 2L, 3L, 3L, 4L, 4L, 3L, 2L, 3L, 1L, 2L, 1L), my_y = c(134L, 
90L, 130L, 134L, 44L, 48L, 17L, 4L, 19L, 97L, 178L, 39L, 132L, 
90L, 35L, 35L, 36L, 18L, 28L, 14L, 25L, 26L, 24L, 18L, 25L, 22L, 
9L, 15L, 0L, 21L, 6L, 15L, 15L, 21L, 27L, 19L, 7L, 0L, 8L, 2L, 
10L, 30L, 19L, 23L, 12L, 23L, 16L, 6L, 14L, 29L, 15L, 12L, 21L, 
14L, 11L, 7L, 5L, 4L, 16L, 5L, 36L, 5L, 2L, 0L, 1L, 1L, 1L, 1L, 
2L, 1L, 50L, 22L, 7L, 3L, 6L, 3L, 1L, 1L, 1L, 4L, 4L, 1L, 21L, 
3L, 3L, 3L, 6L, 7L, 4L, 1L, 2L, 2L, 1L, 6L, 3L, 2L, 1L, 1L, 2L, 
2L, 3L, 2L, 6L, 7L, 6L, 1L, 4L, 1L, 5L, 2L, 1L, 2L, 2L, 2L, 2L, 
1L, 2L, 2L, 1L, 0L, 0L, 2L, 0L, 1L, 2L, 3L, 2L, 4L, 4L, 3L, 2L, 
3L, 1L, 2L, 1L), mycolor = c("color1", "color1", "color1", 
"color1", "color1", "color1", "color1", "color1", "color1", 
"color1", "color1", "color1", "color1", "color1", "color1", 
"color2", "color2", "color2", "color2", "color2", "color2", 
"color2", "color2", "color2", "color2", "color2", "color2", 
"color2", "color2", "color2", "color2", "color2", "color2", 
"color2", "color2", "color2", "color2", "color7", 
"Turtle", "Turtle", "color2", "color2", "color2", "color2", 
"color2", "color2", "color2", "color2", "color2", "color2", 
"color2", "color2", "color2", "color2", "color2", "color2", 
"color2", "color2", "color2", "color3", "color4", 
"color4", "color4", "color4", "color4", 
"color4", "color4", "color4", "color4", 
"color4", "color4", "color4", "color5", 
"color5", "color5", "color5", "color5", 
"color5", "color5", "color5", "color5", 
"color5", "color5", "color5", "color5", 
"color5", "color5", "color6", "color6", "color6", "color6", 
"color6", "color6", "color6", "color6", "color6", "color6", "color6", "color6", 
"color6", "color6", "color6", "color6", "color6", "color6", "color6", "color6", 
"color6", "color6", "color6", "color6", "color6", "color6", "color6", "color6", 
"color6", "color6", "color6", "color6", "color6", "color6", "color6", "color6", 
"color6", "color6", "color6", "color6", "color6", "color6", "color6", "color6", 
"color6", "color6", "color6", "color6")), class = "data.frame", row.names = c(NA, 
-135L))
df %>%
  ggscatter(., y="my_y", x="my_x",
            color="mycolor",
            add = "reg.line", conf.int = TRUE, 
            cor.coef = TRUE, cor.method = "pearson")


df %>%
  ggscatter(., y="my_y", x="my_x",
            add = "reg.line", conf.int = TRUE, 
            cor.coef = TRUE, cor.method = "pearson")

The two results:

Taking the example above, I basically want the plot on the left, but with its regression lines replaced by the single regression line from the plot on the right.

Is there any way to do this with ggscatter(), or should I use ggplot2's geom_point() and add the regression line myself?

Thanks for any help!

Maxime

Solution

I do not see much advantage in using ggscatter() instead of ggplot(), so I add here an answer that does not use 'ggpubr'. Pearson's r is the correlation that corresponds to OLS (ordinary least squares) linear regression, and it does not depend on which variable is the explanatory one and which is the response. The R² value from lm() is the same as the square of the r from cor.test(). In contrast, the fitted line does depend on which variable is mapped to the x aesthetic and which to y. Depending on the variables, ordinary linear regression may not be a good approach, and major axis regression may be preferable. If the variable mapped to x is measured without error or with minimal error, or can be considered the cause of the response, then linear regression using lm() as the method is the correct approach. However, if both variables are subject to random variation, lm() will produce different fitted lines depending on which of the two variables is arbitrarily mapped to x and which to y.
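As a quick numerical check of the points above, here is a minimal sketch in base R using the built-in mtcars data. The variable names are only illustrative.

```r
# Pearson's r from cor.test(); it is the same whichever variable comes first
r_a <- unname(cor.test(mtcars$mpg, mtcars$hp)$estimate)
r_b <- unname(cor.test(mtcars$hp, mtcars$mpg)$estimate)

# OLS fits in both directions
fit_hp_on_mpg <- lm(hp ~ mpg, data = mtcars)
fit_mpg_on_hp <- lm(mpg ~ hp, data = mtcars)

# R^2 is the same for both fits and equals r^2
r2_a <- summary(fit_hp_on_mpg)$r.squared
r2_b <- summary(fit_mpg_on_hp)$r.squared

# The fitted lines differ. Both slopes are expressed here as d(hp)/d(mpg),
# so the second fit's slope is inverted before comparing.
slope_hp_on_mpg <- unname(coef(fit_hp_on_mpg)["mpg"])
slope_mpg_on_hp <- 1 / unname(coef(fit_mpg_on_hp)["hp"])

all.equal(r_a, r_b)
all.equal(r2_a, r_a^2)
all.equal(r2_a, r2_b)
c(hp_on_mpg = slope_hp_on_mpg, mpg_on_hp = slope_mpg_on_hp)
```

The two slopes coincide only when the correlation is perfect; otherwise their ratio equals r².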

In the first example I show the same example as in the answer by @stefan, but using the grammar of graphics to construct the plot. I use statistics from 'ggplot2' and from 'ggpmisc'. What do we gain? 1) We can have the colour mapping only in the plot layer that needs it, geom_point(), without overriding it later. 2) If we wish, we can rewrite the code with a different order of the layers, say, to plot the scatter on top of the regression line. 3) We gain a lot in flexibility, because we can easily mix and match layer functions (geoms and stats from different packages extending 'ggplot2'). Once one understands that we are adding layers to the plot one by one, and that the aesthetic mapping in the call to ggplot() sets only the default for all layers, the intent of the code is clear. The code remains concise.
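Applied to the column names from the question, the layering idea can be sketched with 'ggplot2' alone, using geom_smooth() in place of the 'ggpmisc' statistics used in the examples further down. The data frame toy is a made-up stand-in for the question's df, not the real data.

```r
library(ggplot2)

# Hypothetical stand-in for the question's data frame (same column names)
toy <- data.frame(
  my_x = c(1, 3, 5, 7, 9, 11),
  my_y = c(2, 3, 7, 8, 12, 11),
  mycolor = rep(c("color1", "color2"), each = 3)
)

p <- ggplot(toy, aes(x = my_x, y = my_y)) +
  geom_point(aes(color = mycolor)) +           # colour is mapped in this layer only
  geom_smooth(method = "lm", formula = y ~ x)  # so one line is fitted to all rows

p
```

Because the colour mapping is not in the call to ggplot(), the smoothing layer does not inherit it and the data are not split into groups for the fit. Moving color = mycolor into ggplot(aes(...)) would bring back one line per colour.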

In the second example I use a different data set, mpg from 'ggplot2', and plot fuel economy (miles per gallon) in highway versus city traffic, as an example of a case where linear regression is unsuitable and some variation of major axis regression is preferable.
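To illustrate why the choice matters for these two variables, the sketch below computes a major axis slope by hand from the leading eigenvector of the covariance matrix and compares it with the two OLS slopes. The helper ma_slope() is written here only for illustration; it is not a function from 'ggpmisc' and is not necessarily how stat_ma_line() computes its fit. It assumes 'ggplot2' is installed, since it uses that package's mpg data.

```r
# 'ggplot2' is needed only for its mpg data set
x <- ggplot2::mpg$cty
y <- ggplot2::mpg$hwy

# OLS slopes in both directions, both expressed as d(hwy)/d(cty)
ols_y_on_x <- unname(coef(lm(y ~ x))[2])
ols_x_on_y <- 1 / unname(coef(lm(x ~ y))[2])

# Illustrative helper: major axis slope from the leading eigenvector
# of the 2 x 2 covariance matrix (the direction of largest variance)
ma_slope <- function(x, y) {
  v <- eigen(cov(cbind(x, y)))$vectors[, 1]
  v[2] / v[1]
}

ma_xy <- ma_slope(x, y)      # hwy against cty
ma_yx <- 1 / ma_slope(y, x)  # cty against hwy, inverted to the same scale

c(ols_y_on_x = ols_y_on_x, major_axis = ma_xy, ols_x_on_y = ols_x_on_y)
all.equal(ma_xy, ma_yx)
```

The major axis line is the same whichever variable is treated as x, whereas the two OLS lines differ and bracket it.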

These examples make use of features from 'ggpmisc' (>= 0.5.0), and will not work with earlier versions.

library(ggplot2)
library(ggpmisc)
#> Loading required package: ggpp
#> 
#> Attaching package: 'ggpp'
#> The following object is masked from 'package:ggplot2':
#> 
#>     annotate

# y depends on x
ggplot(mtcars, aes(y=hp, x=mpg)) +
  geom_point(aes(color=factor(cyl))) +
  stat_correlation(use_label(c("R", "P"))) +
  stat_poly_line()


# both x and y depend on some common factors not plotted
ggplot(mpg, aes(y=hwy, x=cty)) +
  geom_point(aes(color=factor(cyl))) +
  stat_correlation(use_label(c("R", "P"))) +
  stat_ma_line()

Created on 2022-08-21 by the reprex package (v2.0.1)

For simplicity, I kept the default theme_gray(), but adding + theme_classic() at the end of the examples above will make the plots look like those in the question. Alternatively, theme_set(theme_classic()) can be used to change the default theme for the current R session.

In both examples, the correlation annotation shows the same values as in the question. Other labels are also available, including confidence intervals for r, and rank correlations are supported as well. 'ggpmisc' also provides statistics for annotating plots with the equations of the fitted models.