I have a plot where the for every x values there are 2 Y values. The data is also non-linear. The plot looks like this:
Now my question is I want to fit to regression curves separately to two of those curves (upper and lower). I know this is not a clear question as there is not unique identification scheme that I have in hand but I know the response system can behave in two different ways for a same input (or almost same input) at random.
The data file can be found here the headers used here 'sigma' and 'mabs_b'
Summary of full dataset:
summary(data)
# id sigma L_gal M_gal flux
# Min. : 1 Min. : 6.214 Min. :1.481e+06 Min. :1.541e+08 Min. : 10.4
# 1st Qu.: 5118 1st Qu.: 28.438 1st Qu.:1.814e+08 1st Qu.:1.290e+10 1st Qu.: 196.7
# Median :10236 Median : 41.542 Median :6.725e+08 Median :3.684e+10 Median : 388.0
# Mean :10236 Mean : 56.599 Mean :3.151e+09 Mean :3.663e+11 Mean : 2551.5
# 3rd Qu.:15354 3rd Qu.: 65.445 3rd Qu.:2.467e+09 3rd Qu.:1.410e+11 3rd Qu.: 1227.3
# Max. :20471 Max. :391.988 Max. :3.810e+11 Max. :2.960e+13 Max. :733660.0
# fluxmax mabs_b flag cstar
# Min. : 1.191 Min. :-24.25 Min. : 0.000 Min. :0.0001578
# 1st Qu.: 5.801 1st Qu.:-18.77 1st Qu.: 0.000 1st Qu.:3.0000000
# Median : 10.111 Median :-17.36 Median : 0.000 Median :3.0000000
# Mean : 39.649 Mean :-17.33 Mean : 1.217 Mean :2.5267219
# 3rd Qu.: 26.313 3rd Qu.:-15.94 3rd Qu.: 3.000 3rd Qu.:3.0000000
# Max. :6600.280 Max. :-10.72 Max. :51.000 Max. :3.0000000
Output of head(data,20)
:
subset_data = structure(list(id = 1:20, sigma = c(391.988, 379.985, 363.682,
358.969, 362.63, 344.544, 344.544, 331.482, 332.665, 302.539,
306.977, 287.416, 205.793, 303.279, 297.047, 273.719, 214.59,
268.891, 291.834, 191.926), L_gal = c(3.81e+11, 3.35e+11, 2.98e+11,
2.98e+11, 2.93e+11, 2.19e+11, 2.19e+11, 1.84e+11, 1.68e+11, 1.43e+11,
1.42e+11, 1.12e+11, 1.05e+11, 1.03e+11, 1.02e+11, 9.27e+10, 92017300000,
91078100000, 85536700000, 83359400000), M_gal = c(2.96e+13, 2.68e+13,
2.23e+13, 2.05e+13, 2.21e+13, 1.99e+13, 1.99e+13, 1.78e+13, 1.94e+13,
1.21e+13, 1.34e+13, 1.06e+13, 4.01e+12, 1.56e+13, 1.38e+13, 8.95e+12,
5.16e+12, 8.12e+12, 1.4e+13, 3.28e+12), flux = c(156286, 129987,
67801.2, 50110.3, 73118.6, 80827.2, 80827.2, 68568, 142348, 21194.6,
31081.9, 17414.4, 12121.3, 167441, 81709.3, 13920.7, 51775.8,
8185.93, 159998, 17393.7), fluxmax = c(6508.29, 4956.37, 2381.87,
2200.22, 2986.29, 2396.81, 2396.81, 2278.94, 4875.65, 854.856,
1264.36, 750.337, 19.7162, 6082.21, 724.639, 204.966, 281.601,
214.372, 6304.41, 182.002), mabs_b = c(-24.2475, -24.1079, -23.9807,
-23.9799, -23.9618, -23.6449, -23.6449, -23.4586, -23.3587, -23.1847,
-23.1745, -22.9178, -22.8463, -22.826, -22.8183, -22.7122, -22.7042,
-22.693, -22.6249, -22.5969), flag = c(35L, 0L, 0L, 0L, 3L, 2L,
2L, 2L, 3L, 2L, 0L, 2L, 35L, 2L, 3L, 35L, 2L, 2L, 0L, 2L), cstar = c(0.989659,
0.989581, 0.988048, 0.993796, 0.986398, 0.990529, 0.990529, 0.997505,
0.995231, 0.990121, 0.986176, 0.984495, 0.0007165, 0.987469,
0.0287568, 0.379966, 0.028632, 0.898742, 0.999391, 0.0286844)), .Names = c("id",
"sigma", "L_gal", "M_gal", "flux", "fluxmax", "mabs_b", "flag",
"cstar"), row.names = c(NA, 20L), class = "data.frame")
There are two clusters of points with a gap in between. The data does not contain an indicator. I assume you want to define, a priori, that points lying above the gap should be in one group and points lying below the gap should be in another group.
In that case, I'm afraid there is no getting around this without creating an indicator variable for the two groups yourself.
Luckily, this can be done using the locator() function. In this case, the function works because there's a clear gap between the two groups. It remains to use locator() to trace a line through the gap and check which sigma values lie above and below that traced line.
Once you have this indicator, you can use any fitting method you like...but that's a matter for a different post (perhaps on Cross Validated).
library('ggplot2')
d<-read.csv("SIS_TFFJ_all_sorted_R.csv")
uniq_sigma<-unique(d$sigma)
gap<-locator()
Here are the contents of gap from my trace:
> gap
$x
[1] -24.66446 -24.45990 -24.15305 -23.74391 -23.48820 -22.82336 -22.46536 -22.12442 -21.74938 -21.40843 -21.06749 -20.70950 -20.52198 -19.89123 -20.07875 -19.31162 -18.66382 -18.25469 -17.82851
[20] -17.07842 -16.39653 -15.64645 -15.08389 -14.24858 -13.44735 -12.40747 -11.19711
$y
[1] 346.67767 331.34710 315.20967 294.23100 277.28669 249.85305 229.68126 213.54382 194.17890 173.20024 159.48342 145.76660 136.08413 120.75357 123.98106 107.84362 92.51306 87.67183 79.60311
[20] 67.50003 58.62444 49.74885 44.10075 38.45265 30.38393 23.92896 17.47398
Now that I have an estimate of the line dividing the two groups, I can simply check which points are above and below the line.
d$x_pos<-cut(d$mabs_b, gap$x)
names(gap$y)<-unique(d$x_pos)
d$y_pos<-gap$y[d$x_pos]
d$cohort<-ifelse(d$sigma>d$y_pos,'upper','lower')
Finally, the plotting with model fitting using geom_smooth(). Again, what model you want to fit is a whole different question...one that is more suited for Cross Validated.
ggplot(data=d, aes(x=mabs_b, y=sigma, col=cohort, group=cohort))+geom_point()+geom_smooth(col='black')