I have two vectors of correlations: one which represents real correlations, and the other permuted correlations (the null distribution). I want to find the correlation value that corresponds to a FDR of 0.05.
Updated approach:
cor_real=rnorm(1000,0,sd=0.2)
cor_null=rnorm(1000,0,sd=0.15)
d_real=density(cor_real,from=max(min(cor_real),min(cor_null)),to=min(max(cor_real),max(cor_null)))
d_null=density(cor_null,from=max(min(cor_real),min(cor_null)),to=min(max(cor_real),max(cor_null)))
# here we ensure that the x values are comparable between the two densities
plot(d_real)
lines(d_null)
Then, to find the correlation value that corresponds to FDR = 0.05, my guess would be:
ratios=d_null$y/d_real$y
d_real$x[which(round(ratios,2)==.05)]
[1] 0.5694628 0.5716372 0.5868581 0.5890325 0.5912069
# this the the correlation value(s) that corresponds to a 5% chance of a false positive
Is this the right approach?
E.g.:
cor_real=rnorm(100,0.25,sd=0.1)
cor_null=rnorm(100,0.2,sd=0.1)
h_real=hist(cor_real,plot=F)
h_null=hist(cor_null,plot=F)
plot(h_null,col=rgb(1,0,0,.5),xlim=c(0,1),ylim=c(0,max(h_real$counts))) # in red
plot(h_real,col=rgb(0,.5,.5,0.25),add=T) # in blue
I think this is when the ratio of the frequencies of the two histograms = 0.05 (null:real), but I'm not 100% sure about that.
How might I find the correlation value that corresponds to FDR = 0.05, having "access" to both the null and real distributions?
Density is not quite correct because 1. you did not set n
and from, to
to be the same, 2. it calculates the number of false positive / number of false negative at only 1 bin.
False discovery rate is defined as FP / (FP + TP). See this post too. We can calculate this once we put the two correlations in the same vector, label and order them:
set.seed(321)
cor_real=rnorm(1000,0,sd=0.2)
cor_null=rnorm(1000,0,sd=0.15)
df = data.frame(rho = c(cor_real,cor_null),
type = rep(c(TRUE,FALSE),each=1000))
df$rho = abs(df$rho)
df = df[order(df$rho,decreasing=TRUE),]
df$FP = cumsum(df$type == FALSE)
df$TP = cumsum(df$type == TRUE)
df$FDR = df$FP / (df$FP + df$TP)
If you look at the results,
head(df,25)
rho type FP TP FDR
366 0.5822139 TRUE 0 1 0.00000000
247 0.5632078 TRUE 0 2 0.00000000
298 0.5594879 TRUE 0 3 0.00000000
147 0.5460875 TRUE 0 4 0.00000000
781 0.5373146 TRUE 0 5 0.00000000
760 0.5367116 TRUE 0 6 0.00000000
797 0.5216281 TRUE 0 7 0.00000000
569 0.5204598 TRUE 0 8 0.00000000
374 0.5200687 TRUE 0 9 0.00000000
744 0.5101275 TRUE 0 10 0.00000000
864 0.5058457 TRUE 0 11 0.00000000
227 0.4997959 TRUE 0 12 0.00000000
66 0.4993164 TRUE 0 13 0.00000000
14 0.4886520 TRUE 0 14 0.00000000
830 0.4840573 TRUE 0 15 0.00000000
261 0.4765394 TRUE 0 16 0.00000000
1163 0.4703764 FALSE 1 16 0.05882353
27 0.4661862 TRUE 1 17 0.05555556
965 0.4633883 TRUE 1 18 0.05263158
530 0.4608271 TRUE 1 19 0.05000000
96 0.4575683 TRUE 1 20 0.04761905
851 0.4563224 TRUE 1 21 0.04545455
922 0.4516161 TRUE 1 22 0.04347826
343 0.4511517 TRUE 1 23 0.04166667
At abs(rho) >= 0.4511517, you have 1 FP and 23 TP, giving you an FDR of 0.0416.. which is below the FDR of 0.05. So you can set your absolute cutoff here.
The example you have is pretty hard to test because both are almost the same null hypothesis with only a different sd. In real life most likely we would need to simulate data to find the correlation we obtain under the null hypothesis. And you will see that the calculation above should work pretty well.