So I have a dataframe of 5 columns with 9031 observations. Each record of a column is a the percentile rank of all the values in the column.
My goal is to find the highest percentile threshold to apply to all five columns in order to obtain n records.
So for example let's say my goal is to reduce to the number of records in my dataframe to 1000 records, what is the highest threshold I need to apply to all 5 columns in order to obtain 1000 records?
Variance_Five_Metrics$Zips medium medium.1 medium.2 medium.3 medium.4
1 00501 0.8395527 0.8671243 0.9894807 0.6567379 0.45875318
2 00544 0.8075518 0.8687853 0.9883734 0.6345920 0.45100210
3 01432 0.6124460 0.2460414 0.9701030 0.9103089 0.17107740
4 01434 0.7869560 0.3000775 0.9268077 0.8760935 0.11859152
5 01450 0.6642675 0.2451556 0.9919167 0.9585871 0.09600266
So let's say the above if the first five records (of 9031 records) of my dataframe. If I want to shrink my df to exactly 1000 records, what is the cutoff point I need to apply to all 5 columns in order to obtain exactly 1000 records where all 5 columns have values below this cutoff point?
I probably didn't explain this that well, but I'd appreciate any guidance
Thanks
If I understood the question right then you can achieve this with a simple row-wise minimum. Here is a line to obtain the cut-off value:
quantile(apply(MAT, 1, min), p=1-(100/nrow(MAT)))
You would need to replace "100" with your own number of wanted rows.
Short explanation below
So let's say you have a matrix of 5 columns:
MAT <- matrix(rnorm(10000), ncol=5)
You want to find a value that after cutoff on all columns together will leave you 100 rows:
n <- 100
In order to for whole row to be included all 5 of it's columns need to be above the cut-off value. This is the same as asking for the minimum value of that row to be above the cut-off.
MIN <- apply(MAT, 1, min)
Now we can find a percentile for cutoff.
p <- 1 - (n/length(MIN))
You want 100 values so the above is simply 1 - 100/1000 which is 0.9. So your cutoff will be >= 0.9 percentile.
get the percentile:
q <- quantile(MIN, probs=p)
Did it work?
sum(MAT[,1] > q & MAT[,2] > q & MAT[,3] > q & MAT[,4] > q & MAT[,5] > q)
[1] 100