Search code examples
rfunctiondataframeminimax

Find highest threshold across 5 columns of dataframe to obtain n records


So I have a dataframe of 5 columns with 9031 observations. Each record of a column is a the percentile rank of all the values in the column.

My goal is to find the highest percentile threshold to apply to all five columns in order to obtain n records.

So for example let's say my goal is to reduce to the number of records in my dataframe to 1000 records, what is the highest threshold I need to apply to all 5 columns in order to obtain 1000 records?

  Variance_Five_Metrics$Zips    medium  medium.1  medium.2  medium.3   medium.4
  1                      00501 0.8395527 0.8671243 0.9894807 0.6567379 0.45875318
  2                      00544 0.8075518 0.8687853 0.9883734 0.6345920 0.45100210
  3                      01432 0.6124460 0.2460414 0.9701030 0.9103089 0.17107740
  4                      01434 0.7869560 0.3000775 0.9268077 0.8760935 0.11859152
  5                      01450 0.6642675 0.2451556 0.9919167 0.9585871 0.09600266

So let's say the above if the first five records (of 9031 records) of my dataframe. If I want to shrink my df to exactly 1000 records, what is the cutoff point I need to apply to all 5 columns in order to obtain exactly 1000 records where all 5 columns have values below this cutoff point?

I probably didn't explain this that well, but I'd appreciate any guidance

Thanks


Solution

  • If I understood the question right then you can achieve this with a simple row-wise minimum. Here is a line to obtain the cut-off value:

    quantile(apply(MAT, 1, min), p=1-(100/nrow(MAT)))
    

    You would need to replace "100" with your own number of wanted rows.


    Short explanation below

    So let's say you have a matrix of 5 columns:

    MAT <- matrix(rnorm(10000), ncol=5)
    

    You want to find a value that after cutoff on all columns together will leave you 100 rows:

    n <- 100
    

    In order to for whole row to be included all 5 of it's columns need to be above the cut-off value. This is the same as asking for the minimum value of that row to be above the cut-off.

    MIN <- apply(MAT, 1, min)
    

    Now we can find a percentile for cutoff.

    p <- 1 - (n/length(MIN))
    

    You want 100 values so the above is simply 1 - 100/1000 which is 0.9. So your cutoff will be >= 0.9 percentile.

    get the percentile:

    q <- quantile(MIN, probs=p)
    

    Did it work?

    sum(MAT[,1] > q & MAT[,2] > q & MAT[,3] > q & MAT[,4] > q & MAT[,5] > q)
    [1] 100