Search code examples
rmatchingcumulative-frequency

Draw nearest value from sorted data frame into unsorted data frame


I have two data frames in R. The first data frame is a cumulative frequency distribution (cumFreqDist) with associated periods. The first rows of the data frame look like this:

Time        cumfreq
0         0.0000000
4         0.9009009
6         1.8018018
8         7.5075075
12       23.4234234
16       39.6396396
18       53.4534535
20       58.2582583
24       75.3753754
100     100.0000000

The second data frame is 10000 draws from a runif distribution, using the code:

 testData <- (runif(10000))*100

For each row in testData, I want to locate the corresponding cumfreq in cumFreqDist and add the corresponding Time value into a new column in testData. Because testData is a test data frame standing in for a real data frame, I do not wish to sort testData.

Because I am dealing with cumulative frequencies, if the testData value is 23.30... the Time value that should be returned is 8. That is, I need to locate the nearest cumfreq value that does not exceed the testData value, and return only that one value.

The data.table package has been mentioned for other similar questions, but my limited understanding is that this package requires a key to be identified in both data frames (after conversion to data tables) and I can't assume that the testData values meet the requirements for being assigned as a key - and it appears that assigning a key will sort the data. This will cause me issues when I set a seed later in further work I am doing.


Solution

  • findInterval() is perfect for this:

    set.seed(1);
    cumFreqDist <- data.frame(Time=c(0,4,6,8,12,16,18,20,24,100), cumfreq=c(0.0000000,0.9009009,1.8018018,7.5075075,23.4234234,39.6396396,53.4534535,58.2582583,75.3753754,100.0000000) );
    testData <- data.frame(x=runif(10000)*100);
    testData$Time <- cumFreqDist$Time[findInterval(testData$x,cumFreqDist$cumfreq)];
    head(testData,20);
    ##            x Time
    ## 1  26.550866   12
    ## 2  37.212390   12
    ## 3  57.285336   18
    ## 4  90.820779   24
    ## 5  20.168193    8
    ## 6  89.838968   24
    ## 7  94.467527   24
    ## 8  66.079779   20
    ## 9  62.911404   20
    ## 10  6.178627    6
    ## 11 20.597457    8
    ## 12 17.655675    8
    ## 13 68.702285   20
    ## 14 38.410372   12
    ## 15 76.984142   24
    ## 16 49.769924   16
    ## 17 71.761851   20
    ## 18 99.190609   24
    ## 19 38.003518   12
    ## 20 77.744522   24