Search code examples
rdatasetextractmining

How to extract specific intervals of the dataset?


I have a big dataset (More than 200 thousand lines and a thousand of trials) with four columns with follow information:

  • "Trials": a sequential number)
  • "Is.Check": a column with 1 for checks and 0 for non-checks
  • "ID": a column with the name of the check or line (non-checks)
  • "Reaction": the response variable

I need to know which Lines (Identified with "L" + Number") stay between the biggest and smaller value of checks, in each trial. For example, in trial 10001, I have 5 checks, the biggest is Check 3 (Reaction = 56) and the smaller Check 1 (Reaction = 50). I need to know which lines stay between these two values i.e. (bigger or equal 50 and small or equal 56). A small dataset was attached for an example. In this, in trial 10001 all lines should be select, except "L3". On the other hand, in trial 10002 the smaller Check is Check-1 (50) and the biggest is Check-7 (60), thus, only L3 and L9 should be selected. The Checks and the Reaction Values change in each trial, thus, I need to extract the lines with the values between the biggest and smaller checks in each Trial.

Trials  Is.Check     ID       Reaction
10001   1            Check-1  50
10001   0            L1       50
10001   0            L2       50
10001   0            L10      50
10001   0            L9       50
10001   0            L6       50
10001   0            L3       48
10001   0            L4       50
10001   0            L8       50
10001   1            Check-5  52
10001   0            L7       50
10001   1            Check-2  52
10001   1            Check-4  54
10001   0            L5       52
10001   1            Check-3  56
10002   1            Check-1  50
10002   0            L1       48
10002   0            L2       48
10002   0            L3       54
10002   0            L4       64
10002   0            L5       64
10002   0            L6       62
10002   0            L7       62
10002   0            L8       70
10002   0            L9       52
10002   1            Check-7  60
10002   1            Check-2  54
10002   1            Check-6  56
10002   1            Check-3  54

Solution

  • As a first step, you want to split the data.frame, per Trial:

    splitted <- split(dataset, dataset$Trials)
    

    Then, for each of those elements, we select all non-check elements, that obey 2 conditions: >= the smallest check, and <= the largest one. And we need to do this for each of the sub-data.frames in splitted, so we use lapply, to apply our function to each of the sub-data.frames:

    selection <- lapply(splitted, function(subdf) {
      subdf[subdf$Is.Check==0 &
            subdf$Reaction >= min(subdf$Reaction[subdf$Is.Check==1]) &
            subdf$Reaction <= max(subdf$Reaction[subdf$Is.Check==1])
           ,]
    

    })

    Selection is now a list with for each trial a selection of the right non-checks. If you want to recombine those selections into one large data.frame you can use bind_rows from the dplyr-package

    install.packages('dplyr') # If you don't have it yet
    Fullselection <- dplyr::bind_rows(selection)
    

    If for some reason you can't install dplyr, do.call(rbind, selection) also works (but a bit slower, and it's uglier code.)