r  for-loop  if-statement  grouping  binning

In R, how do I classify each row of a data frame based on the bin its values fall into?


In R, I want to classify each row of a data frame by binning its values and using the number of values in each bin to assign the row to one of 2 groups (classes) with if-else logic.

  • Within an R for-loop, I used the R cut and split commands to bin the values by row.
  • The bins (ranges) are: 1..9, 10..19, 20..29, 30..39, 40..49.
  • If a row contains 1 pair of values falling in the same bin (range), say 10..19, then it should be classified as "P". If it contains 2 pairs falling into 2 different bins (ranges), then it should be classified as "PP".
  • Then I created 2 new variables named p and pp by using hard-coded conditions/rules. Each variable is either TRUE or FALSE, depending on whether the i-th row meets the corresponding rule.
  • Finally, I used p and pp as conditions in the if-else statement to assign each row to either class P (1st row), or class PP (2nd row).

First, I created a data frame x:

n1 <- c(1, 7); n2 <- c(2, 11); n3 <- c(10, 14); n4 <- c(23, 32); n5 <- c(37, 37); n6 <- c(45, 41)
x <- data.frame(n1, n2, n3, n4, n5, n6)
x
  n1 n2 n3 n4 n5 n6
1  1  2 10 23 37 45
2  7 11 14 32 37 41

The 1st row should be classified as "P", because it has 1 pair of values (1, 2) falling in the same bin, 1..9.
The 2nd row should be classified as "PP", because it has 2 pairs of values (11, 14 and 32, 37) falling in 2 bins: 10..19 and 30..39, respectively.
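
As a quick check of these expected classes, the number of values per bin can be tabulated for each row with cut and table. This is only a minimal sketch: bin_counts is a hypothetical helper name, and the breaks assume the five bins listed above, with each interval closed on the right.

```r
# Same data as above, rebuilt so the snippet is self-contained.
x <- data.frame(n1 = c(1, 7), n2 = c(2, 11), n3 = c(10, 14),
                n4 = c(23, 32), n5 = c(37, 37), n6 = c(45, 41))

# Breaks give the intervals (0,9], (9,19], (19,29], (29,39], (39,49].
breaks <- c(0, 9, 19, 29, 39, 49)

# bin_counts (hypothetical helper): how many values of one row fall in each bin.
bin_counts <- function(values) as.vector(table(cut(values, breaks)))

bin_counts(as.numeric(x[1, ]))  # 2 1 1 1 1 -> one pair, class "P"
bin_counts(as.numeric(x[2, ]))  # 1 2 0 2 1 -> two pairs, class "PP"
```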

So, after creating the data frame x, I created a for-loop:

for(i in nrow(x)){

# binning the data:
  bins <- split(as.numeric(x[i, ]), cut(as.numeric(x[i, ]), c(0, 9, 19, 29, 39, 49)))
  # creating the rule for p (1 pair of numbers falling in the same range)
  p <- (sum(lengths(bins) == 2) == 1 & sum(lengths(bins) == 1) == 4)
  # creating the rule for pp (2 different pairs, each has 2 numbers falling in the same range)
  pp <- (sum(lengths(bins) == 2) == 2 & sum(lengths(bins) == 1) == 2 & sum(lengths(bins) == 0) == 1)

  if(p){
    x$types <- "P"
  } else if(pp){
    x$types <- "PP"
  } else{
    stop("error")
  }
  }

print(x)

I want to create a new column named types, holding the class P or PP:

  n1 n2 n3 n4 n5 n6 types
1  1  2 10 23 37 45 P
2  7 11 14 32 37 41 PP

Instead, the code returned PP for both rows:

  n1 n2 n3 n4 n5 n6 types
1  1  2 10 23 37 45 PP
2  7 11 14 32 37 41 PP

I think this is because the loop runs twice over the rows; if it runs only once, all the rows are classified as "P" instead of "PP". I expect the cause is something very simple, but I have not been able to figure it out so far.


Solution

  • The error in your for loop is that you don't use i when you assign types. x$types <- "P" sets the entire types column to "P", and x$types <- "PP" sets the whole types column to "PP". So, whatever the last result is, that will be the value for your entire column.

    Also, using the full row x[i, ] is dangerous after you add the types column. Presumably you don't want to try to convert the "P" and "PP" values of types to numeric and bin them. I would suggest making types a separate vector, and only adding it as a column after the loop. Before the loop: types <- character(nrow(x)). Inside the loop: types[i] <- instead of x$types <-. After the loop: x$types <- types.

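    You are also making the classic mistake of writing for (i in nrow(x)) when you mean for (i in 1:nrow(x)). This is valid syntax, not a syntax error, but nrow(x) is a single number, so the loop body runs only once, with i equal to the last row number.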
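
    To see the difference concretely, here is a minimal sketch that records which values of i each loop visits. The names visited_wrong and visited_right are made up for this illustration. It uses seq_len(nrow(x)), which gives the same sequence as 1:nrow(x) here but is empty for a data frame with zero rows, whereas 1:0 counts down from 1 to 0.

    ```r
    # A small data frame with two rows, as in the question.
    x <- data.frame(n1 = c(1, 7), n2 = c(2, 11))

    # nrow(x) is the single number 2, so this loop body runs once, with i = 2.
    visited_wrong <- integer(0)
    for (i in nrow(x)) visited_wrong <- c(visited_wrong, i)

    # seq_len(nrow(x)) is the vector 1, 2, so every row is visited.
    visited_right <- integer(0)
    for (i in seq_len(nrow(x))) visited_right <- c(visited_right, i)

    visited_wrong  # 2
    visited_right  # 1 2
    ```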

    Fixing all of these:

    n1 <- c(1, 7); n2 <- c(2, 11); n3 <- c(10, 14); n4 <- c(23, 32); n5 <- c(37, 37); n6 <- c(45, 41)
    x <- data.frame(n1, n2, n3, n4, n5, n6)
    
    types <- character(nrow(x))
    
    for(i in 1:nrow(x)){
      # binning the data:
      bins <- split(as.numeric(x[i, ]), cut(as.numeric(x[i, ]), c(0, 9, 19, 29, 39, 49)))
      # creating the rule for p (1 pair of numbers falling in the same range)
      p <- (sum(lengths(bins) == 2) == 1 & sum(lengths(bins) == 1) == 4)
      # creating the rule for pp (2 different pairs, each has 2 numbers falling in the same range)
      pp <- (sum(lengths(bins) == 2) == 2 & sum(lengths(bins) == 1) == 2 & sum(lengths(bins) == 0) == 1)
    
      if(p){
        types[i] <- "P"
      } else if(pp){
        types[i] <- "PP"
      } else{
        stop("error")
      }
    }
    
    x$types <- types
    x
    #   n1 n2 n3 n4 n5 n6 types
    # 1  1  2 10 23 37 45     P
    # 2  7 11 14 32 37 41    PP
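
    If you would rather avoid the explicit loop, the same rules can be wrapped in a function and applied to each row. This is just one possible sketch, not the only way to do it: classify_row and value_cols are hypothetical names, and returning NA for rows that match neither rule (instead of calling stop) is an assumption you may want to change.

    ```r
    x <- data.frame(n1 = c(1, 7), n2 = c(2, 11), n3 = c(10, 14),
                    n4 = c(23, 32), n5 = c(37, 37), n6 = c(45, 41))

    # classify_row (hypothetical helper): same P / PP rules as the loop above.
    classify_row <- function(values, breaks = c(0, 9, 19, 29, 39, 49)) {
      counts <- table(cut(values, breaks))
      is_p  <- sum(counts == 2) == 1 && sum(counts == 1) == 4
      is_pp <- sum(counts == 2) == 2 && sum(counts == 1) == 2 &&
        sum(counts == 0) == 1
      if (is_p) {
        "P"
      } else if (is_pp) {
        "PP"
      } else {
        NA_character_  # assumption: flag unmatched rows instead of stopping
      }
    }

    # Select the numeric columns by name so that a types column added
    # earlier is never binned by accident.
    value_cols <- paste0("n", 1:6)
    x$types <- apply(as.matrix(x[value_cols]), 1, classify_row)
    x$types  # "P" "PP"
    ```

    Selecting the value columns by name addresses the same concern as keeping types in a separate vector: the binning only ever sees the numeric data.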