
How to detect when a condition occurs 3 times in a row, with priorities?


I have created the following minimal, reproducible example to simulate my problem.

I have a list of tibbles, 50 in this example, and I would like to classify them into four different categories.

To classify them I want to prioritize the categories from 4 down to 1.

If the value 4 appears but not 3 times in a row, I want to check whether the next occurrence of 4 does. If no run of 4s along the whole sequence is long enough, I want to do the same for the value 3, and so on.

The problem with my code is that when the first run of 4s does not have the expected length, it goes on to check the value 3; if that check succeeds, I lose the chance to classify as four a tibble that may have other runs of 4s with the expected length at later indexes.

I have used the rle() function to get the values and the number of times they appear consecutively.
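
For readers following along in Python, a rough counterpart of R's rle() can be sketched with itertools.groupby from the standard library. The function name run_lengths is made up for this illustration and is not part of any library.

```python
from itertools import groupby

def run_lengths(seq):
    """Return (values, lengths) of consecutive runs, similar in spirit to R's rle()."""
    values, lengths = [], []
    for value, group in groupby(seq):
        values.append(value)
        # groupby yields an iterator per run, so count its items
        lengths.append(sum(1 for _ in group))
    return values, lengths

values, lengths = run_lengths([4, 4, 3, 3, 3, 3, 1])
print(values)   # [4, 3, 1]
print(lengths)  # [2, 4, 1]
```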

I know that nested for loops are not the best solution, and there are probably easier ways to solve this problem without them and without rle(). A solution using Python would be helpful as well!

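library(tibble)

# 50 random sequences of 800 integers in 1..4
valueA=replicate(50, tibble(floor(runif(800,min=1, max=5))))
# Run-length encode each sequence: $values and $lengths
valueB=list(list())
for (i in seq_along(valueA)){
  valueB[[i]]=rle(valueA[[i]])
}

# Note: this variable name shadows base::cat()
cat=""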
for (i in seq_along(valueB)){
  for (j in seq(valueB[[i]][[1]])){
    if (valueB[[i]]$values[j] == 4){
      if (valueB[[i]]$lengths[j] > 3){
        cat[i] = "four"
      }
    } else if (valueB[[i]]$values[j] == 3){
      if (valueB[[i]]$lengths[j] > 3){
        cat[i] = "three"
      }
    }else if (valueB[[i]]$values[j] == 2){
      if (valueB[[i]]$lengths[j] > 3){
        cat[i] = "two"
      }
    } else if (valueB[[i]]$values[j] == 1){
      if (valueB[[i]]$lengths[j] > 3){
        cat[i] = "one"
      }
    }
  }
}

To clarify the problem I show the results I got in this case:

> cat
 [1] "two"   "two"   "one"   "three" "one"   "three" "two"   "two"   "four"  "two"   "three" "three" "three" "two"   "three"
[16] "four"  "two"   "four"  "four"  "two"   "two"   "four"  "three" "two"   "three" "two"   "two"   "one"   "three" "four" 
[31] "four"  "three" "one"   "one"   "three" "one"   "two"   "one"   "four"  "two"   "one"   "four"  "one"   "two"   "two"  
[46] "three" "three" "three" "four"  "three"

For the first tibble it reports category two, but looking at the tibble itself:

valueA[[1]]
  [1] 4 2 3 1 3 1 2 4 4 3 3 2 2 3 4 3 3 3 1 3 4 4 4 2 1 2 4 1 1 1 2 1 4 4 3 3 4 3 1 3 4 2 4 2 1 2 4 1 2 4 2 1 1 2 4 1 1 4 2 3 3
 [62] 2 3 1 2 3 1 4 3 2 1 3 1 4 2 3 3 2 3 1 1 3 4 2 3 1 1 1 4 4 1 4 2 4 4 1 4 1 1 4 1 4 3 3 4 2 4 2 1 1 2 1 4 1 3 1 3 2 3 2 4 2
[123] 3 2 4 1 4 3 1 2 3 2 1 2 3 1 4 4 2 1 4 4 1 3 1 4 1 4 3 2 1 3 4 4 1 2 2 1 1 1 1 3 1 3 2 3 2 2 1 3 2 1 1 2 3 4 2 3 4 2 1 3 2
[184] 4 2 1 1 1 2 1 3 3 2 3 2 2 1 1 1 1 3 1 1 2 4 1 4 1 4 2 3 2 1 2 3 3 2 4 3 2 3 1 3 3 2 1 3 3 2 4 4 4 4 2 3 2 2 2 2 3 4 3 2 3
[245] 3 3 1 4 4 1 4 4 2 2 3 2 2 2 2 1 4 1 2 2 3 3 1 1 4 2 2 3 1 3 1 3 2 2 1 3 4 1 2 3 3 1 1 1 2 3 1 3 4 4 4 2 4 3 2 2 3 4 4 1 3
[306] 1 2 3 3 3 3 4 1 1 3 2 3 2 4 1 2 1 4 1 1 2 2 4 3 3 1 1 3 3 4 2 3 4 2 1 3 4 2 3 3 1 2 1 4 2 3 2 1 2 3 3 1 4 2 1 2 2 1 2 3 1
[367] 4 1 3 1 2 2 1 3 1 1 2 3 1 4 3 3 1 1 3 1 1 3 4 3 4 4 3 3 4 1 2 1 3 2 4 3 1 2 4 4 4 1 3 2 3 2 2 3 3 3 2 4 4 4 3 3 2 3 3 2 1
[428] 3 3 1 2 2 3 2 2 3 4 3 3 4 2 3 4 3 1 2 2 3 3 3 4 2 3 3 3 1 4 3 4 3 2 2 4 4 3 4 2 2 1 3 4 2 1 2 3 2 1 4 1 3 2 2 4 4 3 2 2 4
[489] 3 3 4 3 3 4 3 2 4 4 1 3 4 4 1 1 2 2 4 4 4 4 4 2 4 2 3 2 3 3 4 3 2 4 4 3 4 3 4 2 2 3 3 2 4 3 4 2 1 4 1 4 2 1 1 1 4 1 4 4 3
[550] 4 2 4 1 4 1 1 1 3 2 4 1 3 1 3 3 4 1 2 3 2 1 1 3 4 2 2 3 4 4 1 3 3 2 4 4 4 2 1 2 2 2 4 1 1 1 2 3 1 2 1 3 1 3 4 2 4 4 3 3 4
[611] 2 1 2 2 3 2 2 1 4 4 4 4 4 3 2 3 4 2 4 1 2 1 3 1 2 3 1 2 4 3 1 4 3 4 2 3 3 3 2 3 4 2 4 2 3 3 1 2 1 2 3 4 3 2 2 3 4 1 4 3 2
[672] 1 2 3 4 3 1 1 1 2 2 3 3 3 3 2 2 3 1 1 4 4 3 3 1 1 4 1 1 4 3 1 3 1 2 1 2 2 2 1 3 3 3 1 3 2 1 4 1 1 3 3 1 4 2 2 3 4 4 3 4 2
[733] 4 1 3 2 1 1 4 2 2 3 3 4 1 2 3 1 2 2 2 1 2 2 2 4 1 2 1 2 3 3 4 2 1 1 3 2 3 2 2 4 1 4 1 4 4 1 1 1 3 2 4 1 2 4 2 2 2 2 3 4 4
[794] 4 1 4 2 1 3 3

Starting at position 619 I can see more than three 4s in a row, so the real category of my first tibble has to be four.
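
The misclassification can be reproduced on a toy sequence. In the loop shown in the question, cat[i] is overwritten every time any run longer than three is found, so the final label reflects the last qualifying run rather than the highest-priority one. In valueA[[1]] the last such run appears to be the 2 2 2 2 near the end (around position 787), which would explain the reported "two". The sketch below mimics that behaviour in Python; the function name last_match_wins and the min_run parameter are hypothetical, with min_run=4 standing in for lengths > 3.

```python
from itertools import groupby

def last_match_wins(seq, min_run=4):
    """Mimic the loop in the question: every qualifying run overwrites the label."""
    label = None
    for value, group in groupby(seq):
        if sum(1 for _ in group) >= min_run:
            label = value  # overwritten again by any later qualifying run
    return label

# Toy data: a long run of 4s, followed later by a long run of 2s.
toy = [1, 4, 4, 4, 4, 4, 3, 2, 2, 2, 2, 1]
print(last_match_wins(toy))  # 2, although a qualifying run of 4s exists
```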


Solution

  • I finally found a solution, even though it is not elegant at all.

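    library(tibble)

    valueA=replicate(50, tibble(floor(runif(800,min=1, max=5))))
    valueB=list(list())
    for (i in seq_along(valueA)){
      valueB[[i]]=rle(valueA[[i]])
    }
    
    # One NA per tibble, so unclassified tibbles stay NA (name shadows base::cat())
    cat=rep(NA_real_, length(valueB))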
    for (i in seq_along(valueB)){
      for (j in seq(valueB[[i]][[1]])){
        if ((valueB[[i]]$values[j] == 4)&(valueB[[i]]$lengths[j] > 3)){
            cat[i] = 4
            break
        }
      }
    }
    
    for (i in seq_along(valueB)){
      for (j in seq(valueB[[i]][[1]])){
        if ((valueB[[i]]$values[j] == 3)&(valueB[[i]]$lengths[j] > 3)&(is.na(cat[i]))){
            cat[i] = 3
            break
        }
      }
    }
    
    for (i in seq_along(valueB)){
      for (j in seq(valueB[[i]][[1]])){
        if ((valueB[[i]]$values[j] == 2)&(valueB[[i]]$lengths[j] > 3)&(is.na(cat[i]))){
            cat[i] = 2
            break
        }
      }
    }
    
    for (i in seq_along(valueB)){
      for (j in seq(valueB[[i]][[1]])){
        if ((valueB[[i]]$values[j] == 1)&(valueB[[i]]$lengths[j] > 3)&(is.na(cat[i]))){
            cat[i] = 1
            break
        }
      }
    }
    

    Instead of using else if inside the for loops, I use four separate for loops, each with a break at the end of its if block. This avoids scanning the remaining runs once the condition has been met. The cat vector starts out as NA, so once a higher-priority category has been assigned, the is.na() check keeps a lower-priority loop from reclassifying that tibble.

    I still had to use the rle() function, as in the original post.
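
Since the question also asked about Python, the same priority idea can be sketched in a single pass: collect every value that has a long enough run, then pick the first hit in priority order. This is only an illustration under stated assumptions, not a translation of the R code above. The names classify, priorities and min_run are hypothetical, and min_run=4 mirrors the lengths > 3 test.

```python
from itertools import groupby
import random

def classify(seq, priorities=(4, 3, 2, 1), min_run=4):
    """Return the highest-priority value with a run of at least min_run, else None."""
    # Collect every value that shows up in a sufficiently long run.
    qualifying = {
        value
        for value, group in groupby(seq)
        if sum(1 for _ in group) >= min_run
    }
    # Walk the priorities from most to least important; the first hit wins.
    for value in priorities:
        if value in qualifying:
            return value
    return None

toy = [1, 4, 4, 4, 4, 4, 3, 2, 2, 2, 2, 1]
print(classify(toy))  # 4

# 50 random sequences of 800 values in 1..4, loosely mirroring the R setup.
random.seed(0)
data = [[random.randint(1, 4) for _ in range(800)] for _ in range(50)]
categories = [classify(seq) for seq in data]
```

If "three times in a row" is meant literally, passing min_run=3 changes the threshold without touching the rest of the logic.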