
Subsetting data using strings that contain greater-than or less-than signs


I would like to generate all non-duplicate combinations (row-wise and column-wise) of strings that contain comparison instructions such as greater than and less than (other mathematical signs may be added later).

How can I do it? Please see the example below, which includes a partial solution. The partial solution, however, loses the ">" and "<" signs. What I need is the variable name (a to e in this example) together with the sign used for subsetting, depending on whether the variable is less than or greater than 0.

The comb object holds the variables together with the desired sign for subsetting.

comb <- data.frame(in1=c("a > 0","b > 0","c > 0","d > 0","e > 0"),
                   in2=c("a < 0","b < 0","c < 0","d < 0","e < 0"))

comb.vars <- with(comb, expand.grid(in1, in2, stringsAsFactors=FALSE))
comb.vars <- rbind(data.frame(Var3="y > 0", comb.vars),
                   data.frame(Var3="y < 0", comb.vars))
comb.vars

This does not give the desired outcome, because the same variable can appear with opposing signs within one row. For example, row 1 gives y > 0, a > 0, a < 0 and row 7 gives y > 0, b > 0, b < 0.
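
To see the scale of the problem, the conflicting rows can be counted directly. This is only a minimal sketch: it assumes the variable name is everything before the first space, and it rebuilds comb.vars so that it runs on its own. The names var1, var2 and conflict are introduced here for illustration.

```r
# Rebuild comb.vars as above so the snippet is self-contained
comb <- data.frame(in1 = paste(letters[1:5], "> 0"),
                   in2 = paste(letters[1:5], "< 0"),
                   stringsAsFactors = FALSE)
comb.vars <- with(comb, expand.grid(in1, in2, stringsAsFactors = FALSE))
comb.vars <- rbind(data.frame(Var3 = "y > 0", comb.vars, stringsAsFactors = FALSE),
                   data.frame(Var3 = "y < 0", comb.vars, stringsAsFactors = FALSE))

# Assumption: the variable name is the text before the first space
var1 <- sub(" .*$", "", comb.vars$Var1)
var2 <- sub(" .*$", "", comb.vars$Var2)

# TRUE for rows in which one variable is constrained in both directions
conflict <- var1 == var2
sum(conflict)          # 10 of the 50 rows
which(conflict)[1:2]   # rows 1 and 7
```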

dup <- apply(comb.vars, 1, function(x) length(which(duplicated(x)))>0)
remdup1 <- comb.vars[!dup, ]

onlyvars <- apply(remdup1, 2, function(x) substr(x, 1, regexpr("\\>", x)-1))
# remove row-wise duplicates
dup <- apply(onlyvars, 1, function(x) length(which(duplicated(x)))>0)
remdup2 <- onlyvars[!dup, ]
# remove duplicates across rows
uniq <- remdup1[!duplicated(apply(remdup2, 1, function(row) paste(sort(row), collapse=""))), ] 
uniq

Only a base R solution is required.


Solution

  • You can count how many times the first character (the variable name) is repeated within a row, and then keep only the rows in which no variable name is duplicated.

    Using tidyverse:

    library(tidyverse)
    comb.vars %>% 
        rowwise() %>% 
        mutate(
            repvals = sum(duplicated(str_extract(c(Var1, Var2, Var3), "^\\w")))
        ) %>% 
        ungroup() %>% 
        filter(repvals == 0) %>% 
        select(-repvals)
    

    Returns:

    # A tibble: 40 × 3
       Var3  Var1  Var2 
       <chr> <chr> <chr>
     1 y > 0 b > 0 a < 0
     2 y > 0 c > 0 a < 0
     3 y > 0 d > 0 a < 0
     4 y > 0 e > 0 a < 0
     5 y > 0 a > 0 b < 0
     6 y > 0 c > 0 b < 0
     7 y > 0 d > 0 b < 0
     8 y > 0 e > 0 b < 0
     9 y > 0 a > 0 c < 0
    10 y > 0 b > 0 c < 0
    # ℹ 30 more rows
    

    A base R version to do the same:

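    # Count, per row, how many variable names (first character) are repeated
    # Note: gregexec() requires R >= 4.1
    comb.vars$rep <- apply(comb.vars, 1, function(x) {
            sum(duplicated(sapply(regmatches(x, gregexec("^\\w", x)), function(m) m[[1]])))
    })
    # Keep rows without repeats and drop the helper column
    comb.vars <- comb.vars[comb.vars$rep == 0, names(comb.vars) != "rep"]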
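
If gregexec() is not available, or the variable names are longer than one character, the same idea can be sketched with sub() and anyDuplicated(). This is an illustrative variant rather than the approach above: the helper name drop_conflicts is made up here, and it assumes every condition has the form "name sign value" with the name ending at the first space.

```r
# Hypothetical helper: keep rows in which every variable name occurs only once
drop_conflicts <- function(df) {
    # Strip everything from the first space on, leaving only the variable name
    vars <- apply(df, 2, function(col) sub(" .*$", "", col))
    # anyDuplicated() returns 0 when a row contains no repeated name
    keep <- apply(vars, 1, anyDuplicated) == 0
    df[keep, , drop = FALSE]
}

# Rebuild the input from the question so the example runs on its own
comb <- data.frame(in1 = paste(letters[1:5], "> 0"),
                   in2 = paste(letters[1:5], "< 0"),
                   stringsAsFactors = FALSE)
comb.vars <- with(comb, expand.grid(in1, in2, stringsAsFactors = FALSE))
comb.vars <- rbind(data.frame(Var3 = "y > 0", comb.vars, stringsAsFactors = FALSE),
                   data.frame(Var3 = "y < 0", comb.vars, stringsAsFactors = FALSE))

res <- drop_conflicts(comb.vars)
nrow(res)   # 40
```

The signs stay in the result because the variable names are only used to build the row filter; the original strings are never modified.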