Convert strings to a unique integer form


I have a vector of strings; in my case the strings are logical rules. There are many such rules, but I show only three here for clarity.

rules <- c("X[,1]>0.5 & X[,2]<1" , "X[,3]>0.2" , "X[,3]>0.3")

I would like to convert the rules to an integer form, something like this:

rules <- c("X[,1]>0.5 & X[,2]<1" , "X[,3]>0.2" , "X[,3]>0.3")
int <- rbind(c(0,0,2,5,0,1,0,0,1,0),c(1,2,0,0,0,0,0,0,0,0),c(1,1,0,0,0,0,0,0,0,0))


cbind.data.frame(rules,int)
                rules 1 2 3 4 5 6 7 8 9 10
1 X[,1]>0.5 & X[,2]<1 0 0 2 5 0 1 0 0 1  0
2           X[,3]>0.2 1 2 0 0 0 0 0 0 0  0
3           X[,3]>0.3 1 1 0 0 0 0 0 0 0  0

There are three conditions:

  1. All int vectors must be the same length.

  2. If a rule (string) is similar to another rule, their int vectors should be similar too. This is necessary in order to calculate the distance between strings or int vectors.

  3. It must be possible to convert a string to int form, and to convert the int form back to a string.

Is such a conversion possible?


Solution

  • If all the rules are similar to the ones you showed, one way to do it would be to generate a standard X matrix, parse each of the rules, and apply them to X. Each rule then yields a vector of TRUE and FALSE values (easily converted to 1 and 0) of length nrow(X).

    For example,

    set.seed(123)
    # Reference matrix: 1000 rows, 3 columns, uniform on [0, 2]
    X <- matrix(runif(3000, 0, 2), nrow = 1000)
    rules <- c("X[,1]>0.5 & X[,2]<1" , "X[,3]>0.2" , "X[,3]>0.3")
    # One row per rule, one column per row of X
    int <- matrix(NA, nrow = length(rules), ncol = nrow(X))
    # Evaluate each rule against X and store the TRUE/FALSE result as 1/0
    for (i in seq_along(rules)) 
      int[i,] <- as.numeric(eval(parse(text = rules[i])))
    rownames(int) <- rules
    
    # Pairwise Euclidean distances between the encoded rules
    dist <- matrix(NA, length(rules), length(rules),
                   dimnames = list(rules, rules))
    for (i in seq_along(rules)) 
      for (j in seq_along(rules)) 
        dist[i, j] <- sqrt(sum((int[i,] - int[j,])^2))
    
    dist
    #>                     X[,1]>0.5 & X[,2]<1 X[,3]>0.2 X[,3]>0.3
    #> X[,1]>0.5 & X[,2]<1             0.00000  24.67793  24.28992
    #> X[,3]>0.2                      24.67793   0.00000   7.28011
    #> X[,3]>0.3                      24.28992   7.28011   0.00000
    

    Created on 2021-08-29 by the reprex package (v2.0.0)
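
The same idea can be sketched more compactly, along with one possible way to handle the third condition (going back from int form to string). This is only a sketch under assumptions: `encode_rule()` and `decode_rule()` are hypothetical helper names, not part of any package, and decoding here is a lookup against the stored encodings rather than a true inverse, because a 0/1 vector alone does not determine the rule text.

```r
# Same reference matrix and rules as in the example above
set.seed(123)
X <- matrix(runif(3000, 0, 2), nrow = 1000)
rules <- c("X[,1]>0.5 & X[,2]<1", "X[,3]>0.2", "X[,3]>0.3")

# Hypothetical helper: evaluate one rule against X, return a 0/1 integer vector
encode_rule <- function(rule, X) {
  as.integer(eval(parse(text = rule), envir = list(X = X)))
}

# vapply() gives one column per rule; transpose so each rule is a row.
# Row names are the rule strings, which keeps the mapping back to text.
codes <- t(vapply(rules, encode_rule, integer(nrow(X)), X = X))

# stats::dist() computes Euclidean distances between rows by default
d <- as.matrix(dist(codes))

# Hypothetical helper: look up which stored rule(s) match a given 0/1 vector.
# Two different rules with identical encodings would both be returned.
decode_rule <- function(code, codes) {
  hit <- which(apply(codes, 1, function(row) all(row == code)))
  names(hit)
}

decode_rule(codes[2, ], codes)
```

Since `eval(parse())` runs arbitrary code, this approach is only reasonable when the rule strings come from a trusted source.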