
How to split decimal numbers followed by letters?


I have data like the following:

A <- c("-0.00023--0.00243unitincrease", "-0.00176-0.02176pmol/Lincrease(replication)",
       "0.00180-0.01780%varianceunitdecrease")

I want to extract the numeric part and the remaining text into two columns, B and C. After extraction, the result should be the following data frame:

#                                           A                 B                           C
#               -0.00023--0.00243unitincrease -0.00023--0.00243                unitincrease
# -0.00176-0.02176pmol/Lincrease(replication)  -0.00176-0.02176 pmol/Lincrease(replication)
#        0.00180-0.01780%varianceunitdecrease   0.00180-0.01780       %varianceunitdecrease

How can I get that result in R?


Solution

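  • Use strsplit with a positive lookbehind and a positive lookahead, which together match the zero-width position between a digit and the first character of the text. The class [a-z%] covers the lowercase letters a to z plus the % sign and should be expanded if the text can start with other characters.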

    r1 <- do.call(rbind, strsplit(A, "(?<=\\d)(?=[a-z%])", perl=TRUE))
    res1 <- setNames(as.data.frame(cbind(A, r1)), LETTERS[1:3])
    res1
    #                                             A                 B                           C
    # 1               -0.00023--0.00243unitincrease -0.00023--0.00243                unitincrease
    # 2 -0.00176-0.02176pmol/Lincrease(replication)  -0.00176-0.02176 pmol/Lincrease(replication)
    # 3        0.00180-0.01780%varianceunitdecrease   0.00180-0.01780       %varianceunitdecrease
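
If the text part can start with a character outside [a-z%], the lookahead class has to be widened. A minimal sketch, assuming hypothetical inputs whose text starts with an uppercase letter (the vector `A2` and the wider class `[A-Za-z%]` are illustrative and not part of the original data):

```r
# Hypothetical inputs: the text part of the first string starts with "K".
A2 <- c("0.5-1.5Kgincrease", "-0.2-0.4%change")

# The original class [a-z%] finds no digit/text boundary in the first string,
# so that string comes back unsplit.
narrow <- strsplit(A2, "(?<=\\d)(?=[a-z%])", perl = TRUE)
lengths(narrow)

# A wider class that also accepts uppercase letters splits both strings.
wide <- strsplit(A2, "(?<=\\d)(?=[A-Za-z%])", perl = TRUE)
do.call(rbind, wide)
```

Checking `lengths()` of the split result is a cheap way to spot rows where the pattern did not find a boundary before binding everything into a matrix.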
    

    You may also want the two numbers as separate numeric columns:

    res2 <- type.convert(as.data.frame(
      do.call(rbind, strsplit(A, "(?<=\\d)-|(?<=\\d)(?=[a-z%])", perl=TRUE))))
    res2
    #         V1       V2                          V3
    # 1 -0.00023 -0.00243                unitincrease
    # 2 -0.00176  0.02176 pmol/Lincrease(replication)
    # 3  0.00180  0.01780       %varianceunitdecrease
    
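
The `(?<=\\d)-` alternative in the pattern above is what keeps the minus signs attached to the numbers: a hyphen counts as a separator only when a digit comes directly before it. A small sketch on a single range string (the names `x`, `plain` and `guarded` are illustrative):

```r
x <- "-0.00023--0.00243"

# Splitting on every hyphen consumes the minus signs as well and
# leaves empty strings where they used to be.
plain <- strsplit(x, "-")[[1]]

# Requiring a digit before the hyphen splits only at the range separator,
# so both bounds keep their sign.
guarded <- strsplit(x, "(?<=\\d)-", perl = TRUE)[[1]]
as.numeric(guarded)
```

Here `plain` has four elements, two of them empty, while `guarded` has exactly the two signed bounds.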

    The structure of res2 shows that the first two columns are numeric:

    str(res2)
    # 'data.frame': 3 obs. of  3 variables:
    # $ V1: num  -0.00023 -0.00176 0.0018
    # $ V2: num  -0.00243 0.02176 0.0178
    # $ V3: Factor w/ 3 levels "%varianceunitdecrease",..: 3 2 1
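
Whether V3 comes back as a factor, as in the str() output above, or as character depends on the R version and on the `as.is` argument of type.convert. As a sketch, assuming you would rather keep the text column as plain character, both settings can be made explicit (the names `parts` and `res3` are illustrative):

```r
A <- c("-0.00023--0.00243unitincrease",
       "-0.00176-0.02176pmol/Lincrease(replication)",
       "0.00180-0.01780%varianceunitdecrease")

# Same split as above: range separator first, then the digit/text boundary.
parts <- do.call(rbind,
                 strsplit(A, "(?<=\\d)-|(?<=\\d)(?=[a-z%])", perl = TRUE))

# stringsAsFactors = FALSE keeps the matrix columns as character, and
# as.is = TRUE converts numeric-looking columns while leaving text as character.
res3 <- type.convert(as.data.frame(parts, stringsAsFactors = FALSE),
                     as.is = TRUE)
sapply(res3, class)
```

Passing `as.is = FALSE` instead would request factors for the text column.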