Search code examples
rstrsplit

How to read a comma-separated numerical string and perform various functions on it


I have a column with numerical comma-separated strings, e.g., '0,1,17,200,6,0,1'. I want to create new columns for the sums of those numbers (or substrings) in the strings that are not equal to 0.

I can use something like this to count the sum of non-zero numbers for the whole string:

df$F1 <- sapply(strsplit(df1$a, ","), function(x) length(which(x>0)))
[1] 5

This outputs '5' as the number of substrings in for the example string above, which is correct as the number of substrings in '0,1,17,200,6,0,1' is indeed 5.

The challenge, however, is to be able to restrict the number of substrings. For example, how can I get the the count for only the first 3 or 6 substrings in the string?


Solution

  • You can use gsub and backreference to cut the string to the desired length before you count how many substrings are > 0:

    DATA:

    df1 <- data.frame(a = "0,1,17,200,6,0,1")
    df1$a <- as.character(df1$a)
    

    SOLUTION:

    First cut the string to whatever number of substrings you want--here, I'm cutting it to three numeric characters (the first two of which are followed by a comma)--and store the result in a new vector:

    df1$a_3 <- gsub("^(\\d+,\\d+,\\d+)(.*)", "\\1", df1$a)
    df1$a_3
    [1] "0,1,17"
    

    Now insert the new vector into your sapply statement to count how many substrings are greater than 0:

    sapply(strsplit(df1$a_3, ","), function(x) length(which(x>0)))
    [1] 2
    

    To vary the number of substrings, vary the number of repetitions of \\d+ in the pattern accordingly. For example, this works for 6 substrings:

    df1$a_6 <- gsub("^(\\d+,\\d+,\\d+,\\d+,\\d+,\\d+)(.*)", "\\1", df1$a)
    sapply(strsplit(df1$a_6, ","), function(x) length(which(x>0)))
    [1] 4
    

    EDIT TO ACCOUNT FOR NEW SET OF QUESTIONS:

    To compute the maximum value of substrings > 0, exemplified here for df1$a, the string as a whole (for the restricted strings, just use the relevant vector accordingly, e.g., df1$a_3, df1$a_6 etc.): First split the string using strsplit, then unlist the resulting list using unlist, and finally convert the resulting vector from character to numeric, storing the result in a vector, e.g., string_a:

    string_a <- as.numeric(unlist(strsplit(df1$a, ",")))
    string_a
    [1]   0   1  17 200   6   0   1
    

    On that vector you can perform all sorts of functions, including max for the maximum value, and sum for the sum of the values:

    max(string_a)
    [1] 200
    
    sum(string_a)
    [1] 225
    

    Re the number of values that are equal to 0, adjust your sapply statement by setting x == 0:

    sapply(strsplit(df1$a, ","), function(x) length(which(x == 0)))
    [1] 2
    

    Hope this helps!