
splitstackshape pkg - concat.split.expanded returns "NAs introduced by coercion" errors


I'm following the instructions in "Dummy variables from a string variable" to convert a column of strings (words separated by spaces) into dummy variables (0/1, indicating whether a word is absent from or present in that row's string) using concat.split.expanded, but I get many warnings like the following:

In lapply(listOfValues, as.integer) : NAs introduced by coercion

preceded by one of

Error in seq.default(min(vec), max(vec)) : 'from' cannot be NA, NaN or infinite

I'm pretty sure there aren't any NAs in the column to be converted, let alone that many. Not sure how to go about fixing this. Thanks!

The command I've been running that produces the problem:

concat.split.expanded(dataset, "stringvarname", sep = " ", mode = "binary", drop = FALSE)

The problem occurs with or without the fill argument.


Solution

  • You need to pass type = "character" to specify that you are splitting concatenated strings ("var2" in the sample data below) and not numeric values concatenated as strings ("var3" in the sample data below), which is what the function assumes by default.

    Here's an example that reproduces your error and shows the working solution:

    df = data.frame(var1 = 1:2, var2 = c("a b c", "a c d"), var3 = c("1 2 3", "1 2 5"))
    library(splitstackshape)
    
    cSplit_e(df, "var3", sep = " ")
    #   var1  var2  var3 var3_1 var3_2 var3_3 var3_4 var3_5
    # 1    1 a b c 1 2 3      1      1      1     NA     NA
    # 2    2 a c d 1 2 5      1      1     NA     NA      1
    
    ## Will give you an error
    cSplit_e(df, "var2", sep = " ")
    # Error in seq.default(min(vec), max(vec)) :
    #   'from' cannot be NA, NaN or infinite
    # In addition: Warning messages:
    # 1: In lapply(listOfValues, as.integer) : NAs introduced by coercion
    # 2: In lapply(listOfValues, as.integer) : NAs introduced by coercion
    
    cSplit_e(df, "var2", sep = " ", type = "character")
    #   var1  var2  var3 var2_a var2_b var2_c var2_d
    # 1    1 a b c 1 2 3      1      1      1     NA
    # 2    2 a c d 1 2 5      1     NA      1      1
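
    The question asks for 0/1 dummies, while the output above has NA for absent words. One possible way to get zeros is the fill argument of cSplit_e. This is a minimal sketch that rebuilds the same sample data and assumes fill = 0 behaves as documented in your installed version of splitstackshape:

    ```r
    library(splitstackshape)

    # Same sample data as above
    df <- data.frame(var1 = 1:2,
                     var2 = c("a b c", "a c d"),
                     var3 = c("1 2 3", "1 2 5"))

    # type = "character" avoids the integer coercion;
    # fill = 0 is assumed to replace the default NA for absent words
    out <- cSplit_e(df, "var2", sep = " ", type = "character", fill = 0)
    out
    ```

    If the result still shows NA, check ?cSplit_e for the arguments your version supports.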
    

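    Why? With the default type = "numeric", cSplit_e converts the split values with as.integer (letters become NA, hence the warnings) and then calls seq(min(vec), max(vec)), and seq is for numeric input.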

    > seq("a", "c")
    Error in seq.default("a", "c") : 'from' cannot be NA, NaN or infinite
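
    The same chain of events can be reproduced in base R without the package. This is only a sketch of the mechanism suggested by the error messages, not the package's actual source; the names tokens, vec, res, words and dummies are made up for illustration:

    ```r
    # Split the strings into words, as the function would
    tokens <- strsplit(c("a b c", "a c d"), " ", fixed = TRUE)

    # as.integer() cannot parse letters, so every token becomes NA
    # (this is where "NAs introduced by coercion" comes from)
    vec <- unlist(suppressWarnings(lapply(tokens, as.integer)))
    all(is.na(vec))  # TRUE

    # min() and max() of NA are NA, so seq() has no valid endpoints and errors
    res <- tryCatch(seq(min(vec), max(vec)),
                    error = function(e) conditionMessage(e))

    # Treating the tokens as characters works: one 0/1 column per distinct word
    words <- sort(unique(unlist(tokens)))
    dummies <- do.call(rbind, lapply(tokens, function(x) as.integer(words %in% x)))
    colnames(dummies) <- paste0("var2_", words)
    dummies
    #      var2_a var2_b var2_c var2_d
    # [1,]      1      1      1      0
    # [2,]      1      0      1      1
    ```

    The exact wording of the seq error stored in res may differ between R versions.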