
How can I split a string into character labels and numeric values in R?


Struggling newbie here, please pardon ungraceful explanations, and let me know what I need to clarify.

I have an object that looks like this, a vector of strings:

"aaaa,1,1,1,1,0"
"abba,0,0,1,1,1"
"bbaa,1,0,0,0,1"

I'd like to split out the four-letter labels as a character vector and convert the remainder to numbers, so that I'd end up with a data frame like this: 3 obs. of 6 variables, with the labels as character and the numbers as numeric:

aaaa 1 1 1 1 0
abba 0 0 1 1 1
bbaa 1 0 0 0 1

And then I want to add "column labels" to it and end up with

NAME 1 2 3 4 5
aaaa 1 1 1 1 0
abba 0 0 1 1 1
bbaa 1 0 0 0 1

and I'd also like a pony.

I feel like I have bits and pieces...I can split out the four letter labels using

substr(data,1,4) and that works to get a vector like

"aaaa", "abba", "bbaa".

But I cannot figure out what to use to get the rest of the string, the number part, back as a vector. substr(data,5,last) doesn't work and I prefer not to say substr(data,5,14) because despite my example here the strings won't always be 14 characters long. Is there a way to specify substr(data,5,"to the end of the string?")
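For the "to the end of the string" part, here is a minimal sketch, assuming `data` is the character vector shown above. `substring()` has a `last` argument that defaults to a very large number, so omitting it reads to the end; `substr()` with `nchar(data)` should do the same. Note that position 5 is the comma, so the numbers start at position 6.

```r
data <- c("aaaa,1,1,1,1,0",
          "abba,0,0,1,1,1",
          "bbaa,1,0,0,0,1")

# First four characters are the label
labels <- substr(data, 1, 4)

# Everything from position 6 onward (position 5 is the comma);
# substring() runs to the end of each string when `last` is omitted
rest <- substring(data, 6)

# Same result with substr(), using each string's own length as the stop
rest2 <- substr(data, 6, nchar(data))
```

Both forms are vectorised, so strings of different lengths are handled element by element.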

Then, to convert the string to numbers I was trying

as.integer(unlist(strsplit(data,",")))

on the original data, and I got back a single long vector of 1s and 0s, but with the labels (the "aaaa"s and so on) replaced with NAs.
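The NAs appear because `as.integer()` cannot parse a label like "aaaa", and `unlist()` flattens everything into one vector. A rough sketch of one way around that, assuming every string has the same number of fields: keep the list that `strsplit()` returns, take the first piece of each element as the label, and convert only the remaining pieces. The names `parts`, `labels`, and `values` are just illustrative.

```r
data <- c("aaaa,1,1,1,1,0",
          "abba,0,0,1,1,1",
          "bbaa,1,0,0,0,1")

# One character vector per input string, e.g. c("aaaa", "1", "1", "1", "1", "0")
parts <- strsplit(data, ",", fixed = TRUE)

# First piece of each element is the label
labels <- vapply(parts, function(p) p[1], character(1))

# Convert the remaining pieces to integers and stack them as matrix rows
values <- do.call(rbind, lapply(parts, function(p) as.integer(p[-1])))
```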

I'm stuck trying to put all the pieces together.

[Why do I have my numbers and labels mixed together in a string in the first place, you might ask? Because I wanted to replace all instances of "1,0,1" with "1,1,1" and using paste() to convert the numbers to strings and using gsub() on the strings to effect the replacement was the only way I could get that to work.]


Solution

  • You started on the right path. I had to add a few more steps, but I got the following to work:

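    input = c(
        "aaaa,1,1,1,1,0",
        "abba,0,0,1,1,1",
        "bbaa,1,0,0,0,1"
    )
    
    # Split every string on commas, flatten the pieces, and refill them
    # row by row into a character matrix with one row per input string.
    # type.convert() then guesses each column's type; as.is = TRUE keeps
    # the label column as character instead of converting it to a factor.
    df = type.convert(
        as.data.frame(
            matrix(unlist(strsplit(input, ',')), byrow = TRUE, nrow = length(input))),
        as.is = TRUE
    )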
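
The question also asks for column labels, which the code above leaves at the defaults (V1, V2, ...). A possible sketch, not the only way: since each string is already a comma-separated line, `read.csv()` can parse the vector directly through its `text` argument, and the names can be assigned afterwards. The names "NAME" and "1" to "5" are taken from the question.

```r
input <- c("aaaa,1,1,1,1,0",
           "abba,0,0,1,1,1",
           "bbaa,1,0,0,0,1")

# Treat each string as one line of a header-less CSV file
df <- read.csv(text = input, header = FALSE, stringsAsFactors = FALSE)

# First column is the label, the rest are numbered 1..n
names(df) <- c("NAME", seq_len(ncol(df) - 1))

str(df)
```

Because the numeric column names are not syntactic R names, they need `df[["1"]]` or backticks (``df$`1` ``) when accessed.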