Tags: r, string, dataframe, extract, alphanumeric

Extract all words from a string and create a column with the result


I have a data frame (data3) with a column named "Collector" that contains alphanumeric strings, for example "Ruiz and Galvis 650". I need to extract the letters and the digits separately and create two new columns: one with the number from each string (ColID) and another with all the words (Col):

INPUT:

Collector                       Times     Sample
Ruiz and Galvis 650             9         SP.1              
Smith et al 469                 8         SP.1

EXPECTED OUTPUT

Collector                       Times     Sample     ColID    Col
Ruiz and Galvis 650             9         SP.1        650     Ruiz and Galvis
Smith et al 469                 8         SP.1        469     Smith et al

I have tried the following but when I try to save the file I get an error (Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol, : unimplemented type 'list' in 'EncodeElement'):

library(stringr)

# Numeric part of the collector string
regexp <- "[[:digit:]]+"
data3$ColID <- str_extract(data3$Collector, regexp)

# Alphabetic part of the collector string
regexp <- "[[:alpha:]]+"
data3$Col <- str_extract_all(data3$Collector, regexp)

write.table(data3, file = "borrar2.csv", quote = TRUE, sep = ",", row.names = FALSE)

Solution

  • The problem is that str_extract_all returns a list rather than a character vector: each element holds every match found in the corresponding input string. For example:

    > dput(str_extract_all("Ruiz and Galvis 650", "[[:alpha:]]+"))
    list(c("Ruiz", "and", "Galvis"))
    

    write.table cannot write a data frame that contains a list column like this, which is what the "unimplemented type 'list'" error reports.

    If, however, you update the regex pattern to match spaces as well as letters, you can go back to using str_extract instead:

    > dput(str_extract("Ruiz and Galvis 650", "[[:alpha:] ]+"))
    "Ruiz and Galvis "
    

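    Note the space inside the bracket expression of the second regex. It makes the letters and spaces match as one string, so Col becomes an ordinary character column and the data frame can be written to a file. The match keeps the trailing space before the number; trimws() or str_trim() removes it.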
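
    Putting the pieces together, the following is a minimal sketch rather than a definitive implementation. It assumes stringr is installed, rebuilds data3 by hand from the INPUT table in the question, and writes to a temporary file (out_file is a hypothetical name, not from the original post). It also shows one possible alternative that keeps str_extract_all and collapses each row's matches back into a single string.

```r
library(stringr)

# Sample data rebuilt by hand from the question's INPUT table (assumption)
data3 <- data.frame(
  Collector = c("Ruiz and Galvis 650", "Smith et al 469"),
  Times     = c(9, 8),
  Sample    = c("SP.1", "SP.1"),
  stringsAsFactors = FALSE
)

# First run of digits in each row (returned as a character vector)
data3$ColID <- str_extract(data3$Collector, "[[:digit:]]+")

# First run of letters and spaces, with the trailing space trimmed
data3$Col <- str_trim(str_extract(data3$Collector, "[[:alpha:] ]+"))

# Alternative: keep str_extract_all, then paste each row's words together
# so the result is a character vector instead of a list
words <- str_extract_all(data3$Collector, "[[:alpha:]]+")
col_collapsed <- vapply(words, paste, character(1), collapse = " ")

# No list columns remain, so write.table can serialize the data frame
out_file <- tempfile(fileext = ".csv")  # hypothetical output path
write.table(data3, file = out_file, quote = TRUE, sep = ",", row.names = FALSE)
```

    ColID stays a character column here; wrap it in as.integer() if numeric values are needed. Both patterns only match letters, spaces, and digits, so names containing punctuation (an apostrophe or a hyphen, for instance) would be cut short and would need a wider character class.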