Search code examples
rtextdplyr

Extract number and following text and create multiple new columns in R


I have free-text data with numerous references to specific questions, and I'd like to organize it as shown below.

I'm able to create columns that note mentions of a certain topic (if the respondent references it by number), but I'd like a way to extract all of the text following the number, until another number is encountered.

Thanks in advance for any help!

library(tidyverse, warn.conflicts = F)

# Data
df <- data.frame(comment = c("topic 1: this is fine. 4 this is fine too. #9 not so good", "1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea"))

# I can identify the mentions if a respondent specifies the number they are responding to
df <- df %>% 
  mutate(mention = map(str_extract_all(comment, "[0-9]+"), as.numeric)) %>% 
  unnest_wider(col = mention, names_sep = "_")

# Ideal output
df_ideal <- structure(list(comment = c("topic 1: this is fine. 4 this is fine too. #9 not so good", 
"1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea"
), mention_1 = c(1, 1), mention_2 = c(4, 17), mention_3 = c(9, 
25), mention_4 = c(NA, 43), comment_1 = c("1: this is fine.", 
"1 ok this is fine."), comment_2 = c("4 this is fine too.", "17 i do not like this idea."
), comment_3 = c("9 not so good", "25 great idea"), comment_4 = c(NA, 
"42 nice idea")), row.names = c(NA, -2L), class = c("tbl_df", 
"tbl", "data.frame"))

Created on 2021-04-18 by the reprex package (v2.0.0)


Solution

  • An option with strsplit to split at one or more space (\\s+) that follows a dot (\\. - . - metacharacters are escaped), and precedes a digit or # (regex lookaround), then we loop over the output list with lapply, remove any characters that are not digits (\\D+) from the start (^) of each of the string with sub, rbind the list elements and assign the 'comment_' columns into the original dataset 'df'

    df[paste0('comment_', 1:3)] <- do.call(rbind, lapply(strsplit(df$comment, 
          "(?<=\\.)\\s+(?=[0-9#])", perl = TRUE), function(x) sub("^\\D+", "", x)))
    

    -output

    df
    # A tibble: 2 x 7
      comment                                                      mention_1 mention_2 mention_3 comment_1          comment_2                   comment_3    
      <chr>                                                            <dbl>     <dbl>     <dbl> <chr>              <chr>                       <chr>        
    1 topic 1: this is fine. 4 this is fine too. #9 not so good            1         4         9 1: this is fine.   4 this is fine too.         9 not so good
    2 1 ok this is fine. 17 i do not like this idea. 25 great idea         1        17        25 1 ok this is fine. 17 i do not like this idea. 25 great idea
    

    Update

    If the length differ (as in the updated example), we can pad NA at the end based on the max lengths from the list to make the list elements equal in length before doing the rbind

    lst1 <- lapply(strsplit(df$comment, 
       "(topic \\d+)(*SKIP)(*F)|\\s+(?=[0-9#])", perl = TRUE), 
             function(x) sub("^\\D+", "", x))
    mx <- max(lengths(lst1))
    df[paste0('comment_', seq_len(mx))] <- do.call(rbind,
               lapply(lst1, `length<-`, mx))
    

    -output

    df
    # A tibble: 2 x 9
      comment                                                             mention_1 mention_2 mention_3 mention_4 comment_1          comment_2                 comment_3    comment_4  
      <chr>                                                                   <dbl>     <dbl>     <dbl>     <dbl> <chr>              <chr>                     <chr>        <chr>      
    1 topic 1: this is fine. 4 this is fine too. #9 not so good                   1         4         9        NA 1: this is fine.   4 this is fine too.       9 not so go… <NA>       
    2 1 ok this is fine. 17 i do not like this idea. 25 great idea 43 co…         1        17        25        43 1 ok this is fine. 17 i do not like this id… 25 great id… 43 cool id…
    >