
Extract a certain part of a string in R


I would like to extract a part of the string. Here is an example dataset.

df <- data.frame(id = c(1,2),
                 string = c('<itemResponse><response id="editIn_1.RESPONSE_1"><value>ETC_CHOICE_2</value>',
                            '<itemResponse><response id="editIn_1.RESPONSE_1"><value>ETC_CHOICE_4</value>'))

> df
  id                                                                       string
1  1 <itemResponse><response id="editIn_1.RESPONSE_1"><value>ETC_CHOICE_2</value>
2  2 <itemResponse><response id="editIn_1.RESPONSE_1"><value>ETC_CHOICE_4</value>

I would like to extract ETC_CHOICE_2 and ETC_CHOICE_4 from the long string. My desired output would be:

> df
  id                                                                       string  extract
1  1 <itemResponse><response id="editIn_1.RESPONSE_1"><value>ETC_CHOICE_2</value>  ETC_CHOICE_2
2  2 <itemResponse><response id="editIn_1.RESPONSE_1"><value>ETC_CHOICE_4</value>  ETC_CHOICE_4

Does anyone have any idea?

Thanks!


Solution

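  • One option is htmlParse from the XML package: parse the strings as HTML, select the <value> nodes with an XPath query, and take their text. This relies on each row containing exactly one <value> element, so that the result lines up with the rows.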

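    library(XML)
    library(dplyr)

    # Parse the strings as HTML, select every <value> node with XPath,
    # and use each node's text as the new column
    df %>%
      mutate(extract = htmlParse(string) %>%
                        getNodeSet("//value") %>%
                        xmlValue)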
    

    Output:

    #  id                                                                       string      extract
    #1  1 <itemResponse><response id="editIn_1.RESPONSE_1"><value>ETC_CHOICE_2</value> ETC_CHOICE_2
    #2  2 <itemResponse><response id="editIn_1.RESPONSE_1"><value>ETC_CHOICE_4</value> ETC_CHOICE_4
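
If loading an XML parser feels heavy for strings this regular, a base R regular expression is a possible alternative. The sketch below is not part of the answer above. It assumes every string contains exactly one <value>...</value> pair with no tags nested inside it, and it rebuilds the question's example data with stringsAsFactors = FALSE so the column is character on any R version.

```r
# Example data from the question
df <- data.frame(
  id = c(1, 2),
  string = c(
    '<itemResponse><response id="editIn_1.RESPONSE_1"><value>ETC_CHOICE_2</value>',
    '<itemResponse><response id="editIn_1.RESPONSE_1"><value>ETC_CHOICE_4</value>'
  ),
  stringsAsFactors = FALSE
)

# Assumption: one <value>...</value> pair per string, no tags inside it.
# Capture the text between the tags and replace the whole string with it.
df$extract <- sub(".*<value>([^<]*)</value>.*", "\\1", df$string)

df$extract
# [1] "ETC_CHOICE_2" "ETC_CHOICE_4"
```

Note that sub() returns a string unchanged when the pattern does not match, so malformed rows would need a separate check. For input less regular than this example, a real parser such as the one above is likely the safer choice.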