
Replacing some characters in a text column in R


I have a dataset with a text column. Each entry contains a term made of two letters, such as sa, followed by two digits. The letters can be anything from a to z, in either lower or upper case. A snapshot of the data is as follows:

library(dplyr)  # provides %>% and select()

df_new <- data.frame(
  given_info=c('SA12 is given','he has his sa12',
         'she will get Sr15','why not having an ra31',
         'his tA23 is missing', 'pa12 is given'))

df_new %>% select(given_info)

              given_info
1          SA12 is given
2        he has his sa12
3      she will get Sr15
4 why not having an ra31
5    his tA23 is missing
6          pa12 is given

I need to replace any term made of sa (or any other combination of two letters) followed by two digits with the term document. Hence, the desired outcome is:

                  given_info
1          document is given
2        he has his document
3      she will get document
4 why not having an document
5    his document is missing
6          document is given

Thank you so much for your help in advance!


Solution

  • We can use gsub() here as follows:

    df_new$given_info <- gsub("\\b[A-Za-z]{2}\\d{2}\\b", "document", df_new$given_info)
    df_new
    
                      given_info
    1          document is given
    2        he has his document
    3      she will get document
    4 why not having an document
    5    his document is missing
    6          document is given
    

    The regex pattern used here says to match:

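    • \b a word boundary (a position where a word character meets a non-word character or the start of the string, so the two letters cannot be preceded by another letter, digit, or underscore)
    • [A-Za-z]{2} match exactly 2 letters, in either case
    • \d{2} match exactly 2 digits
    • \b another word boundary (the digits cannot be followed by another letter, digit, or underscore)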
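
One way to check what the pieces of the pattern pick up is to extract the matches instead of replacing them. The following is a minimal sketch using base R's gregexpr() and regmatches(); the names x, pattern, and matched are illustrative and not part of the original question.

```r
# Sample texts copied from the question's given_info column
x <- c("SA12 is given", "he has his sa12", "she will get Sr15",
       "why not having an ra31", "his tA23 is missing", "pa12 is given")

# Same pattern as in the gsub() call: two letters, two digits, bounded on both sides
pattern <- "\\b[A-Za-z]{2}\\d{2}\\b"

# gregexpr() locates every match; regmatches() pulls out the matched text
matched <- unlist(regmatches(x, gregexpr(pattern, x)))
print(matched)
```

Each element of matched should be one of the two-letter, two-digit terms that gsub() would overwrite, which makes this a cheap preview before modifying the column.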

    The word boundaries ensure, for example, that abc12 in your text does not get touched. Without the word boundaries, the pattern would also match inside longer terms (the bc12 in abc12 would be replaced, giving adocument), which is probably not what you want.
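
To make the effect of the word boundaries concrete, here is a small sketch that runs the pattern with and without \b on a few made-up strings. The inputs abc12 and sa123 are hypothetical test cases, not rows from the original data.

```r
# Hypothetical inputs: one valid term, one with an extra letter, one with an extra digit
x <- c("sa12 is given", "abc12 is given", "sa123 is given")

# With word boundaries: only the standalone two-letter, two-digit term is replaced
with_b <- gsub("\\b[A-Za-z]{2}\\d{2}\\b", "document", x)

# Without word boundaries: the pattern also matches inside longer terms
without_b <- gsub("[A-Za-z]{2}\\d{2}", "document", x)

print(with_b)     # "document is given" "abc12 is given"    "sa123 is given"
print(without_b)  # "document is given" "adocument is given" "document3 is given"
```

Whether the second behaviour is acceptable depends on the data; if terms such as abc12 never occur, both patterns give the same result.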