Tags: r, string, replace, special-characters, gsub

R: Replace Special Characters


I have a data frame with special characters, as below:

Key  Q1   Q2
22   aSk   aÃ…Â k
23   aSk   aÃ…Â k
24   aSk   aÃ…Â k

I would like to replace "aÃ…Â k" (including the space before the k) in Q2 with "aSk", to get the result below (same as Q1):

Key  Q1   Q2 
22   aSk   aSk
23   aSk   aSk
24   aSk   aSk

I have tried using the gsub function in R:

df$Q2 <- gsub("[Ã…Â]", "S", df$Q2) 

but this does not remove the space, and I get the result below instead:

Key  Q1   Q2 
22   aSk   aSSS k
23   aSk   aSSS k
24   aSk   aSSS k

What is wrong with my code, and how can I remove the space and the "SSS" in R?

(The actual word in my raw CSV file is "aÅ k". However, it appears as "aÃ…Â k" in R.)

Thanks.


Solution

  • We can match one or more characters that are not letters and replace them with a single "S"

    df$Q2 <- sub("[^A-Za-z]+", "S", df$Q2)
    df$Q2
    #[1] "aSk" "aSk" "aSk"
    

    Or we capture the leading alphabetic characters as a group, ([A-Za-z]*), anchored at the start (^) of the string, match the non-alphabetic characters that follow, and replace the match with the backreference to the captured group (\\1) followed by "S"

    sub("^([A-Za-z]*)[^A-Za-z]+", "\\1S", df$Q2)
    #[1] "aSk" "aSk" "aSk"
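
As for why the original attempt returned "aSSS k": a character class such as [Ã…Â] matches one character at a time, and gsub replaces every match separately, so three characters become three "S", while the space, which is not in the class, is left alone. The sketch below reproduces this on a hypothetical reconstruction of the data (the garbled text is written with \u escapes, and the space is assumed to be an ordinary space rather than a non-breaking one; if it is not, the [^A-Za-z]+ pattern above still covers it).

```r
# Hypothetical reconstruction of the question's data. The garbled text is
# written with \u escapes so the script does not depend on file encoding.
garbled <- "a\u00c3\u2026\u00c2 k"   # displays as "aÃ…Â k"
df <- data.frame(Key = 22:24, Q1 = "aSk", Q2 = garbled,
                 stringsAsFactors = FALSE)

# Original attempt: each of the three special characters is its own match,
# so each one is replaced by "S"; the space is not in the class.
attempt <- gsub("[\u00c3\u2026\u00c2]", "S", df$Q2)
print(attempt)   # every element is "aSSS k"

# Adding the space to the class and a "+" quantifier makes the whole run
# a single match, which is replaced by one "S".
fixed <- gsub("[\u00c3\u2026\u00c2 ]+", "S", df$Q2)
print(fixed)     # every element is "aSk"
```

Listing the exact characters is brittle if other garbled sequences occur in the column, which is one reason to prefer the negated class [^A-Za-z]+ shown in the answer.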