Tags: r, encoding, data-cleaning

Munging text strings with okinas and other Hawaiian diacritical marks


I am using R to clean street addresses from Hawaii. The addresses were entered with Hawaiian diacritical marks. When using R on OS X, I can easily use gsub() to remove the diacritics; however, on machines running 64-bit Windows, R shows strange characters such as "â€" in place of the okina (‘). I suspect it could be an encoding issue, so I have included the encoding parameter as follows:

address_file <- read.csv("file.csv", encoding="UTF-8")

Although this solved most of the strange characters, R could no longer match certain diacritics such as the okina. For example, I would use the following syntax, but the okina is not removed:

gsub("‘", "", hiplaces$name) 

Can someone please help with solving this issue on a PC running 64-bit Windows? I suspect either 1) it is an encoding issue and I am choosing the incorrect encoding, or 2) there is a gsub() solution that can remove/replace the diacritics. The data I am trying to clean looks something like the example below:

hiplaces <- data.frame(id = 1:3)
hiplaces$name <- c("‘Imiola Congregational Church", "‘Ōla‘a First Hawaiian    Congregational Church", "Nā‘ālehu Community Center")

gsub("‘", "", hiplaces$name) 

TIA.


Solution

  • Since your end result is a set of street addresses, you should be OK with retaining only alphanumeric characters (plus spaces, slashes, and straight apostrophes). Under this assumption, the following should work:

    hiplaces <- data.frame(id = 1:3)
    hiplaces$name <- c("‘Imiola Congregational Church",
                       "‘Ōla‘a First Hawaiian    Congregational Church",
                       "Nā‘ālehu Community Center")
    
    # Keep alphanumerics, forward slashes, straight apostrophes, and
    # spaces; everything else (including the okina) is removed
    hiplaces$name <- gsub("[^[:alnum:]/' ]", "", hiplaces$name)
    
    > hiplaces$name
    [1] "Imiola Congregational Church"
    [2] "Olaa First Hawaiian    Congregational Church"
    [3] "Naalehu Community Center"