
Turkish characters problem while plotting graphs in R igraph


I have a dataset of tweets in Turkish. I'm trying to do text mining with the tm package and plot the resulting term network with the igraph package.

    library(tm)
    library(igraph)
    # build corpus
    corpus <- iconv(deneme$text, to = "utf-8-mac")
    corpus <- Corpus(VectorSource(corpus))
    removeURL <- function(x) gsub('http[[:alnum:]]*', '', x)
    cleanset <- tm_map(corpus, content_transformer(removeURL))
    cleanset <- tm_map(cleanset, stripWhitespace)
    # term-document matrix: keep frequent terms, then binarize the counts
    tdm <- TermDocumentMatrix(cleanset)
    tdm <- as.matrix(tdm)
    tdm <- tdm[rowSums(tdm) > 30, ]
    tdm[tdm > 1] <- 1
    # term-term co-occurrence matrix
    termM <- tdm %*% t(tdm)
    # network
    g <- graph_from_adjacency_matrix(termM, weighted = TRUE, mode = 'undirected')
    g <- simplify(g)
    V(g)$label <- V(g)$name
    V(g)$degree <- degree(g)
    # plot
    plot(g,
         vertex.color = 'green',
         vertex.size = 3,
         vertex.label.dist = 1.5)

In the output plot, Turkish characters such as "ş ğ ü" do not appear correctly. What might be the problem?

These are my RStudio locale settings:

    Sys.getlocale()
    [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
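
Before looking at the plot itself, it may help to check what the strings actually contain. The sketch below uses only base R; `sample_text` is a hypothetical stand-in for `deneme$text`. It checks that the text is valid UTF-8 and compares the code points of a precomposed "ş" with those of a decomposed one ("s" followed by a combining cedilla). Whether the `utf-8-mac` conversion produces the decomposed form on a given system is an assumption here, not something verified in this post.

```r
# Hypothetical stand-in for deneme$text; \u escapes keep the script ASCII-only.
sample_text <- c("\u015feker", "da\u011f", "g\u00fcl")

# TRUE for each element whose bytes form valid UTF-8.
valid <- validUTF8(sample_text)

# Declared encoding of each string ("UTF-8", "latin1" or "unknown").
declared <- Encoding(sample_text)

# Code points of a precomposed "ş" (a single code point, U+015F).
composed <- utf8ToInt("\u015f")

# Code points of a decomposed "ş": "s" plus a combining cedilla (U+0327).
decomposed <- utf8ToInt("s\u0327")

print(valid)
print(declared)
print(composed)
print(decomposed)
```

If the terms in the graph turn out to contain two code points where one is expected, the conversion step rather than the plotting step would be the first thing to revisit.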

Solution

  • I tried applying iconv() to the characters ş, ğ, and ü with every encoding listed by iconvlist(), but none of them made these characters print correctly on the R console or in the plot. This is the code I used:

    encoded_text <- list()
    encodings <- iconvlist()
    for (i in seq_along(encodings)) {
      tryCatch(
        {
          # Convert the three characters to the i-th encoding and print the result
          encoded_text[[i]] <- iconv(c("ş", "ğ", "ü"), to = encodings[i])
          print(encoded_text[[i]])
        },
        # Report unsupported conversions instead of stopping the loop
        error = function(e) message(conditionMessage(e))
      )
    }

    # To show all the results:
    encoded_text
    

    I also tried utf8_print("ş,ğ,ü") from the utf8 package, but that failed as well.

    Finally, I found the readtext package, which prints these characters properly on the console and in the plot on my computer. However, the current version of the package (v0.81) can only read files, not character vectors. To use it, I typed the characters into Notepad, separated by commas, and saved the file with a .txt extension.

    Then, I used this code to extract these characters:

    library(readtext)
    mytext <- readtext('turkish_text.txt', encoding = 'utf-8')
    mytext <- unlist(strsplit(mytext$text, ","))
    mytext
    #[1] "ş" "ğ" "ü"
    

    They print properly on the console. Next, I tried using them as vertex names in a plot of an igraph object.

    library(igraph)
    adjm <- matrix(1:9, ncol = 3)
    g1 <- graph_from_adjacency_matrix(adjm)
    g1 <- g1 %>% set_vertex_attr("name", value = mytext)
    plot(g1)
    

    In the resulting plot, the characters are printed properly.

    Of course, there is no guarantee that this approach will work for other Turkish characters, but I think it is worth trying.
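
If typing the characters into a text file is inconvenient, one alternative that may be worth testing is to build them from their Unicode code points in base R, so the result does not depend on how the script file or the console encodes the literals. This is a sketch rather than a verified fix for the plotting problem; `code_points` and `labels` are names chosen here for illustration.

```r
# Unicode code points for "ş" (U+015F), "ğ" (U+011F) and "ü" (U+00FC).
code_points <- c(0x015F, 0x011F, 0x00FC)

# multiple = TRUE returns one string per code point instead of one joined string.
labels <- intToUtf8(code_points, multiple = TRUE)

# Round-trip check: each label should map back to its original code point.
round_trip <- vapply(labels, utf8ToInt, integer(1), USE.NAMES = FALSE)

print(labels)
print(round_trip)
```

The resulting `labels` vector could then be passed as `value` to `set_vertex_attr()` in place of `mytext`.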