r, word-cloud

Wordcloud with each line as input in R


I have a file with one column and 190178 lines, a few of which look like this:

anatomical_structure_development
nucleic_acid_binding_transcription_factor_activity
molecular_function
biological_process
biosynthetic_process
cellular_nitrogen_compound_metabolic_process
embryo_development
anatomical_structure_formation_involved_in_morphogenesis
immune_system_process
biosynthetic_process
cellular_nitrogen_compound_metabolic_process
embryo_development

I want to make a wordcloud of this data using the tm and wordcloud packages in R, treating each line as a single term and sizing it by how often that line occurs. I tried following simple examples written for "speech" corpus formats, but that way the word "process" has the highest frequency and gets the largest size, which is not what I want. I want the line with the highest frequency to be the largest.

I used the following code from common examples, but didn't get what I desired:

library(tm)
library(wordcloud)
GO <- Corpus(DirSource("/home/student-a/Desktop/Untitled Folder/"))
wordcloud(GO)

How can I do this?


Solution

  • This works on the example, but with wordcloud2 instead of wordcloud. wordcloud2 is not very fast at drawing, and you need to open the viewer to see the result, but wordcloud gives warnings like this one when the words are too long:

    anatomical_structure_formation_involved_in_morphogenesis could not be fit on page. It will not be plotted.

    Code with wordcloud2:

    library(wordcloud2)
    library(dplyr)
    
    text <- c("anatomical_structure_development",
              "nucleic_acid_binding_transcription_factor_activity",
              "molecular_function",
              "biological_process",
              "biosynthetic_process",
              "cellular_nitrogen_compound_metabolic_process",
              "embryo_development",
              "anatomical_structure_formation_involved_in_morphogenesis",
              "immune_system_process",
              "biosynthetic_process",
              "cellular_nitrogen_compound_metabolic_process",
              "embryo_development")
    
    # wordcloud2 needs a data.frame with a word column and a frequency column.
    # tibble() replaces the deprecated data_frame(); each distinct line is counted as one word.
    df <- tibble(words = text) %>%
      group_by(words) %>%
      summarise(freq = n())
    
    wordcloud2(df)
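
For the real file, the same frequency table can be built without typing the terms in by hand. The following is only a sketch under some assumptions: the temporary file stands in for the actual one-column file (its path and contents here are made up for illustration), and the counting uses base R's readLines() and table() rather than dplyr. The plot is only drawn if wordcloud2 is installed.

```r
# Hypothetical input: a temp file stands in for the real one-column file.
path <- tempfile(fileext = ".txt")
writeLines(c("biosynthetic_process",
             "embryo_development",
             "biosynthetic_process",
             "molecular_function",
             "biosynthetic_process",
             "embryo_development"), path)

# Read one term per line and drop empty lines.
terms <- readLines(path)
terms <- terms[nzchar(terms)]

# table() counts identical lines; as.data.frame() turns the counts
# into a two-column data.frame, renamed here to word / freq.
counts <- as.data.frame(table(terms), stringsAsFactors = FALSE)
names(counts) <- c("word", "freq")

# Most frequent line first.
counts <- counts[order(-counts$freq), ]
print(counts)

# Draw the cloud only if the package is available.
if (requireNamespace("wordcloud2", quietly = TRUE)) {
  wordcloud2::wordcloud2(counts)
}
```

The same counts can also be passed to the wordcloud package as wordcloud(words = counts$word, freq = counts$freq, min.freq = 1). Its default min.freq of 3 would otherwise drop rare lines, and a smaller scale, for example scale = c(2, 0.3), may help long terms fit on the page, though that is not guaranteed.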