
Can't plot Zipf's law in R


I have a big list of terms and their frequencies in a text file, which I read into a table:

myTbl = read.table("word_count.txt")  # read text file 

colnames(myTbl)<-c("term", "frequency")
head(myTbl, n = 10)

> head(myTbl, n = 10)
    term frequency
1     de     35945
2      i     34850
3  \xe3n     19936
4      s     15348
5     cu     13722
6     la     13505
7     se     13364
8     pe     13361
9     nu     12693
10     o     11995

I should probably add a column with word rank and then plot rank against frequency, but how do I do this?


Solution

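  • Rather than rolling your own calculation, it may be easier to use the tm package. First convert myTbl to a term-document matrix (tdm). A real pipeline would have many more clean-up steps; this is simplified.

    library(tm)
    # myTbl already holds counts, so coerce it instead of building from a corpus
    m <- matrix(myTbl$frequency, ncol = 1, dimnames = list(as.character(myTbl$term), "doc1"))
    tdm <- as.TermDocumentMatrix(m, weighting = weightTf)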
    

    Then you can draw not just a Zipf plot but also a Heaps plot:

    Zipf_plot(tdm) 
    Heaps_plot(tdm) # how vocabulary grows as size of text grows
    

    Alternatively, you can use the qdap package and its rank frequency plots. Here is a quote from the vignette:

    Rank Frequency Plots are a way of visualizing word rank versus frequencies as related to Zipf's law which states that the rank of a word is inversely related to its frequency. The rank_freq_mplot and rank_freq_plot provide the means to plot the ranks and frequencies of words (with rank_freq_mplot plotting by grouping variable(s)).
    Rank_freq_mplot utilizes the ggplot2 package, whereas, rank_freq_plot employs base graphics.
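
To answer the question as asked (add a rank column, then plot rank against frequency), here is a minimal base R sketch that needs no extra packages. The small data frame is only a stand-in for myTbl, built from a few rows of the head() output in the question; the rank column name and the axis labels are my own choices, and the lm() fit is an optional extra rather than part of either package mentioned above.

```r
# Stand-in for myTbl: a few rows copied from the head() output in the question
myTbl <- data.frame(
  term = c("de", "cu", "la", "se", "pe", "nu", "o"),
  frequency = c(35945, 13722, 13505, 13364, 13361, 12693, 11995),
  stringsAsFactors = FALSE
)

# Sort by descending frequency, then number the rows to get each word's rank
myTbl <- myTbl[order(-myTbl$frequency), ]
myTbl$rank <- seq_len(nrow(myTbl))

# Plot rank against frequency on log-log axes; data that follow Zipf's law
# fall roughly on a straight line in this view
plot(myTbl$rank, myTbl$frequency, log = "xy",
     xlab = "Rank", ylab = "Frequency", main = "Rank vs. frequency")

# Optional: fit a line in log-log space; the classic form of Zipf's law
# corresponds to a slope near -1 (not expected for this tiny sample)
fit <- lm(log(frequency) ~ log(rank), data = myTbl)
coef(fit)
```

With the full word list, skip the stand-in data frame and run the remaining lines on myTbl as read from word_count.txt. Ties in frequency simply get consecutive ranks here; rank(-myTbl$frequency) would be an alternative if tied words should share a rank.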