Tags: r, twitter, frequency, tm, gephi

Convert large character vector into word frequency matrix for Gephi in R


I want to calculate pairwise word frequencies for a large number of tweets I have collected, so that I can use them for visualization in Gephi (network graph). The current data looks like this (it is a character vector).

head(Tweet_text)
[1] "habits that separates successful persons from mediocre persons habit success startup entrepreneurship"                 
[2] "business entrepreneurship tech watch youtube star casey neistat ride a drone dressed as santa"        
[3] "how to beat procrastination work deadlines workculture productivity management hr hrd entrepreneurship"
[4] "reading on entrepreneurship and startups and enjoying my latte"                                        
[5] "not a very nice way to encourage entrepreneurship and in the same sentence dog another company"        
[6] "us robotics founder on earlyday internet entrepreneurship articles management" 

The structure is as follows:

str(Tweet_text)
 chr [1:8661] "habits that separates successful persons from mediocre persons habit success startup entrepreneurship" ...

In this sample data set, I have 8661 tweets. Now I want to calculate pairwise word frequencies over all these tweets, which I can then export to Gephi. The end result I am looking for is the following:

+------------------------+--------------+------+
| term1                  | term2        | Freq |
+------------------------+--------------+------+
| entrepreneurship       | startup      |  2   |
+------------------------+--------------+------+

So I started to use the DocumentTermMatrix function in the tm package:

dtm <- DocumentTermMatrix(Corpus(VectorSource(Tweet_text)))

This worked (see below the frequency of "success" in the first tweet):

inspect(dtm[1, c("success")])
<<DocumentTermMatrix (documents: 1, terms: 1)>>
Non-/sparse entries: 1/0
Sparsity           : 0%
Maximal term length: 7
Weighting          : term frequency (tf)

    Terms
Docs success
   1       1

After this I tried to put these frequencies in the desired table format with:

m <- as.matrix(dtm)                      # dense: 8661 documents x all terms
m[m >= 1] <- 1                           # binarise: each term counts once per tweet
m <- t(m) %*% m                          # term-by-term co-occurrence counts
Dat_Freq <- as.data.frame(as.table(m))

But now the first problem starts: the matrix is just far too big. Besides that, I do not know how I can restrict the pairwise word frequencies to a specific value, for example keeping only pairs with a frequency > 10, so that the result does not get too large.
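
For what it is worth, one way around both problems is to never build the dense matrix at all. The following is a minimal sketch, assuming the Matrix package and the dtm from above; the cutoff of 10 and the file name word_pairs.csv are illustrative choices:

library(Matrix)

# tm stores the DTM as a slam triplet matrix; rebuild it as a sparse Matrix
sm <- sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
                   dims = c(dtm$nrow, dtm$ncol), dimnames = dimnames(dtm))

sm <- 1 * (sm > 0)   # binarise: each term counts at most once per tweet
co <- crossprod(sm)  # t(sm) %*% sm: sparse term-by-term co-occurrence

# summary() returns the stored entries as triplets (i, j, x); the matrix is
# symmetric, so each unordered pair shows up once
trip <- summary(co)
trip <- trip[trip$x > 10 & trip$i != trip$j, ]  # apply cutoff, drop self-pairs

Dat_Freq <- data.frame(term1 = rownames(co)[trip$i],
                       term2 = colnames(co)[trip$j],
                       Freq  = trip$x)
write.csv(Dat_Freq, "word_pairs.csv", row.names = FALSE)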

Would appreciate your advice on how I can get these pairwise frequencies into a CSV format.

All the best :)


Solution

  • One thing you can do is use the tidytext package.

    Let's say your data are in a data frame called tweets and that text is the variable holding the tweet text:

    library(tidytext)
    library(dplyr)
    
    tweets %>%
       unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
       count(bigram, sort = TRUE) %>%
       head(100)
    

    will give you the 100 most frequent bigrams. Of course, it might be a good idea to remove stopwords first, so take a look at the recipes in the Tidy Text Mining book.
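
    Building on that, here is a sketch of the full pipeline from the tweets to a Gephi-ready CSV. It additionally assumes the tidyr and readr packages; the Freq > 10 cutoff and the file name bigram_pairs.csv are illustrative:

    library(tidytext)
    library(dplyr)
    library(tidyr)
    library(readr)
    
    tweets %>%
       unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
       filter(!is.na(bigram)) %>%                             # tweets too short for a bigram yield NA
       separate(bigram, into = c("term1", "term2"), sep = " ") %>%
       anti_join(stop_words, by = c("term1" = "word")) %>%    # drop pairs where either
       anti_join(stop_words, by = c("term2" = "word")) %>%    # term is a stopword
       count(term1, term2, sort = TRUE, name = "Freq") %>%
       filter(Freq > 10) %>%
       write_csv("bigram_pairs.csv")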