I want to calculate pairwise word frequencies for a large number of tweets I have collected, so that I can use them for visualization in Gephi (as a network graph). The current data looks like this (it is a character vector):
head(Tweet_text)
[1] "habits that separates successful persons from mediocre persons habit success startup entrepreneurship"
[2] "business entrepreneurship tech watch youtube star casey neistat ride a drone dressed as santa"
[3] "how to beat procrastination work deadlines workculture productivity management hr hrd entrepreneurship"
[4] "reading on entrepreneurship and startups and enjoying my latte"
[5] "not a very nice way to encourage entrepreneurship and in the same sentence dog another company"
[6] "us robotics founder on earlyday internet entrepreneurship articles management"
The structure is as follows:
str(Tweet_text)
chr [1:8661] "habits that separates successful persons from mediocre persons habit success startup entrepreneurship" ...
In this sample data set, I have 8661 tweets. Now I want to calculate pairwise word frequencies over all these tweets so that I can export them to Gephi. The end result I am looking for is the following:
+------------------+---------+------+
| term1            | term2   | Freq |
+------------------+---------+------+
| entrepreneurship | startup | 2    |
+------------------+---------+------+
So I started to use the DocumentTermMatrix function in the tm package:
dtm <- DocumentTermMatrix(Corpus(VectorSource(Tweet_text)))
This worked (see below: the frequency of "success" in the first tweet):
inspect(dtm[1, c("success")])
<<DocumentTermMatrix (documents: 1, terms: 1)>>
Non-/sparse entries: 1/0
Sparsity : 0%
Maximal term length: 7
Weighting : term frequency (tf)
Terms
Docs success
1 1
After this I tried to put these frequencies in the desired table format with:
m <- as.matrix(dtm)
m[m >= 1] <- 1          # binarize: count each term at most once per tweet
m <- t(m) %*% m         # term-by-term co-occurrence (m %*% t(m) would give document-by-document)
Dat_Freq <- as.data.frame(as.table(m))
But now the first problem starts: the matrix is just far too big. Besides that, I do not know how I can restrict the pairwise word frequencies to a minimum value. For example, a pair would need to have a frequency > 10, so that the matrix does not get too large.
Would appreciate your advice on how I can get these pairwise frequencies into a CSV format.
All the best :)
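One way to keep the term-by-term product manageable is to never leave sparse representations: tm stores the DocumentTermMatrix as a slam triplet matrix, which converts directly to a sparse Matrix object, and only the non-zero pairs need to be materialized. This is a sketch, not a tested solution — it assumes the dtm object and the Freq > 10 cut-off from the question, and the output file name is made up:

```r
library(tm)
library(Matrix)

# dtm is the DocumentTermMatrix from the question; reuse its triplet slots,
# setting every stored count to 1 so a term counts once per tweet
m <- sparseMatrix(i = dtm$i, j = dtm$j, x = 1,
                  dims = dim(dtm), dimnames = dimnames(dtm))

# term-by-term co-occurrence counts; the product stays sparse
cooc <- crossprod(m)

# pull out only the non-zero (i, j, x) triplets instead of a dense table
trip  <- summary(cooc)
terms <- colnames(m)

# order each pair alphabetically so it does not matter which triangle
# of the symmetric matrix was stored, then drop diagonal and duplicates
Dat_Freq <- unique(data.frame(term1 = pmin(terms[trip$i], terms[trip$j]),
                              term2 = pmax(terms[trip$i], terms[trip$j]),
                              Freq  = trip$x))
Dat_Freq <- subset(Dat_Freq, term1 != term2 & Freq > 10)

write.csv(Dat_Freq, "pairwise_freq.csv", row.names = FALSE)
```

The threshold is applied after the sparse product, so the dense 8661-tweet term-by-term table is never built in full.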
Another thing you can do is use the tidytext package.
Let's say that your data are in a dataframe called tweets and text is the corresponding variable:
library(tidytext)
library(dplyr)
tweets %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE) %>%
  head(100)
will give you the 100 most frequent bigrams. Of course, it might be a good idea to remove stopwords first, so take a look at the recipes in the Tidy Text Mining book.
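Since the question asks for co-occurrence of word pairs within the same tweet (not just adjacent bigrams), the widyr package fits the same tidy workflow. A sketch along those lines — assuming the same tweets dataframe with a text column; the id column and the output file name are made up:

```r
library(dplyr)
library(tidytext)
library(widyr)

pairs <- tweets %>%
  mutate(tweet_id = row_number()) %>%              # one id per tweet
  unnest_tokens(word, text) %>%                    # one row per word per tweet
  pairwise_count(word, tweet_id, sort = TRUE) %>%  # co-occurrence within tweets
  filter(n > 10) %>%                               # the frequency threshold
  rename(term1 = item1, term2 = item2, Freq = n)

write.csv(pairs, "pairwise_freq.csv", row.names = FALSE)
```

Note that pairwise_count returns each pair in both orders by default, which may actually be convenient for an undirected Gephi edge list; otherwise deduplicate before exporting.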