I have a corpus of 11 text documents. I have found word associations using the commands:
findAssocs(dtm, c("youngster","campaign"), corlimit=0.9)
findAssocs(dtms, "corruption", corlimit=0.9)
dtm is a document term matrix.
dtm <- DocumentTermMatrix(docs)
where docs is the corpus.
dtms is the document term matrix after removing sparse terms, with a maximal allowed sparsity of 0.1.
dtms <- removeSparseTerms(dtm, 0.1)
I would like to plot the correlated terms I found against (i) two specific words and (ii) one specific word. I tried following this post: Plot highly correlated words against a specific word of interest
toi <- "corruption" # term of interest
corlimit <- 0.9 # lower correlation bound limit.
cor_0.9 <- data.frame(corr = findAssocs(dtm, toi, corlimit)[,1],terms=row.names(findAssocs(dtm, toi, corlimit)))
Unfortunately, the last line (the one that builds cor_0.9) gives me an error:
Error in findAssocs(dtm, toi, corlimit)[, 1]:incorrect number of dimensions
This is the structure of the document term matrix:
<<DocumentTermMatrix (documents: 11, terms: 1847)>>
Non-/sparse entries: 8024/12293
Sparsity : 61%
Maximal term length: 23
Weighting : term frequency (tf)
and in the environment it has the form:
dtm List of 6
i: int [1:8024] 1 1 1 1 1 ...
j: int [1:8024] 17 29 34 43 47 ...
v: num [1:8024] 9 4 9 5 5 ...
nrow : int 11
ncol : int 1847
dimnames: list of 2
...$ Docs : chr [1:11] "character (0)" "character (0)" "character (0)"
...$ Terms: chr [1:1847] "campaigning"|__truncated__"a"|__"truncated"__
attr(*,"class") = chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"...
attr(*,"weighting") = chr [1:2] "term frequency" "tf"
How do I plot word correlations for a single word and multiple words? Please help.
Here is the output of
findAssocs(dtm, c("youngster","campaign"), corlimit=0.9)
character colleges controversi expect corrupt much
1.00 1.00 1.00 1.00 0.99 0.99
okay saritha existing leads satisfi social
0.99 0.99 0.98 0.98 0.98 0.98
basic make lack internal general method satisfied time
0.95 0.95 0.94 0.93 0.92 0.92 0.92 0.92
A slightly different approach is required for two words. Here's a quick attempt:
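library(tm)
library(dplyr)    # full_join()
library(tidyr)    # gather()
library(ggplot2)
data("crude")     # example corpus shipped with tm
tdm <- TermDocumentMatrix(crude)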
# Compute correlations and store in data frame...
toi1 <- "oil" # term of interest
toi2 <- "winter"
corlimit <- 0.7 # lower correlation bound limit.
corr1 <- findAssocs(tdm, toi1, corlimit)[[1]]
corr1 <- cbind(read.table(text = names(corr1), stringsAsFactors = FALSE), corr1)
corr2 <- findAssocs(tdm, toi2, corlimit)[[1]]
corr2 <- cbind(read.table(text = names(corr2), stringsAsFactors = FALSE), corr2)
# join them together on the term column (V1, created by read.table)
two_terms_corrs <- full_join(corr1, corr2, by = "V1")
# gather for plotting
two_terms_corrs_gathered <- gather(two_terms_corrs, term, correlation, corr1:corr2)
# insert the actual terms of interest so they show up on the legend
two_terms_corrs_gathered$term <- ifelse(two_terms_corrs_gathered$term == "corr1", toi1, toi2)
# Draw the plot...
ggplot(two_terms_corrs_gathered, aes(x = V1, y = correlation, colour = term)) +
  geom_point(size = 3) +
  ylab(paste0("Correlation with the terms ", "\"", toi1, "\"", " and ", "\"", toi2, "\"")) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
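For the single-word case, the "incorrect number of dimensions" error in the question is consistent with findAssocs() returning a named list (one named numeric vector per term) rather than a matrix, so [, 1] cannot be applied to it. Here is a minimal sketch under that assumption, using the crude example data instead of your corpus. The names assocs and one_term_corrs are just illustrative:

```r
library(tm)
library(ggplot2)

data("crude")                      # example corpus shipped with tm
tdm <- TermDocumentMatrix(crude)

toi <- "oil"       # term of interest
corlimit <- 0.7    # lower correlation bound limit

# Assumption: findAssocs() returns a named list; [[1]] pulls out the
# named numeric vector of correlations for the single term of interest.
assocs <- findAssocs(tdm, toi, corlimit)[[1]]

# One row per associated term (illustrative data frame name).
one_term_corrs <- data.frame(
  term = names(assocs),
  correlation = as.numeric(assocs),
  stringsAsFactors = FALSE
)

# Draw the plot, ordering terms by their correlation.
ggplot(one_term_corrs, aes(x = reorder(term, correlation), y = correlation)) +
  geom_point(size = 3) +
  xlab("Term") +
  ylab(paste0("Correlation with the term \"", toi, "\"")) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```

With your own data you would swap tdm for dtm and set toi to "corruption". If no terms reach the correlation limit, assocs will be empty and there is nothing to plot, so you may need to lower corlimit.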