So I have 3 dataframes in R, each containing words and the frequency with which each word appears in a document (each dataframe represents one document). I am creating an app in R Shiny where users can search words and it returns the PDFs which contain the word. I would like to add functionality where the user is provided with recommended words based on the other dataframes.
An example:
Let's say the user enters the word "examination". The word "examination" exists in two of the dataframes, so the app recommends words from those dataframes, and this process repeats so you can find the best words possible given the dataframes we have. I was hoping there is a package which could do this, or alternatively an approach such as PCA or LDA/QDA.
Any ideas?
Here are the 3 dataframes to try (top 20 entries only):
df1 <- structure(list(word = c("data", "summit", "research", "program",
"analysis", "study", "evaluation", "minority", "federal", "department",
"statistical", "experience", "business", "design", "education",
"response", "sampling", "learning", "project", "review"), n = c(213L,
131L, 101L, 98L, 90L, 84L, 82L, 82L, 76L, 72L, 65L, 63L, 60L,
58L, 58L, 58L, 55L, 50L, 50L, 46L)), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
df2 <- structure(list(word = c("regression", "sampling", "research", "forecast",
"analysis", "development", "disparity", "firms", "impact", "office",
"statistical", "experience", "sample", "support", "consulting",
"provide", "contract", "technology", "result", "system"), n = c(113L,
89L, 76L, 24L, 20L, 20L, 19L, 16L, 26L, 10L, 9L, 4L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L)), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
df3 <- structure(list(word = c("knowledge", "veteran", "association", "compliance",
"random", "safety", "treatment", "analyst", "legal", "welfare",
"selection", "solicitation", "tasks", "personnel", "student",
"estimating", "investigation", "multivariate", "result", "system"), n = c(302L,
300L, 279L, 224L, 199L, 180L, 156L, 112L, 101L, 100L, 100L, 67L, 56L,
55L, 55L, 54L, 23L, 23L, 22L, 11L)), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Ideally I would like R to return words with a high probability of being in the same document as the one you have already entered.
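To illustrate, a naive version of what I have in mind could look like the sketch below — just a simple co-occurrence lookup over the dataframes (`recommend_words` is a name I made up, and this is exactly the simplistic approach I'd like to improve on):

```r
# Naive co-occurrence lookup (base R only): given a search term, keep
# only the dataframes whose documents contain it, then suggest the most
# frequent other words pooled from those dataframes.
recommend_words <- function(term, dfs, top_n = 5) {
  hits <- Filter(function(d) term %in% d$word, dfs)   # dfs containing the term
  if (length(hits) == 0) return(character(0))
  pooled <- do.call(rbind, hits)                      # stack the matching dfs
  pooled <- pooled[pooled$word != term, ]             # drop the search term itself
  agg <- aggregate(n ~ word, data = pooled, FUN = sum)  # sum counts per word
  head(agg$word[order(-agg$n)], top_n)                # most frequent words first
}

# e.g. recommend_words("sampling", list(df1, df2, df3)) with the dfs above
```

For the dataframes above, searching "sampling" would pool df1 and df2 (the two that contain it) and return their most frequent other words. I'm hoping for something smarter than this raw frequency pooling.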
> Ideally I would like R to return words with a high probability of being in the same document as the one you have already entered.
If you are looking for just word co-occurrence or similarity, you may want to look at Word2Vec or Wikipedia2Vec; there are some fascinating things you can do with texts using vector-based methods.
But, given your comment above about not using word counts:

> what I am asking is if a user enters a word, I would like to provide words that could also be helpful. This means returning words with a high likelihood from the PDFs that the word they searched for is in
I think what you want might be different. I interpret your question as follows: a user has a word "orange" and wants to know which documents contain related concepts, such as "tree", "juice", or "California".
If you are looking for similarity between documents, you are describing a use case for a topic model. Latent Dirichlet Allocation, the most basic topic model, is also abbreviated LDA, but it is not the same thing as the discriminant analysis you mention.
LDA Intuition
You can think of LDA as PCA for unstructured textual data.
It extracts the "latent" topics from a document. I won't go into details here, but essentially it checks which words keep popping up together in different documents, and then groups them into "topics".
Example: documents about oranges are more likely to contain words like "tree" or "juice", while documents about cars will likely contain "gasoline" and "motor". If you use a large enough collection of texts, you will be able to distinguish documents about pickups from documents about orange juice using some similarity measure (I'd go for soft cosine similarity), and you will be able to tell that an article on the transportation costs of oranges is about both. Importantly, LDA also assigns words to topics, so "orange" would have a high loading on the "oranges and related stuff" topic and a low loading on the cars topic, since the color is probably discussed less with cars, and articles about orange logistics are likely rare.
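To make the similarity step concrete, here is plain cosine similarity between two documents' topic distributions (soft cosine similarity, mentioned above, additionally weights pairs of terms by their relatedness; the loadings below are made up purely for illustration):

```r
# Plain cosine similarity between two vectors, e.g. two documents'
# topic-loading vectors produced by a fitted topic model.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical loadings over 3 topics (oranges, other, transport):
doc_oranges   <- c(0.80, 0.15, 0.05)   # mostly the "oranges" topic
doc_logistics <- c(0.45, 0.10, 0.45)   # split between oranges and transport
cosine_sim(doc_oranges, doc_logistics)
```

Two documents with identical topic mixtures score 1, fully disjoint mixtures score 0, and the mixed "orange logistics" document above sits in between, closer to the oranges document than a pure car document would be.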
Implementation (very rough guideline)
Assuming what I am assuming about your project (mainly that this is what you want, that you have the original documents, and that the split into three dataframes does not matter), one way to go about it is: build a document-term matrix from the original documents, fit an LDA model on it, and use the fitted word-topic loadings to suggest related words.
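A rough sketch of those steps with tm and topicmodels, using a tiny toy corpus in place of your extracted PDF text (the texts, k = 2, and the suggest helper are all illustrative choices, not a definitive implementation):

```r
library(tm)           # corpus handling and document-term matrix
library(topicmodels)  # LDA

# Toy corpus standing in for the text extracted from your PDFs
texts <- c("orange juice tree orange fruit",
           "car motor gasoline car engine",
           "orange tree fruit juice harvest",
           "gasoline engine motor repair car")
docs <- VCorpus(VectorSource(texts))
dtm  <- DocumentTermMatrix(docs)

# Fit an LDA model; k (the number of topics) is a made-up choice here
fit <- LDA(dtm, k = 2, control = list(seed = 1234))

# For a searched word, return the top words of the topic it loads on most
suggest <- function(fit, word, n_terms = 5) {
  phi <- posterior(fit)$terms          # topic x term probability matrix
  if (!word %in% colnames(phi)) return(character(0))
  topic <- which.max(phi[, word])      # topic the word loads on most
  top <- names(sort(phi[topic, ], decreasing = TRUE))
  head(setdiff(top, word), n_terms)
}

suggest(fit, "orange")
```

In a real app you would fit the model once on your full corpus and call something like suggest() from the Shiny server function; posterior(fit)$topics additionally gives you the per-document topic mixtures for the document-similarity use case.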
Application in the Wild
You can check out the JSTOR Text Analyzer, which does exactly what I interpret your use case to be: you upload a document and it returns similar documents. (It uses LDA.)
Packages: e.g. lda or topicmodels; there are others that have this functionality as well.
(Side note: The acronym LDA is the reason I found this post by accident...)