Tags: r, dataframe, probability, pca, lda

Creating a list of suggested words based on frequency of words in multiple dataframes


So I have 3 dataframes in R, each with words and the frequency with which each word appears in a document (which the dataframe represents). I am creating an app in R Shiny where users can search for words and it returns the PDFs which contain the word. I would like to add functionality where the user is provided with recommended words based on the other dataframes.

An example:

So let's say the user enters the word "examination". The word "examination" exists in two of the dataframes, so it recommends words from those dataframes, and this process repeats so you can find the best words possible given the dataframes we have. I was hoping there is a package which could do this, or alternatively that it could be implemented with PCA or LDA/QDA.

Any ideas?

Here are the 3 dataframes to try (only the top 20 entries of each):

df1 <- structure(list(word = c("data", "summit", "research", "program", 
"analysis", "study", "evaluation", "minority", "federal", "department", 
"statistical", "experience", "business", "design", "education", 
"response", "sampling", "learning", "project", "review"), n = c(213L, 
131L, 101L, 98L, 90L, 84L, 82L, 82L, 76L, 72L, 65L, 63L, 60L, 
58L, 58L, 58L, 55L, 50L, 50L, 46L)), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

df2 <- structure(list(word = c("regression", "sampling", "research", "forecast", 
"analysis", "development", "disparity", "firms", "impact", "office", 
"statistical", "experience", "sample", "support", "consulting", 
"provide", "contract", "technology", "result", "system"), n = c(113L, 
89L, 76L, 24L, 20L, 20L, 19L, 16L, 26L, 10L, 9L, 4L, 2L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L)), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

df3 <- structure(list(word = c("knowledge", "veteran", "association", "compliance", 
"random", "safety", "treatment", "analyst", "legal", "welfare", 
"selection", "solicitation", "tasks", "personnel", "student", 
"estimating", "investigation", "multivariate", "result", "system"), n = c(302L, 
300L, 279L, 224L, 199L, 180L, 156L, 112L, 101L, 100L, 100L, 67L, 56L, 
55L, 55L, 54L, 23L, 23L, 22L, 11L)), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))


Ideally I would like R to return words with a high probability of being in the same document as the one you have already entered.


Solution

  • Ideally I would like R to return words with a high probability of being in the same document as the one you have already entered.

    If you are looking for just word co-occurrence or similarity, you may want to look at Word2Vec or Wikipedia2Vec; there are some fascinating things you can do to texts with vector-based methods.
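    If that is the direction you want, a minimal sketch with the word2vec CRAN package might look like this. It assumes you still have the raw text of the PDFs; the file names are placeholders, and with only three documents the embeddings will be rough (a larger corpus helps):

    library(pdftools)   # read the PDFs back into plain text
    library(word2vec)   # CRAN wrapper around word2vec

    # hypothetical file names - replace with your own PDFs
    pdf_files <- c("doc1.pdf", "doc2.pdf", "doc3.pdf")
    texts <- vapply(pdf_files,
                    function(f) paste(pdf_text(f), collapse = " "),
                    character(1))

    # train a small embedding model on the documents
    model <- word2vec(x = tolower(texts), type = "cbow", dim = 50, iter = 20)

    # nearest neighbours of the word the user typed
    predict(model, newdata = "examination", type = "nearest", top_n = 10)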

    But: given your comment above about not using word counts

    what I am asking is: if a user enters a word, I would like to provide words that could also be helpful. This means returning words with a high likelihood from the pdfs that the word they searched for is in

    I think what you want might be different. I interpret your question as: a user has a word, "orange", and wants to know which documents contain related concepts, such as "tree", "juice" or "California".

    If you are looking for similarity between documents, you are describing a use case for a topic model. Latent Dirichlet Allocation, the most basic topic model, is also abbreviated LDA, but it is not the same as the linear discriminant analysis you mentioned.

    LDA Intuition

    You can think of LDA as PCA for unstructured textual data.

    It extracts the "latent" topics from a document. I won't go into details here, but essentially it checks which words keep popping up together across different documents and then groups them into "topics".

    Example: Documents about oranges are more likely to contain words like "tree" or "juice", while documents about cars will likely contain "gasoline" and "motor". If you use a large enough collection of texts, you will be able to tell documents about pickups apart from documents about orange juice using some similarity measure (I'd go for soft cosine similarity), and you will be able to tell that an article on the transportation costs of oranges is about both.

    Importantly, LDA also assigns words to topics, so "orange" would have a high loading on the "oranges and related stuff" topic and a low loading on the cars topic - since the color is probably discussed less with cars, and articles about orange logistics are likely rare.
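    To make the "loading" idea concrete, here is a toy sketch with the tm and topicmodels packages; the three mini documents are made up for illustration:

    library(tm)
    library(topicmodels)

    # a made-up mini corpus
    docs <- c("orange juice tree fruit orange harvest",
              "car motor gasoline engine car road",
              "orange transport truck cost logistics")

    dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))

    # fit a 2-topic LDA model
    fit <- LDA(dtm, k = 2, control = list(seed = 123))

    terms(fit, 5)            # top 5 words per topic
    posterior(fit)$terms     # per-topic word probabilities (the "loadings")
    posterior(fit)$topics    # per-document topic probabilities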

    Implementation (very rough guideline)

    Given my assumptions about your project (mainly that this is what you want, that you have the original documents, and that the split into three dataframes does not matter), this is one way to go about it:

    1. Take your documents and run LDA on them
    2. Write some code that takes a word as input and then returns a few (say 2-5) topics into which this word loads and, say, the top 10 words from within each of those topics. Alternatively, use a distance measure (e.g. Hellinger distance) to compare the documents. A rough sketch of both steps follows this list.
    3. Done. You could also let users take an entire document and let the algorithm find similar ones (see below).
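    A very rough sketch of steps 1 and 2, assuming the original documents are available as PDFs; the file names, the number of topics, and the helper function suggest_words are all placeholders for illustration:

    library(pdftools)
    library(tm)
    library(topicmodels)

    # 1. read the original PDFs and fit LDA (file names are placeholders)
    pdf_files <- c("doc1.pdf", "doc2.pdf", "doc3.pdf")
    texts <- vapply(pdf_files,
                    function(f) paste(pdf_text(f), collapse = " "),
                    character(1))

    dtm <- DocumentTermMatrix(Corpus(VectorSource(texts)),
                              control = list(tolower = TRUE,
                                             removePunctuation = TRUE,
                                             stopwords = TRUE))
    fit <- LDA(dtm, k = 5, control = list(seed = 123))   # pick k to taste

    # 2. given a word, return the topics it loads into and their top words
    suggest_words <- function(word, model, n_topics = 3, n_words = 10) {
      beta <- posterior(model)$terms                    # topic-by-word probabilities
      if (!word %in% colnames(beta)) return(character(0))
      top_topics <- order(beta[, word], decreasing = TRUE)[seq_len(n_topics)]
      unique(as.vector(terms(model, n_words)[, top_topics]))
    }

    suggest_words("examination", fit)

    # alternative for step 2: pairwise Hellinger distances between the
    # documents' topic distributions
    distHellinger(posterior(fit)$topics, posterior(fit)$topics)

    The last call shows the distance-based alternative: documents whose topic distributions have a small Hellinger distance should be thematically close.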

    Application in the Wild

    You can check out the JSTOR Text Analyzer, which does exactly what I interpret your use case to be: you upload a document and it returns similar documents. (It uses LDA.)

    Packages: e.g. lda or topicmodels; there are others that have this functionality as well.

    (Side note: The acronym LDA is the reason I found this post by accident...)