How to match similar documents in R


I have created two corpora: one containing tweet texts and another containing company names. I want to find out which companies are mentioned in which tweets.

Example document of a tweet:

> writeLines(as.character(tweet_corp[[175]]))
general motor send mexican made model chevi cruze us car dealer tax free across border make usaor pay big border tax

Example document of a company:

> writeLines(as.character(company_corp[[1397]]))
general motor

I would like output that matches tweet_corp[[175]] with company_corp[[1397]]. Is there a way to do this?


Solution

  • You could use the stringr package to check whether a company name occurs in a tweet, e.g.

    library(stringr)
    
    company_name <- "general motor"
    
    tweet <- "general motor send mexican made model chevi cruze us car dealer tax free across border make usaor pay big border tax"
    
    # check whether a company name occurs in a string
    str_detect(
      string = tweet,
      pattern = coll(company_name)
    )
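
To go from a single pair to every tweet and every company, the same check can be run once per company name. The following is only a minimal sketch: the `tweets` and `companies` vectors are made-up sample data, and it assumes the documents have already been pulled out of the corpora as plain character vectors (for example with something like `sapply(tweet_corp, as.character)`). It uses `fixed()` for plain substring matching.

```r
library(stringr)

# Hypothetical sample data standing in for the two corpora
tweets <- c(
  "general motor send mexican made model chevi cruze us car dealer tax free across border",
  "toyota motor said build new plant",
  "nice weather today"
)
companies <- c("general motor", "toyota motor", "ford")

# Logical matrix: one row per tweet, one column per company,
# TRUE where the company name occurs as a literal substring of the tweet
hits <- sapply(companies, function(name) str_detect(tweets, fixed(name)))

# Turn the TRUE cells into (tweet index, company name) pairs
matches <- which(hits, arr.ind = TRUE)
result <- data.frame(
  tweet   = matches[, "row"],
  company = companies[matches[, "col"]]
)

print(result)
```

Note that this is substring matching, so a short name such as "ford" would also match inside a longer word. If that matters, a regular expression with word boundaries may be a better fit than `fixed()`.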