I'm trying to create clusters from data based on the string value of each row, using the R language. What I call a "cluster" is a broad theme (a family) that describes each keyword. I imagine something generated automatically from the keyword itself, maybe by using lemmatization or n-grams.
For example, both keywords "cloud services" and "the cloud service" should be in the "service" cluster.
Here is my input vector:
keywords_df <- c("cloud storage", "cloud computing", "google cloud storage", "the cloud service",
"free cloud storage", "what is cloud computing", "best cloud storage","cloud computing definition",
"amazon cloud services", "cloud service providers", "cloud services", "google cloud computing", "cloud computing services", "benefits of cloud computing")
Here is the expected output dataframe:
| Keyword | Thematic |
|---------------------------|:---------:|
|cloud storage |storage |
|cloud computing |computing|
|google cloud storage |storage |
|the cloud service |service |
|free cloud storage |storage |
|what is cloud computing |computing|
|best cloud storage |storage |
|cloud computing definition |computing|
|amazon cloud services |service |
|cloud service providers |service |
|cloud services |service |
|google cloud computing |computing|
|cloud computing services |service |
|benefits of cloud computing|computing|
The goal is to clean up the data in the "Keyword" column and automatically extract a kind of lemma or n-gram.
Here is what I have done so far:
Create the "Thematic" column based on the "Keyword" column:
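library(dplyr)
# keywords_df is a character vector above, so wrap it in a data frame first
keywords_df <- data.frame(Keyword = keywords_df, stringsAsFactors = FALSE)
keywords_df <- mutate(keywords_df, Thematic = Keyword)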
Remove stopwords:
library(tm)
custom_stopwords <- c("cloud")  # remove the main word
# Use a new name so the stopwords() function is not shadowed
all_stopwords <- c(stopwords(kind = "en"), custom_stopwords)
keywords_df$Thematic <- removeWords(keywords_df$Thematic, all_stopwords)
You can check for the presence of certain words such as `storage`, `computing` and `service` by using `grepl()`. This way, you can check for the presence of a given word in `df`:
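# df is assumed to be the character vector of keywords from the question
fams <- c("storage", "computing", "service")
family <- rep("empty_fam", length(df))
for (fam in fams) {
  # later entries in fams overwrite earlier ones when a keyword matches several
  family[grepl(fam, df)] <- fam
}
cbind(df, family)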
# df family
# [1,] "cloud storage" "storage"
# [2,] "cloud computing" "computing"
# ...
#[13,] "cloud computing services" "service"
#[14,] "benefits of cloud computing" "computing"
There are certainly nicer ways of doing this, though.
Edit: A nicer way to do it, using the `stringr` package:
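library(stringr)
# str_extract() returns the first match in each string, so a keyword containing
# several family words (e.g. "cloud computing services") gets the leftmost one
family <- str_extract(df, pattern = "storage|computing|service")
cbind(df, family)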
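If you would rather not depend on an extra package, the same first-match extraction can be sketched in base R with `regexpr()` and `regmatches()`. This is only a minimal sketch: the four keywords below are an illustrative subset ("cloud pricing" is a made-up keyword added to show the no-match case), and keywords matching none of the family words are left as `NA`.

```r
# Illustrative subset of keywords; "cloud pricing" is a hypothetical no-match case
df <- c("cloud storage", "the cloud service",
        "cloud computing services", "cloud pricing")

pattern <- "storage|computing|service"

# regexpr() gives the position of the first match in each string, or -1 if none
hits <- regexpr(pattern, df)

# Start with NA everywhere, then fill in only the keywords that matched;
# regmatches() returns the matched substrings for the matching elements only
family <- rep(NA_character_, length(df))
family[hits > 0] <- regmatches(df, hits)

data.frame(Keyword = df, family = family)
```

As with `str_extract()`, a keyword containing several family words is assigned the leftmost one.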
Edit 2: I see from your latest edit that you are looking for family descriptions that are not pre-specified. The first method I think of in such a case is Latent Dirichlet Allocation (LDA, not to be confused with Linear Discriminant Analysis).
LDA analyzes a corpus of documents, identifies latent topics as distributions over words (obtained with `terms(lda.output)` below), and identifies which document belongs to which topic (obtained with `topics(lda.output)` below):
library(topicmodels)
library(tm)
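# Preliminary text mining (df is the character vector of keywords)
corpus <- Corpus(VectorSource(df))
corpus <- tm_map(corpus, removeWords, "cloud")
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)  # requires the SnowballC package
# Term-frequency matrix
TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
# LDA is randomly initialised, so topic numbering and assignments can vary between runs
lda.output <- LDA(TF, k = 3)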
terms(lda.output)
# Topic 1 Topic 2 Topic 3
# "servic" "comput" "storag"
cbind(df, terms(lda.output)[topics(lda.output)])
# df
#Topic 3 "cloud storage" "storag"
#Topic 2 "cloud computing" "comput"
#Topic 3 "google cloud storage" "storag"
#Topic 1 "cloud services" "servic"
#Topic 3 "free cloud storage" "storag"
#Topic 2 "what is cloud computing" "comput"
#Topic 3 "best cloud storage" "storag"
#Topic 1 "cloud computing definition" "servic"
#Topic 1 "amazon cloud services" "servic"
#Topic 3 "cloud service providers" "storag"
#Topic 2 "google cloud services" "comput"
#Topic 2 "google cloud computing" "comput"
#Topic 1 "cloud computing services" "servic"
#Topic 2 "benefits of cloud computing" "comput"
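To make it clearer what LDA works on, here is a rough base R sketch of a term-frequency matrix like the one `DocumentTermMatrix()` builds: one row per document, one column per term, and each cell counting how often the term occurs. The four short documents and the names `docs`, `vocab` and `tf` are assumptions for illustration only; this does not reproduce the internals of the `tm` package.

```r
# Hypothetical documents, as they might look once "cloud" has been removed
docs <- c("storage", "computing", "google storage", "computing services")

# Split each document into its words
tokens <- strsplit(docs, " +")

# The vocabulary is the sorted set of all distinct words
vocab <- sort(unique(unlist(tokens)))

# Count every vocabulary word in every document (rows = documents)
tf <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
dimnames(tf) <- list(docs, vocab)

tf
```

LDA then tries to explain these counts with a small number of topics, each of which puts most of its weight on a few terms.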
Final note: If you wish to get "computing" instead of "comput" etc., you should change the stemming step in the text mining. You can also leave it out, but then "service" and "services" will not be recognised as the same word. You could, however, manually replace "service" with "services" or vice versa.
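Both options can be sketched in a few lines of base R. The first merges the plural into the singular by hand instead of stemming; the second keeps stemming and maps the stems back to readable labels through a small lookup table. The values below and the names `normalised`, `stem_labels` and `readable` are made up for illustration, and the lookup table has to be maintained by hand.

```r
# Hypothetical example values, not output from the model above
df <- c("cloud services", "cloud service providers", "amazon cloud services")

# Option 1: skip stemming and merge the plural into the singular by hand.
# \\b matches a word boundary, so only the whole word "services" is replaced.
normalised <- gsub("\\bservices\\b", "service", df, perl = TRUE)

# Option 2: keep stemming and translate the stems back to readable labels
stem_labels <- c(storag = "storage", comput = "computing", servic = "service")
stems <- c("storag", "servic", "comput")
readable <- unname(stem_labels[stems])

normalised
readable
```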