
How to perform stemming and lemmatization in R?


I am processing text data that contains strings like the one below:

"significant step towards large scale hydrogen production iisc team collaboration jncasr researcher develop low cost catalyst speed split water generate hydrogen gas"

To reduce the words in the text to their correct base forms, stemming or lemmatization is needed. I have tried both, but neither gives the desired output:

stemDocument(p[1], language = "english")

[1] "signific step toward larg scale hydrogen product iisc team collabor jncasr research develop low cost catalyst speed split water generat hydrogen gas"

lemmatize_strings(p[1], dictionary = lexicon::hash_lemmas)

[1] "significant step towards large scale hydrogen production iisc team collaboration jncasr researcher develop low cost catalyst speed split water generate hydrogen gas"

How can I get output like this?

significant step toward large scale hydrogen produce iisc team collaborate jncasr research develop low cost catalyst speed split water generate hydrogen gas


Solution

  • It would help to name the packages you are using. To lemmatize and then stem the text, you could combine the following two packages, udpipe and tm:

    library(udpipe)
    
    # The first call takes a minute to download the English model
    x <- udpipe(x = "significant step towards large scale hydrogen production iisc team 
                collaboration jncasr researcher develop low cost catalyst 
                speed split water generate hydrogen gas",
                object = "english")
    
    

    This returns a data frame with one row per token, including the token, its lemma, and other annotations that you can use in your analysis.

     x$lemma
     [1] "significant"   "step"          "towards"       "large"         "scale"         "hydrogen"      "production"   
     [8] "iisc"          "team"          "collaboration" "jncasr"        "researcher"    "develop"       "low"          
    [15] "cost"          "catalyst"      "speed"         "split"         "water"         "generate"      "hydrogen"     
    [22] "gas" 
    
    

    To stem the words, you could use the tm package. For example, to stem the lemmas obtained above:

    library(tm)
    tm::stemDocument(x$lemma)
    
    

    This gives the following:

    [1] "signific" "step"     "toward"   "larg"     "scale"    "hydrogen" "product"  "iisc"     "team"     "collabor"
    [11] "jncasr"   "research" "develop"  "low"      "cost"     "catalyst" "speed"    "split"    "water"    "generat" 
    [21] "hydrogen" "gas"
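
If you need a single string again rather than a vector of stems, the stems can be pasted back together. The following is a minimal sketch, not part of the original answer. It assumes the SnowballC package is installed (tm relies on it for stemming) and calls its `wordStem()` function directly. The `lemmas` vector is copied by hand from the `x$lemma` output above so that the example runs without downloading the udpipe model, and the variable names are illustrative.

```r
# SnowballC provides the Snowball stemmer used for stemming in tm.
library(SnowballC)

# Lemmas copied from the x$lemma output above (assumption: hard-coded here
# so that no udpipe model download is needed for this sketch).
lemmas <- c("significant", "step", "towards", "large", "scale", "hydrogen",
            "production", "iisc", "team", "collaboration", "jncasr",
            "researcher", "develop", "low", "cost", "catalyst", "speed",
            "split", "water", "generate", "hydrogen", "gas")

# Stem each word individually; the result has one stem per input word.
stems <- wordStem(lemmas, language = "english")

# Collapse the stems back into one space-separated string.
stemmed_sentence <- paste(stems, collapse = " ")
print(stemmed_sentence)
```

Note that neither step maps "production" to "produce" or "researcher" to "research" in the way the question asks: lemmatization keeps these words as they are, and stemming truncates them to forms such as "product".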