nlp nltk stemming porter-stemmer

NLP: Stemming on opcodes data set


I have a dataset of 27 files, each containing opcodes. I want to use stemming to map all variants of similar opcodes to the same opcode. For example, push, pusha, pushb, etc. would all be mapped to push; addf and addi to add; multi and multf to mult; and so on. How can I do this? I tried using PorterStemmer from NLTK, but it does not work on my dataset. I think it works only on natural-language words (e.g., played, playing --> play) and not on opcodes (e.g., pusha, pushb --> push).


Solution

  • I don't think stemming is what you want to do here. Stemmers are language-specific and are based on the common inflectional morphological patterns of that language. For example, in English, the base form of a verb (e.g., "walk") is inflected for tense, aspect, and person/number: I walk vs. she walks (walk+s), I walk vs. I walked (walk+ed), also walk+ing, etc. Stemmers codify these regular patterns into "rules" that are then applied to a "word" to reduce it to its stem. Opcode suffixes such as the "a" in pusha do not follow English morphology, so those rules never fire. In other words, an off-the-shelf stemmer does not exist for your opcodes.
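To make the point above concrete, here is a deliberately tiny suffix stripper in Python. It is not the Porter algorithm; the rule set and the name `toy_english_stem` are made up for illustration. It only shows that rules written for English inflection leave opcode variants untouched.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration.
# The suffix list is a hypothetical, deliberately tiny English rule set.
ENGLISH_SUFFIXES = ("ing", "ed", "s")


def toy_english_stem(word):
    """Strip the first matching English suffix, keeping at least 3 characters."""
    for suffix in ENGLISH_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    # No rule applies, so the word is returned unchanged.
    return word


for word in ["walks", "walked", "walking", "pusha", "addi", "multf"]:
    print(word, "->", toy_english_stem(word))
```

The English forms all reduce to walk, while pusha, addi, and multf come back unchanged because none of the rules describes how opcodes are formed.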

    You have two possible solutions: (1) create a dictionary or (2) write your own stemmer. If you don't have too many variants to map, it is probably quickest to create a custom dictionary with every word variant as a key and its lemma/stem/canonical form as the value.

    addi -> add
    addf -> add
    multi -> mult
    multf -> mult
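Since the question is about Python, the mapping above could be sketched as a plain dict. The names `OPCODE_STEMS` and `normalize_opcode` are hypothetical, and the push entries are taken from the question rather than from a real opcode list. Unknown opcodes are passed through unchanged, which is an assumption about the desired behavior.

```python
# Hypothetical lookup table: every known variant maps to its canonical opcode.
OPCODE_STEMS = {
    "addi": "add",
    "addf": "add",
    "multi": "mult",
    "multf": "mult",
    "pusha": "push",
    "pushb": "push",
}


def normalize_opcode(opcode):
    """Return the canonical form, or the opcode unchanged if it is not listed."""
    return OPCODE_STEMS.get(opcode, opcode)


opcodes = ["push", "pusha", "addi", "multf", "mov"]
normalized = [normalize_opcode(op) for op in opcodes]
print(normalized)  # ['push', 'push', 'add', 'mult', 'mov']
```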
    

    If your potential mappings are too numerous to do by hand, then you could write a custom regex stemmer to do the mapping and conversion. Here is how you might do it in R. The following function takes an input word and tests it against one pattern per stem, where each pattern matches all the variants of that stem. For n stems, it returns a 1 x n data.frame in which 1 indicates a match and 0 indicates no match.

    #' Return word's stem data.frame with each column indicating presence (1) or 
    #' absence (0) of stem in that word.
    map_to_stem_df <- function(word) {
      ## named character vector of patterns, one per stem
      stem_regex <- c(add = "^add[if]$", 
                      mult = "^mult[if]$")
    
      ## iterate across the stem names
      res <- lapply(names(stem_regex), function(stem) {
    
        pat <- stem_regex[stem]
        ## if pattern matches word, then 1 else 0
        if (grepl(pattern = pat, x = word))  {
          pat_match <- 1
        } else {
          pat_match <- 0  
        }
        ## create 1x1 data.frame for stem
        df <- data.frame(pat_match) 
        names(df) <- stem
        return(df)
      })
      ## bind all cols into single row data.frame 1 x length(stem_regex) & return
      data.frame(res)
    
    }
    
    map_to_stem_df("addi")
    #  add mult
    #    1    0
    
    map_to_stem_df("additional")
    #  add mult
    #    0    0
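
For a Python pipeline, a rough counterpart of the R function above might look like the following sketch, using only the standard `re` module. The names `STEM_PATTERNS` and `regex_stem` are hypothetical, and the push pattern is an assumption based on the variants named in the question. Unlike the R version, it returns the stem itself rather than a row of 0/1 indicators, which is usually what you want when rewriting the files.

```python
import re

# Hypothetical patterns, mirroring the R example: stem -> regex for its variants.
STEM_PATTERNS = {
    "add": re.compile(r"^add[if]$"),
    "mult": re.compile(r"^mult[if]$"),
    "push": re.compile(r"^push[ab]$"),
}


def regex_stem(opcode):
    """Return the first stem whose pattern matches, else the opcode unchanged."""
    for stem, pattern in STEM_PATTERNS.items():
        if pattern.match(opcode):
            return stem
    return opcode


print(regex_stem("addi"))        # add
print(regex_stem("pushb"))       # push
print(regex_stem("additional"))  # additional (no pattern matches)
```

The anchors `^` and `$` keep a pattern from firing on longer tokens such as additional, the same safeguard the R patterns use.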