Tags: r, text-mining, tm

Split Identifier and Method Names When Creating a Source Code Corpus


I am trying to create a corpus from Java source code.
I am following the preprocessing steps in this paper http://cs.queensu.ca/~sthomas/data/Thomas_2011_MSR.pdf

Based on section 2.1, the following preprocessing steps should be applied:
- remove characters related to the syntax of the programming language [already done by removePunctuation]
- remove programming language keywords [already done by tm_map(dsc, removeWords, javaKeywords)]
- remove common English-language stopwords [already done by tm_map(dsc, removeWords, stopwords("english"))]
- apply word stemming [already done by tm_map(dsc, stemDocument)]

The remaining part is to split identifier and method names into multiple parts based on common naming conventions.

For example, 'firstName' should be split into 'first' and 'name', and 'calculateAge' into 'calculate' and 'age'.

Can anybody help me with this?

    library(tm)
    dd = DirSource(pattern = "\\.java$", recursive = TRUE)
    javaKeywords = c("abstract", "continue", "for", "new", "switch", "assert",
                     "the", "default", "package", "synchronized", "boolean",
                     "do", "if", "private", "this", "break", "double",
                     "implements", "protected", "throw", "byte", "else",
                     "null", "NULL", "TRUE", "FALSE", "true", "false",
                     "import", "public", "throws", "case", "enum",
                     "instanceof", "return", "transient", "catch", "extends",
                     "int", "short", "try", "char", "final", "interface",
                     "static", "void", "class", "finally", "long", "volatile",
                     "const", "float", "native", "super", "while")
    dsc <- Corpus(dd)
    dsc <- tm_map(dsc, stripWhitespace)
    dsc <- tm_map(dsc, removePunctuation)
    dsc <- tm_map(dsc, removeNumbers)
    dsc <- tm_map(dsc, removeWords, stopwords("english"))
    dsc <- tm_map(dsc, removeWords, javaKeywords)
    dsc <- tm_map(dsc, stemDocument)
    dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = FALSE))

Solution

  • I've written a tool in Perl to do all kinds of source code preprocessing, including identifier splitting:

    https://github.com/stepthom/lscp

    The relevant piece of code there is:

    =head2 tokenize
     Title    : tokenize
     Usage    : tokenize($wordsIn)
     Function : Splits words based on camelCase, under_scores, and dot.notation.
              : Leaves other words alone.
     Returns  : $wordsOut => string, the tokenized words
     Args     : named arguments:
              : $wordsIn => string, the white-space delimited words to process
    =cut
    sub tokenize{
        my $wordsIn  = shift;
        my $wordsOut = "";
    
        for my $w (split /\s+/, $wordsIn) {
            # Split up camel case: aaA ==> aa A
            $w =~ s/([a-z]+)([A-Z])/$1 $2/g;
    
            # Split up camel case: AAa ==> A Aa
            # Split up camel case: AAAAa ==> AAA Aa
            $w =~ s/([A-Z]{1,100})([A-Z])([a-z]+)/$1 $2$3/g;
    
            # Split up underscores 
            $w =~ s/_/ /g;
    
            # Split up dots
            $w =~ s/([a-zA-Z0-9])\.+([a-zA-Z0-9])/$1 $2/g;
    
            $wordsOut = "$wordsOut $w";
        }
    
        return removeDuplicateSpaces($wordsOut);
    }
    

    The above hacks are based on my own experience with preprocessing source code for textual analysis. Feel free to steal and modify.
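If Perl is not an option, the same four substitutions translate directly to other regex engines. Here is a line-for-line port to Python (a hypothetical equivalent of the `tokenize` sub above, not part of lscp itself):

```python
import re

def tokenize(words_in):
    """Python port of the lscp tokenize() regexes (illustrative)."""
    out = []
    for w in words_in.split():
        # Split up camel case: aaA ==> aa A
        w = re.sub(r'([a-z]+)([A-Z])', r'\1 \2', w)
        # Split up runs of capitals: AAAAa ==> AAA Aa
        w = re.sub(r'([A-Z]{1,100})([A-Z])([a-z]+)', r'\1 \2\3', w)
        # Split up underscores
        w = w.replace('_', ' ')
        # Split up dots
        w = re.sub(r'([a-zA-Z0-9])\.+([a-zA-Z0-9])', r'\1 \2', w)
        out.append(w)
    return ' '.join(out)

print(tokenize('firstName calculateAge XMLParser first_name java.util.List'))
# -> first Name calculate Age XML Parser first name java util List
```

Lowercasing and stemming would then be applied afterwards by the rest of the pipeline.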