I am trying to create a corpus from Java source code.
I am following the preprocessing steps in this paper http://cs.queensu.ca/~sthomas/data/Thomas_2011_MSR.pdf
Based on section 2.1, the following preprocessing steps should be performed:
- characters related to the syntax of the programming language should be removed [already done by removePunctuation]
- programming language keywords should be removed [already done by tm_map(dsc, removeWords, javaKeywords)]
- common English-language stopwords should be removed [already done by tm_map(dsc, removeWords, stopwords("english"))]
- word stemming should be applied [already done by tm_map(dsc, stemDocument)]
The remaining step is to split identifiers and method names into multiple parts based on common naming conventions.
For example, 'firstName' should be split into 'first' and 'name', and 'calculateAge' into 'calculate' and 'age'.
Can anybody help me with this?
library(tm)

dd <- DirSource(pattern = "\\.java$", recursive = TRUE)

# Java keywords and literals to remove ("the" is handled by the English stopword list)
javaKeywords <- c("abstract", "continue", "for", "new", "switch", "assert", "default", "package", "synchronized", "boolean", "do", "if", "private", "this", "break", "double", "implements", "protected", "throw", "byte", "else", "null", "true", "false", "import", "public", "throws", "case", "enum", "instanceof", "return", "transient", "catch", "extends", "int", "short", "try", "char", "final", "interface", "static", "void", "class", "finally", "long", "volatile", "const", "float", "native", "super", "while")

dsc <- Corpus(dd)
dsc <- tm_map(dsc, content_transformer(tolower))  # removeWords and stopwords("english") are case-sensitive
dsc <- tm_map(dsc, stripWhitespace)
dsc <- tm_map(dsc, removePunctuation)
dsc <- tm_map(dsc, removeNumbers)
dsc <- tm_map(dsc, removeWords, stopwords("english"))
dsc <- tm_map(dsc, removeWords, javaKeywords)
dsc <- tm_map(dsc, stemDocument)

dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = FALSE))
I've written a tool in Perl to do all kinds of source code preprocessing, including identifier splitting:
https://github.com/stepthom/lscp
The relevant piece of code there is:
=head2 tokenize
Title : tokenize
Usage : tokenize($wordsIn)
Function : Splits words based on camelCase, under_scores, and dot.notation.
: Leaves other words alone.
Returns : $wordsOut => string, the tokenized words
Args : named arguments:
: $wordsIn => string, the white-space delimited words to process
=cut
sub tokenize{
    my $wordsIn  = shift;
    my $wordsOut = "";

    for my $w (split /\s+/, $wordsIn) {
        # Split up camel case: aaA ==> aa A
        $w =~ s/([a-z]+)([A-Z])/$1 $2/g;

        # Split up camel case: AAa ==> A Aa
        # Split up camel case: AAAAa ==> AAA Aa
        $w =~ s/([A-Z]{1,100})([A-Z])([a-z]+)/$1 $2$3/g;

        # Split up underscores
        $w =~ s/_/ /g;

        # Split up dots
        $w =~ s/([a-zA-Z0-9])\.+([a-zA-Z0-9])/$1 $2/g;

        $wordsOut = "$wordsOut $w";
    }

    return removeDuplicateSpaces($wordsOut);
}
The above hacks are based on my own experience with preprocessing source code for textual analysis. Feel free to steal and modify.
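If you'd rather stay inside R instead of calling out to Perl, the same substitutions can be ported to gsub calls and plugged into your tm pipeline as a content_transformer. This is a rough, untested sketch (splitIdentifiers is just a name I made up, not part of tm), and one important ordering note: the split has to run before removePunctuation and before lowercasing, otherwise the camelCase, underscore, and dot information is already gone.

```r
library(tm)

# Rough R port of the Perl regexes above; run this BEFORE
# removePunctuation and any lowercasing step.
splitIdentifiers <- function(x) {
  x <- gsub("([a-z]+)([A-Z])", "\\1 \\2", x)                # aaA   ==> aa A
  x <- gsub("([A-Z]+)([A-Z])([a-z]+)", "\\1 \\2\\3", x)     # AAAAa ==> AAA Aa
  x <- gsub("_", " ", x)                                    # under_scores
  x <- gsub("([a-zA-Z0-9])\\.+([a-zA-Z0-9])", "\\1 \\2", x) # dot.notation
  x
}

# splitIdentifiers("firstName calculateAge")  ==>  "first Name calculate Age"
dsc <- tm_map(dsc, content_transformer(splitIdentifiers))
```

After this transform, your existing tolower/stemming/stopword steps will take care of normalizing the split parts.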