I am using the MALLET topic modelling sample code, and although it runs fine, I would like to know what the parameters of this statement actually mean:
instances.addThruPipe(new CsvIterator(new FileReader(dataFile),
"(\\w+)\\s+(\\w+)\\s+(.*)",
3, 2, 1) // (data, target, name) field indices
);
From the documentation:
This iterator, perhaps more properly called a Line Pattern Iterator, reads through a file and returns one instance per line, based on a regular expression.
If you have data of the form
[name] [label] [data]
The call you are interested in is
CsvIterator(java.io.Reader input, java.lang.String lineRegex,
int dataGroup, int targetGroup, int uriGroup)
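The first parameter is the source the data is read from, such as a FileReader or a StringReader. The second parameter is the regex used to extract the fields from each line read from that reader. In your example, the regex is (\\w+)\\s+(\\w+)\\s+(.*)
which translates to:
(\\w+) group 1: one or more word characters (letters, digits, underscore)
\\s+ one or more whitespace characters
(\\w+) group 2: one or more word characters
\\s+ one or more whitespace characters
(.*) group 3: the rest of the line
The numbers 3, 2, 1
are group indices into that regex: the data is group 3 (last), the target is group 2 (second), and the name is group 1 (first). The regex ensures that each line has the format stated in the documentation: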
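test1 spam Wanna buy viagra?
test2 not_spam Hello, are you busy on Sunday?
Note that with this regex the name and the label must consist of word characters only. \w does not match a hyphen, so a label such as not-spam would not be captured as a whole by (\\w+).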
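To see how the group indices map onto a line, here is a minimal sketch that uses only java.util.regex and does not depend on MALLET. The class name LinePatternDemo and the helper parse are hypothetical names made up for this illustration, and applying the pattern with find() is an assumption about how the line is matched, not something taken from the MALLET source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinePatternDemo {
    // Same regex as in the question. In a Java string literal, "\\w" is the regex \w.
    static final Pattern LINE = Pattern.compile("(\\w+)\\s+(\\w+)\\s+(.*)");

    // Hypothetical helper: returns {data, target, name} for one line,
    // or null if the regex is not found in the line.
    static String[] parse(String line, int dataGroup, int targetGroup, int nameGroup) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {          // assumption: first match anywhere in the line
            return null;
        }
        return new String[] {
            m.group(dataGroup),   // group 3 when called with (3, 2, 1)
            m.group(targetGroup), // group 2
            m.group(nameGroup)    // group 1
        };
    }

    public static void main(String[] args) {
        // Same index order as in the question: data = 3, target = 2, name = 1.
        String[] fields = parse("test1 spam Wanna buy viagra?", 3, 2, 1);
        System.out.println("name   = " + fields[2]); // prints: name   = test1
        System.out.println("target = " + fields[1]); // prints: target = spam
        System.out.println("data   = " + fields[0]); // prints: data   = Wanna buy viagra?
    }
}
```

Swapping the three integers changes only which capture group is treated as data, target, or name. The regex itself is untouched.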
CsvIterator
is a misleading name, because the class does not read comma-separated values as such. It splits each line using whatever regex you supply, and with the regex above the fields are whitespace-separated (space, tab, ...).
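Since the field boundaries come entirely from the regex, the same group indices would also work for genuinely comma-separated lines if the pattern is changed. The sketch below shows this with plain java.util.regex. The pattern ([^,]+),([^,]+),(.*) and the names DelimiterDemo and parse are assumptions made up for this illustration, not something from the MALLET documentation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimiterDemo {
    // Hypothetical comma-delimited variant: name,label,data
    static final Pattern COMMA_LINE = Pattern.compile("([^,]+),([^,]+),(.*)");

    // Returns {name, label, data}, or null if the line does not contain the pattern.
    static String[] parse(String line) {
        Matcher m = COMMA_LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // A hyphenated label is fine here, because [^,]+ accepts any non-comma character.
        String[] fields = parse("test2,not-spam,Hello, are you busy on Sunday?");
        System.out.println(String.join(" | ", fields));
        // prints: test2 | not-spam | Hello, are you busy on Sunday?
    }
}
```

Only the first two commas act as separators. Everything after the second comma, including further commas, ends up in group 3.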