
What do the parameters of CsvIterator mean in MALLET?


I am using the MALLET topic modelling sample code. It runs fine, but I would like to know what the parameters in this statement actually mean:

instances.addThruPipe(new CsvIterator(new FileReader(dataFile),
                                      "(\\w+)\\s+(\\w+)\\s+(.*)",
                                      3, 2, 1)  // (data, target, name) field indices                    
                     );

Solution

  • From the documentation:

    This iterator, perhaps more properly called a Line Pattern Iterator, reads through a file and returns one instance per line, based on a regular expression.

    If you have data of the form

    [name] [label] [data]

    The call you are interested in is

    CsvIterator(java.io.Reader input, java.lang.String lineRegex, 
                int dataGroup, int targetGroup, int uriGroup) 
    

    The first parameter is the source the data is read from, such as a FileReader or a StringReader. The second parameter is the regex applied to each line read from that source to extract the fields. In your example the pattern is (\\w+)\\s+(\\w+)\\s+(.*), where the doubled backslashes are Java string escaping and the regex itself is (\w+)\s+(\w+)\s+(.*). It translates to:

    • 1 or more word characters, i.e. letters, digits or underscore (capture group 1, this is the name of the instance), followed by
    • 1 or more whitespace characters (tab, space, ..), followed by
    • 1 or more word characters (capture group 2, this is the label/target), followed by
    • 1 or more whitespace characters (tab, space, ..), followed by
    • 0 or more characters of any kind up to the end of the line (capture group 3, this is the data)

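    The numbers 3, 2, 1 are capture-group indices: the data is in group 3, the target in group 2, and the name in group 1. The regex expects each line to follow the [name] [label] [data] layout from the documentation, with a name and a label made up of word characters only (a hyphenated label such as not-spam would not be captured whole by \w+):

    test1 spam Wanna buy viagra?
    test2 not_spam Hello, are you busy on Sunday?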
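
To see the group numbering in action without running MALLET, here is a minimal sketch that applies the same pattern with plain java.util.regex. The class LinePatternDemo and its split helper are hypothetical names for illustration only. The sketch also assumes the whole line must match (Matcher.matches()), which is a simplification and not a claim about how CsvIterator applies the pattern internally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of MALLET.
class LinePatternDemo {
    // Same pattern as in the question, written as a Java string literal.
    static final Pattern LINE = Pattern.compile("(\\w+)\\s+(\\w+)\\s+(.*)");

    // Returns {name, target, data}, or null if the whole line does not match.
    static String[] split(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // These mirror the "3, 2, 1" arguments passed to CsvIterator.
        int dataGroup = 3, targetGroup = 2, uriGroup = 1;
        return new String[] { m.group(uriGroup), m.group(targetGroup), m.group(dataGroup) };
    }

    public static void main(String[] args) {
        String[] f = split("test1 spam Wanna buy viagra?");
        System.out.println("name=" + f[0] + " target=" + f[1] + " data=" + f[2]);

        // \w does not match '-', so a line with a hyphenated label is rejected here.
        String[] g = split("test2 not-spam Hello, are you busy on Sunday?");
        System.out.println("hyphenated label matched: " + (g != null));
    }
}
```

If labels with hyphens or other punctuation are needed, one option is to widen the first two groups, for example to (\\S+), which accepts any run of non-whitespace characters.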
    

    CsvIterator is a terrible name, because the class does not actually parse comma-separated values. It splits each line with whatever regex you pass in, and with the pattern above the fields are separated by whitespace (space, tab, ...).
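
Since the separator comes from the regex rather than from the class, a pattern for genuinely comma-separated lines can be sketched as well. The pattern below and the CommaPatternDemo class are hypothetical and are exercised only with java.util.regex. The assumption is that such a string could be passed as lineRegex in place of the original, with the same 3, 2, 1 group indices.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; only java.util.regex is used, not MALLET itself.
class CommaPatternDemo {
    // name,label,data: the first two fields may not contain commas, the data may.
    static final String COMMA_LINE_REGEX = "([^,]+),([^,]+),(.*)";

    // Returns {name, label, data}, or null if the line does not match.
    static String[] split(String line) {
        Matcher m = Pattern.compile(COMMA_LINE_REGEX).matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] f = split("test2,not-spam,Hello, are you busy on Sunday?");
        System.out.println(String.join(" | ", f));
    }
}
```

Because the last group is (.*), commas inside the text itself stay part of the data field. Quoted fields and escaped commas are not handled by this sketch.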