How to get mallet to load all tokens from a line without a label?


I'm trying to perform topic modeling on a dataset that's in a whitespace-delimited file with no labels. I can't get Mallet to load all the tokens. I'm using version 2.0.8 on Linux and macOS.

As a test for the issue, I created a file with the one line:

1 2 3 4 5

Then ran

mallet import-file --token-regex [0-9]+ --keep-sequence true --label 0 --input testData --output testLoaded
mallet train-topics --input testLoaded

I should get 4 tokens, but I only get 3:

Data loaded. max tokens: 3 total tokens: 3

It gets even worse if I try to use the --data flag (the result is the same whether I combine --data 2 with --label 0 or use --data 2 on its own):

mallet import-file --token-regex [0-9]+ --keep-sequence true --label 0 --data 2 --input testData --output testLoaded2
mallet train-topics --input testLoaded2

Data loaded. max tokens: 1 total tokens: 1

So either I lose the first token, or I only get the first token (2 is appearing in the output later on, so I know it's not loading the rest of the line as a single token in the latter case).


Solution

  • Mallet parses lines in two phases. First, it segments the line into fields using the --line-regex option. Then it maps each captured group to one of the three instance fields (name, label, data).

    The command isn't working because it only changes the second phase, the mapping from regex groups to instance fields. The default line regex, ^(\S*)[\s,]*(\S*)[\s,]*(.*)$, still splits off the first two fields as name and label, so the command tells Mallet to separate off those fields and then ignore them rather than returning them to the data. Here's an example of the default behavior:

    $ bin/mallet import-file --input token_test.txt --keep-sequence \
    --token-regex [0-9]+ --print-output 
    name: 1
    target: 2
    input: 0: 3 (0)
    1: 4 (1)
    2: 5 (2)
    

    If we add --label 0, Mallet ignores the second field as a label, but the line regex still captures it, so it never reaches the data:

    $ bin/mallet import-file --input token_test.txt --keep-sequence \
    --token-regex [0-9]+ --label 0 --print-output 
    name: 1
    target: <null>
    input: 0: 3 (0)
    1: 4 (1)
    2: 5 (2)
    

    Now if we redefine the line regex, we can grab the whole line as a single field and use it all as data:

    $ bin/mallet import-file --input token_test.txt --keep-sequence \
    --token-regex [0-9]+ --line-regex '(.*)' --data 1 --name 0 --label 0 --print-output 
    name: csvline:1
    target: <null>
    input: 0: 1 (0)
    1: 2 (1)
    2: 3 (2)
    3: 4 (3)
    4: 5 (4)
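
The token counts above can be reproduced without Mallet by imitating the two phases with standard tools. This is only a rough sketch: the sed pattern is a hand-written POSIX approximation of Mallet's default line regex, and the count_tokens helper is a hypothetical name introduced for illustration, not part of Mallet.

```shell
# Rough emulation of Mallet's two parsing phases (an approximation, not Mallet itself).
line='1 2 3 4 5'

# Phase 1, default behavior: three groups (name, label, data); keep only group 3.
# Assumption: this POSIX pattern mirrors the default ^(\S*)[\s,]*(\S*)[\s,]*(.*)$
default_data=$(printf '%s\n' "$line" |
  sed -E 's/^([^[:space:]]*)[[:space:],]*([^[:space:]]*)[[:space:],]*(.*)$/\3/')

# Phase 1 with --line-regex '(.*)' and --data 1: the whole line is the data field.
whole_data=$line

# Phase 2: tokenize the data field with the token regex [0-9]+ and count the matches.
count_tokens() {
  printf '%s\n' "$1" | grep -oE '[0-9]+' | wc -l | tr -d ' '
}

echo "default line regex: $(count_tokens "$default_data") tokens"   # 3
echo "whole-line regex:   $(count_tokens "$whole_data") tokens"     # 5
```

The first count matches the "total tokens: 3" reported in the question, and the second matches the five tokens printed once --line-regex '(.*)' --data 1 --name 0 --label 0 is supplied. The same four options should carry over unchanged to the original import command that writes testLoaded.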