I'm using Mahout 0.7 on a pseudo-distributed Hadoop installation for testing purposes.
Much of what I'm doing follows Mahout in Action, which I know covers 0.5, but as far as I can tell nothing major has changed in seq2sparse since then.
I'm having a problem with the TF-IDF vectors generated by seq2sparse. No matter what I set "-x" (max document frequency percentage) to, I end up with the same number of terms in my dictionary and vectors of the same size.
I found one posting about Mahout 0.6 where -x was being parsed as an absolute number of documents rather than a percentage of documents. That was supposed to have been fixed in 0.7, but I tried using it that way too, just to see if it would help. The number of terms I get did not change. The values I've tried, and the number of terms I ended up with, are listed below.
My data set is 4850 Wikipedia articles from: http://dumps.wikimedia.org/enwiki/20110803/
The exact file is: pages-articles1.xml.bz2
The XML file was converted to a sequence file with:
mahout seqwiki -all -i <path to xml file> -o <path to output directory>
My calls to seq2sparse look like this:
mahout seq2sparse -i <seq directory> -o <out dir> -ow -wt tfidf -x 4800 -nv
My results:
| -x value | # of terms |
| 4800     | 256623     |
| 4600     | 256623     |
| 2500     | 256623     |
| 99       | 256623     |
| 90       | 256623     |
| 25       | 256623     |
| 5        | 256623     |
Any ideas on what I'm doing wrong?
I ended up asking this question on the Mahout user mailing list and got an answer. I'll reproduce it here for anybody wondering the same thing I was:
Dave Byrne - "maxDFPercent won't actually remove the terms from the dictionary, or reduce the size of the tfidf vectors. It simply sets the value of the vector to 0 for that term.
In other words, the dictionary size and vector length will remain the same, with fewer non-zero terms."
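To make that concrete, here is a minimal Python sketch of the behavior described in the answer. It is not Mahout's code: the function names (`tfidf_zeroing_frequent_terms`, `count_nonzero`), the toy documents, and the simplified weighting formula are all assumptions made for illustration. The only point it demonstrates is that zeroing over-frequent terms leaves the dictionary size and vector length unchanged while the number of non-zero entries drops.

```python
import math
from collections import Counter


def tfidf_zeroing_frequent_terms(docs, max_df_percent):
    """Toy TF-IDF: terms above max_df_percent get weight 0 but keep their slot."""
    # The dictionary is built from every term seen, before any frequency filtering.
    all_terms = sorted({term for doc in docs for term in doc})
    dictionary = {term: i for i, term in enumerate(all_terms)}
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))

    vectors = []
    for doc in docs:
        vec = [0.0] * len(dictionary)
        for term, tf in Counter(doc).items():
            if 100.0 * df[term] / n_docs > max_df_percent:
                continue  # too frequent: entry stays 0.0, dictionary is untouched
            # Simplified weight (an assumption, not Mahout's formula);
            # the +1 keeps the weight non-zero for terms that are kept.
            vec[dictionary[term]] = tf * (math.log(n_docs / df[term]) + 1.0)
        vectors.append(vec)
    return dictionary, vectors


def count_nonzero(vectors):
    """Count the non-zero entries across all vectors."""
    return sum(1 for vec in vectors for weight in vec if weight != 0.0)


# Hypothetical four-document corpus; "the" appears in every document.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]

for max_df in (100, 60, 30):
    dictionary, vectors = tfidf_zeroing_frequent_terms(docs, max_df)
    print(max_df, len(dictionary), len(vectors[0]), count_nonzero(vectors))
```

In this toy corpus the dictionary size and vector length are the same for every threshold; only the count of non-zero entries shrinks as the threshold is lowered. So to see whether -x is having any effect on real output, the thing to compare between runs is the number of non-zero entries in the vectors, not the number of terms in the dictionary.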