Does anyone know whether Apache Mahout works well with Spanish text? I need to cluster Spanish newspaper articles, and there aren't many tools for that. Mahout looks like a good framework for the job, but does it handle Spanish text well?
Why not? You can use the `seq2sparse` command of the `bin/mahout` script and specify the corresponding Lucene analyzer (`org.apache.lucene.analysis.es.SpanishAnalyzer`) using the `-a` option. See chapter 8 (pages 199-200...) of the Mahout in Action book.
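As a rough sketch, the invocation could look something like the following. The install location and both directory names are placeholders I made up, and it assumes the articles have already been converted to SequenceFiles and that your Mahout build ships a Lucene version that includes `SpanishAnalyzer`.

```shell
# Hypothetical locations, not from the original post; adjust to your setup.
MAHOUT_HOME="${MAHOUT_HOME:-/opt/mahout}"
SEQ_DIR="articles-seq"        # assumed: SequenceFiles holding the Spanish articles
VEC_DIR="articles-vectors"    # assumed: output directory for the sparse vectors
ANALYZER="org.apache.lucene.analysis.es.SpanishAnalyzer"

# -i is the input, -o the output, -a the Lucene analyzer used to tokenize each document.
CMD="$MAHOUT_HOME/bin/mahout seq2sparse -i $SEQ_DIR -o $VEC_DIR -a $ANALYZER"

# Show the command, then run it only if Mahout is really installed there.
echo "$CMD"
if [ -x "$MAHOUT_HOME/bin/mahout" ]; then
  $CMD
fi
```

The vectors written to the output directory are what you would then feed to one of the clustering commands.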
Besides this, you can also write your own analyzer, building on the existing ones. The book contains many examples, and you can find their source code in the repository.