java apache mahout

Apache Mahout for text in Spanish


Does anyone know whether Apache Mahout works well with text in Spanish? I need to cluster newspaper articles written in Spanish, and there are not many tools for that. Mahout looks like a good framework for the job, but does it handle Spanish text well?


Solution

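  • Why not? You can use the seq2sparse command of the bin/mahout script and specify the corresponding Lucene analyzer (org.apache.lucene.analysis.es.SpanishAnalyzer) with the -a option. The analyzer takes care of the language-specific work (tokenization, stop words, stemming); the clustering itself runs on the resulting vectors and does not depend on the language. See chapter 8 (pages 199-200...) of the book Mahout in Action.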

    Besides this, you can also write your own analyzer, building on the existing ones. The book contains many examples, and you can find the source code in the repository.
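
The vectorization step described above might look roughly like the sketch below. It is only an illustration under assumptions: the helper name `vectorize_spanish`, the `MAHOUT` variable, and all directory names are made up for this example, and it assumes the Lucene version bundled with your Mahout release includes `SpanishAnalyzer` on the classpath.

```shell
# Sketch: vectorize Spanish text with Mahout's command-line tools.
# MAHOUT and every directory name here are assumptions; adjust to your setup.
MAHOUT="${MAHOUT:-bin/mahout}"
ANALYZER="org.apache.lucene.analysis.es.SpanishAnalyzer"

# vectorize_spanish TEXT_DIR WORK_DIR   (hypothetical helper)
vectorize_spanish() {
    text_dir="$1"
    work_dir="$2"

    # 1. Convert the directory of plain-text articles into SequenceFiles,
    #    reading them as UTF-8 so accented characters are preserved.
    "$MAHOUT" seqdirectory -i "$text_dir" -o "$work_dir/seqfiles" -c UTF-8 || return 1

    # 2. Build TF-IDF vectors, tokenizing with the Spanish analyzer (-a).
    #    -ow overwrites any previous output in the target directory.
    "$MAHOUT" seq2sparse -i "$work_dir/seqfiles" -o "$work_dir/vectors" \
        -a "$ANALYZER" -wt tfidf -ow
}
```

Calling `vectorize_spanish articles-es work` would then leave the vectors under `work/vectors`, ready to be passed to one of Mahout's clustering commands. If the analyzer class cannot be loaded, check which Lucene version your Mahout distribution ships with.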