Tags: machine-learning, mahout, recommendation-engine

Mahout - JPA integration. Do I need a CSV file?


I have an existing data model using openJPA, and I am trying to integrate a CF system using Mahout.

Forgive me if this is a bone-headed question, but I just started researching Mahout. Mahout in Action is in the mail, so I should be up to speed soon.

My question is how to integrate Mahout with an existing JPA model. Do I need to provide a CSV file to the DataModel class, or can I extend DataModel to read directly from my existing dataSource? I realize it wouldn't be very complicated to generate a CSV file from my data, but doing so seems like an unnecessary intermediate step.

I am very new to the "large data set" world, so forgive my ignorance. But do most systems that use Mahout use a CSV data set? Somehow this seems strange to me.

Thanks.

Edit:

So I am reading the preview Amazon provides of Mahout in Action. It seems that you can have Mahout read directly from your DB, but you do so at the cost of performance. I can't wait to get my hands on this book. Any comments or tips about this would still be very much appreciated.


Solution

  • The distributed/Hadoop stuff would read from HDFS, or HBase or Cassandra or what have you.

    The non-distributed stuff generally reads from files, and there are some hooks to read from a database/JDBC. The source isn't all that important, since the recommender loads the data into memory anyway.

    You can write your own DataModel implementation against your data source, reuse GenericDataModel, or modify another implementation.
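To illustrate the GenericDataModel route, here is a minimal sketch of building an in-memory Mahout (Taste) DataModel from (userId, itemId, rating) tuples. It assumes the Mahout core jar is on the classpath; the idea that the tuples come from a JPA query, and any entity names mentioned in the comments, are assumptions for illustration, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericPreference;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.PreferenceArray;

public class JpaBackedModel {

    /**
     * Builds an in-memory Mahout DataModel from (userId, itemId, rating)
     * rows. In a real application the rows would come from a JPA query,
     * e.g. something like em.createQuery("select r from Rating r") over a
     * hypothetical Rating entity -- no CSV file needed.
     */
    public static DataModel buildModel(List<long[]> rows) {
        // Group preferences by user, as GenericDataModel expects.
        FastByIDMap<Collection<Preference>> prefs = new FastByIDMap<>();
        for (long[] row : rows) {
            long userId = row[0];
            long itemId = row[1];
            float rating = row[2];
            Collection<Preference> userPrefs = prefs.get(userId);
            if (userPrefs == null) {
                userPrefs = new ArrayList<>();
                prefs.put(userId, userPrefs);
            }
            userPrefs.add(new GenericPreference(userId, itemId, rating));
        }
        // toDataMap packs each user's list into a compact PreferenceArray.
        FastByIDMap<PreferenceArray> data = GenericDataModel.toDataMap(prefs, true);
        return new GenericDataModel(data);
    }

    public static void main(String[] args) throws TasteException {
        List<long[]> rows = List.of(
            new long[] {1, 101, 5},
            new long[] {1, 102, 3},
            new long[] {2, 101, 4});
        DataModel model = buildModel(rows);
        System.out.println(model.getNumUsers() + " users, "
            + model.getNumItems() + " items");
    }
}
```

The resulting DataModel can be handed to any non-distributed recommender (e.g. a GenericUserBasedRecommender), exactly as one built from a CSV via FileDataModel would be.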