I have an existing data model using OpenJPA, and I am trying to integrate a collaborative filtering (CF) system using Mahout.
Forgive me if this is a bone-headed question, but I just started researching Mahout. Mahout in Action is in the mail, so I should be up to speed soon.
My question is how to integrate Mahout with an existing JPA model. Do I need to provide a CSV file to the DataModel class, or can I extend DataModel to read directly from my existing DataSource? I realize it wouldn't be very complicated to generate a CSV file from my data, but doing so seems like an unnecessary intermediate step.
I am very new to the "large data set" world, so forgive my ignorance, but do most systems that use Mahout feed it CSV data sets? Somehow that seems strange to me.
Thanks.
Edit:
So I am reading the preview Amazon provides of Mahout in Action. It seems you can have Mahout interface directly with your DB, but you do so at the cost of performance. I can't wait to get my hands on this book. Any comments or tips would still be very much appreciated.
The distributed/Hadoop stuff would read from HDFS, or HBase, or Cassandra, or what have you.
The non-distributed stuff generally reads from files, and there are some hooks to read from a database via JDBC. The source isn't all that important, since the recommender is going to load it into memory anyhow.
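For the JDBC hook, here's a minimal sketch. It assumes MySQL with Connector/J and a hypothetical taste_preferences table (user_id, item_id, preference columns); the exact MySQLJDBCDataModel constructor arguments vary a bit between Mahout versions, so check yours.

    import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
    import org.apache.mahout.cf.taste.impl.model.jdbc.ReloadFromJDBCDataModel;
    import org.apache.mahout.cf.taste.model.DataModel;

    import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

    public class JdbcModelExample {
        public static void main(String[] args) throws Exception {
            // Point at the same database your OpenJPA entities live in.
            // All connection details and table/column names here are hypothetical.
            MysqlDataSource dataSource = new MysqlDataSource();
            dataSource.setServerName("localhost");
            dataSource.setDatabaseName("mydb");
            dataSource.setUser("user");
            dataSource.setPassword("password");

            // Maps a user_id / item_id / preference table straight into Mahout.
            MySQLJDBCDataModel jdbcModel = new MySQLJDBCDataModel(
                    dataSource, "taste_preferences", "user_id", "item_id", "preference");

            // Pulls the whole table into memory once, so the recommender
            // isn't hammering the database on every request.
            DataModel model = new ReloadFromJDBCDataModel(jdbcModel);

            System.out.println("users: " + model.getNumUsers());
        }
    }

The ReloadFromJDBCDataModel wrapper is what addresses the performance cost you mention in your edit: the database stays the source of record, but the recommender works against an in-memory copy.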
You can write your own DataModel, reuse GenericDataModel, or modify another implementation.
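If you'd rather go through JPA itself rather than raw JDBC, here's a rough sketch of the GenericDataModel route. The Rating entity and its query are hypothetical stand-ins for whatever your OpenJPA model actually looks like.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import javax.persistence.EntityManager;

    import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
    import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
    import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.model.PreferenceArray;

    public class JpaToDataModel {

        // Hypothetical entity; substitute your own OpenJPA entity and query.
        public static class Rating {
            long userId;
            long itemId;
            float score;
        }

        public static DataModel buildModel(EntityManager em) {
            // JPA 2.0 typed query; adjust to however your entities are mapped.
            List<Rating> ratings =
                    em.createQuery("SELECT r FROM Rating r", Rating.class).getResultList();

            // Group each user's ratings so they can be packed into one PreferenceArray.
            Map<Long, List<Rating>> byUser = new HashMap<Long, List<Rating>>();
            for (Rating r : ratings) {
                List<Rating> list = byUser.get(r.userId);
                if (list == null) {
                    list = new ArrayList<Rating>();
                    byUser.put(r.userId, list);
                }
                list.add(r);
            }

            // GenericDataModel wants a FastByIDMap of userID -> PreferenceArray.
            FastByIDMap<PreferenceArray> userData = new FastByIDMap<PreferenceArray>();
            for (Map.Entry<Long, List<Rating>> entry : byUser.entrySet()) {
                List<Rating> userRatings = entry.getValue();
                PreferenceArray prefs = new GenericUserPreferenceArray(userRatings.size());
                for (int i = 0; i < userRatings.size(); i++) {
                    Rating r = userRatings.get(i);
                    prefs.setUserID(i, r.userId);
                    prefs.setItemID(i, r.itemId);
                    prefs.setValue(i, r.score);
                }
                userData.put(entry.getKey(), prefs);
            }

            // Everything lives in memory from here on, which is exactly
            // what the non-distributed recommenders expect.
            return new GenericDataModel(userData);
        }
    }

Note that GenericDataModel never re-reads its source, so you'd rebuild it whenever you want the recommender to see fresh data.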