clojure, data-mining

Large-scale data mining with Clojure


I'm looking for a good reference on

large scale data mining with Clojure

I know of many good Clojure programming books (Programming Clojure, The Joy of Clojure, ...) and many good data mining textbooks (Mining of Massive Datasets, Managing Gigabytes, ...). However, I'm not aware of any reference that specifically addresses

large scale data mining with Clojure

The "with clojure" part is rather important to me for the following reasons:

* most theoretical analysis uses big-O running time, which ignores constant factors
* constants matter when the difference is 1 second vs 1 hour (for things that need to be real time)
* or 1 hour vs 1 week (for batch jobs)

In particular, I think there's a lot of interplay between the JVM, Clojure data structures, and whether data is held in memory or lazily read from disk. As a result, "slightly" different implementations of the "same" algorithm can have drastically different running times.
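To make the concern concrete, here is a minimal sketch of three ways to compute the same sum of squares. The function names are hypothetical and chosen only for illustration. All three are O(n), but they differ in how much boxing and intermediate sequence allocation they do, which is the kind of constant-factor difference I mean. I make no claim about which is fastest on any particular JVM.

```clojure
;; Hypothetical illustration: three O(n) implementations of one computation.

(defn sum-squares-lazy
  "Builds a lazy sequence of squares, then reduces over it."
  [n]
  (reduce + (map #(* % %) (range n))))

(defn sum-squares-transduce
  "Same computation via a transducer, with no intermediate lazy sequence."
  [n]
  (transduce (map #(* % %)) + (range n)))

(defn sum-squares-loop
  "Same computation as an explicit loop over primitive longs."
  [^long n]
  (loop [i 0 acc 0]
    (if (< i n)
      (recur (inc i) (+ acc (* i i)))
      acc)))

;; All three agree on the result.
(assert (= (sum-squares-lazy 1000)
           (sum-squares-transduce 1000)
           (sum-squares-loop 1000)))
```

Wrapping each call in `(time ...)` at the REPL gives a rough comparison, though JIT warm-up makes single measurements unreliable, so any numbers should be treated with care.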

Thus, my question (all of the above was to avoid being closed by "Check Google"):

What is a good resource on massive data mining with Clojure?

Thanks!


Solution

  • I don't think anyone has written a good comprehensive reference yet. But there is certainly a lot of work going on in this space (my own company included!)

    Some interesting links to follow up:

    • Storm - distributed realtime computation using Clojure. Could be used for large-scale data mining.
    • http://www.infoq.com/presentations/Why-Prismatic-Goes-Faster-With-Clojure - interesting video regarding Clojure performance and optimisation for machine learning applications
    • Incanter - probably the leading Clojure library for statistics and data visualisation
    • Weka - very comprehensive data mining / machine learning library for Java (and hence very easy to use directly from Clojure)