
Clojure frequency dictionary from big data


I want to write my own naive Bayes classifier. I have a file like this:

(This is a database of spam and ham messages. The first word of each line is the label, spam or ham, and the rest of the line is the message. Size: 0.5 MB. Source: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/)

ham     Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham     Ok lar... Joking wif u oni...
spam    Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham     U dun say so early hor... U c already then say...
ham     Nah I don't think he goes to usf, he lives around here though
spam    FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

I want to build a hash map like this: {"spam" {"go" 1, "until" 100, ...}, "ham" {......}}, that is, a map in which each value is a word-frequency map (one for ham and one for spam).

I know how to do this in Python or C++, and I wrote it in Clojure, but my solution fails with a StackOverflowError on large data.

My solution:

(defn read_data_from_file [fname]
    (map #(split % #"\s")(map lower-case (with-open [rdr (reader fname)] 
        (doall (line-seq rdr))))))

(defn do-to-map [amap keyseq f]
    (reduce #(assoc %1 %2 (f (%1 %2))) amap keyseq))

(defn dicts_from_data [raw_data]
    (let [data (group-by #(first %) raw_data)]
        (do-to-map
            data (keys data) 
                (fn [x] (frequencies (reduce concat (map #(rest %) x)))))))

I tried to find where it fails and wrote this:

(def raw_data (read_data_from_file (first args)))
(def d (group-by #(first %) raw_data))
(def f (map frequencies raw_data))
(def d1 (reduce concat (d "spam")))
(println (reduce concat (d "ham")))

Error:

Exception in thread "main" java.lang.RuntimeException: java.lang.StackOverflowError
    at clojure.lang.Util.runtimeException(Util.java:165)
    at clojure.lang.Compiler.eval(Compiler.java:6476)
    at clojure.lang.Compiler.eval(Compiler.java:6455)
    at clojure.lang.Compiler.eval(Compiler.java:6431)
    at clojure.core$eval.invoke(core.clj:2795)
    at clojure.main$eval_opt.invoke(main.clj:296)
    at clojure.main$initialize.invoke(main.clj:315)
.....

Can anyone help me make this better and more efficient? PS: Sorry for my writing mistakes. English is not my native language.


Solution

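  • Using apply instead of reduce in the anonymous function avoids the StackOverflowError. (reduce concat ...) wraps one lazy concat inside another for every message, and realizing that deeply nested sequence is what overflows the stack; (apply concat ...) makes a single concat call over all the messages. Instead of (fn [x] (frequencies (reduce concat (map #(rest %) x)))) use (fn [x] (frequencies (apply concat (map #(rest %) x)))).

    The following is the same code, slightly refactored but with exactly the same logic. read-data-from-file was changed to avoid mapping over the sequence of lines twice.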

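    ; Refer only the names used below; a bare (use 'clojure.string) also
    ; pulls in replace and reverse, which shadow clojure.core.
    (require '[clojure.string :refer [lower-case split]])
    (require '[clojure.java.io :refer [reader]])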
    
    (defn read-data-from-file [fname]
      (let [lines (with-open [rdr (reader fname)] 
                    (doall (line-seq rdr)))]
        (map #(-> % lower-case (split #"\s")) lines)))
    
    (defn do-to-map [m keyseq f]
        (reduce #(assoc %1 %2 (f (%1 %2))) m keyseq))
    
    (defn process-words [x]
      (->> x 
        (map #(rest %)) 
        (apply concat) ; This is the only real change from the 
                       ; original code, it used to be (reduce concat).
        frequencies))
    
    (defn dicts-from-data [raw_data]
      (let [data (group-by first raw_data)]
        (do-to-map data
                   (keys data) 
                   process-words)))
    
    (-> "SMSSpamCollection.txt" read-data-from-file dicts-from-data keys)
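
To see why swapping reduce for apply matters, here is a minimal sketch on made-up data. The messages value is a hypothetical stand-in (100,000 one-word messages), not the real SMS file, and the overflowing call is left commented out because whether and when it fails depends on the JVM stack size.

```clojure
;; Hypothetical toy input: 100,000 one-word "messages" (not the real SMS data).
(def messages (repeat 100000 ["word"]))

;; reduce nests one lazy seq per step:
;;   (concat (concat (concat m1 m2) m3) m4) ...
;; Nothing is evaluated until the result is read. At that point the first
;; element has to be fetched through every layer, using stack frames for
;; each one. Left commented out because it is expected to overflow.
;; (first (reduce concat messages))

;; apply hands all messages to a single concat call, which walks them
;; one after another without nesting.
(def flat (apply concat messages))

(println (count flat)) ; prints 100000
```

(mapcat rest x) is shorthand for (apply concat (map rest x)), so it should be usable in process-words in the same way.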