Search code examples
nlpracket

How to sum up the word frequencies after stemming in Racket?


As background I'm trying to make a NLP application in Racket and I arrived at the part where I have to stem the words (I also obtained their frequency).

I am using the (planet dyoo/porter-stemmer) package in order to stem, and as an example we can write:

(map (λ(x) (list (stem (first x)) (second x)))
     '(("cryed" 1)
       ("racketeer" 2)
       ("crying" 3)
       ("playing" 4)
       ("racketing" 5)
       ("plays" 6)
       ("Racket" 7)))

Which produces: '(("cry" 1) ("racket" 2) ("cry" 3) ("plai" 4) ("racket" 5) ("plai" 6) ("racket" 7))

Now my goal is to sum up the frequency for each term, aka to arrive at: '(("cry" 4) ("racket" 14) ("plai" 10))

I came up with a way to do it, but I don't like my solution:

(define (frequency string)
  (map (λ(x) (list (first x) (length x)))
       (group-by (λ(x) x) (string-split string))))

(define (recalculate lst)
  (frequency
   (string-join
    (flatten
     (map (λ(x) (make-list (second x) (first x))) lst)))))

Basically I retype each word as many times as it's frequency, then make a single string containing all words and finally compute the frequency again. Is there a simpler(faster) way to achieve this?

I should perhaps add that the order doesn't matter ("plai" can come up before "cry" and so on). Also I'm looking for a simpler solution because I'm gonna have to use larger datasets and I want to make this faster (I'd also be glad even if the frequency function can be made more faster).


Solution

  • You could create an add-count procedure that takes a list of counts and a new count as arguments, and adds the count to the list if there are no similarly tagged counts already in the list, or combines the new count with an existing count.

    #lang racket
    
    (define (get-tag c) (first c))
    
    (define (get-val c) (second c))
    
    (define (add-count cs c)
      (let* ((k (get-tag c))
             (v (get-val c))
             (old-count (assoc k cs)))
        (if old-count
            (cons (list k (+ v (get-val old-count)))
                  (remove old-count cs))
            (cons c cs))))
    

    Here get-tag and get-val are just convenience procedures to access the tag and value stored in a count. The assoc procedure is used to extract a copy of the first count in cs matching the new count c to be added. This count is stored in old-count, the value of which is used to create a new count which is added to the list after removing old-count from the original list cs.

    With the add-count procedure defined, a procedure reduce-counts could be defined that goes through all of the counts and accumulates them to an empty list by using add-count. The resulting list will have the counts combined.

    (define (reduce-counts cs (acc '()))
      (if (null? cs)
          acc
          (reduce-counts (rest cs) (add-count acc (first cs)))))
    

    Here is a test run:

    reduce-counts.rkt> (define test-counts '(("cry" 1) ("racket" 2) ("cry" 3) ("play" 4) ("racket" 5) ("play" 6) ("racket" 7)))
    reduce-counts.rkt> (reduce-counts test-counts)
    '(("racket" 14) ("play" 10) ("cry" 4))
    

    As an alternative approach you could use filter to collect counts with similar tags in a list, and combine those into a new count after summing the values. The combined counts can be collected in an accumulator before filtering the input to remove the tags which were just combined. This process can be repeated recursively until all counts have been combined, removed, and collected.

    ;;; An alternate solution
    (define (combine-like-counts cs)
      (list (get-tag (first cs))
            (foldl (lambda (c x) (+ x (get-val c))) 0 cs)))
    
    (define (reduce-counts cs (acc '()))
      (if (null? cs)
          acc
          (let* ((k (get-tag (first cs)))
                 (k-tag? (lambda (c) (equal? k (get-tag c))))
                 (like (filter k-tag? cs))
                 (remaining (filter (negate k-tag?) cs)))
            (reduce-counts remaining
                           (cons (combine-like-counts like) acc)))))
    

    Here the combine-like-counts procedure assumes that all counts in the input list share the same tag, so a new count is formed by taking the tag and the sum of all values into a list.

    The new reduce-counts procedure returns whatever has been placed in the accumulator when the input is the empty list, otherwise the tag of the first count is saved and used to create the k-tag? predicate, which is then used with filter to create a list of matching counts and a list of the remaining counts with all matching counts removed. The list of matching counts is combined into a single count with combine-like-counts and added to the accumulator, which is passed along with remaining recursively to reduce-counts.

    This works as before, although the ordering has changed:

    reduce-counts.rkt> (define test-counts '(("cry" 1) ("racket" 2) ("cry" 3) ("play" 4) ("racket" 5) ("play" 6) ("racket" 7)))
    reduce-counts.rkt> (reduce-counts test-counts)
    '(("play" 10) ("racket" 14) ("cry" 4))
    

    I would suspect that these two implementations would have different performance characteristics depending on the particulars of their input data. My hunch is that the second would fare better for large input that contained large quantities of each tag, but the real answer would come from testing on some representative data samples.

    If you are really concerned about performance for large amounts of data, you might consider converting the data to a hash table and using some of the built-in dictionary procedures to arrive at a similar solution.