As background I'm trying to make a NLP application in Racket and I arrived at the part where I have to stem the words (I also obtained their frequency).
I am using the (planet dyoo/porter-stemmer)
package in order to stem, and as an example we can write:
(map (λ(x) (list (stem (first x)) (second x)))
'(("cryed" 1)
("racketeer" 2)
("crying" 3)
("playing" 4)
("racketing" 5)
("plays" 6)
("Racket" 7)))
Which produces: '(("cry" 1) ("racket" 2) ("cry" 3) ("plai" 4) ("racket" 5) ("plai" 6) ("racket" 7))
Now my goal is to sum up the frequency for each term, aka to arrive at: '(("cry" 4) ("racket" 14) ("plai" 10))
I came up with a way to do it, but I don't like my solution:
(define (frequency string)
(map (λ(x) (list (first x) (length x)))
(group-by (λ(x) x) (string-split string))))
(define (recalculate lst)
(frequency
(string-join
(flatten
(map (λ(x) (make-list (second x) (first x))) lst)))))
Basically I retype each word as many times as it's frequency, then make a single string containing all words and finally compute the frequency again. Is there a simpler(faster) way to achieve this?
I should perhaps add that the order doesn't matter ("plai" can come up before "cry" and so on). Also I'm looking for a simpler solution because I'm gonna have to use larger datasets and I want to make this faster (I'd also be glad even if the frequency
function can be made more faster).
You could create an add-count
procedure that takes a list of counts and a new count as arguments, and adds the count to the list if there are no similarly tagged counts already in the list, or combines the new count with an existing count.
#lang racket
(define (get-tag c) (first c))
(define (get-val c) (second c))
(define (add-count cs c)
(let* ((k (get-tag c))
(v (get-val c))
(old-count (assoc k cs)))
(if old-count
(cons (list k (+ v (get-val old-count)))
(remove old-count cs))
(cons c cs))))
Here get-tag
and get-val
are just convenience procedures to access the tag and value stored in a count. The assoc
procedure is used to extract a copy of the first count in cs
matching the new count c
to be added. This count is stored in old-count
, the value of which is used to create a new count which is added to the list after removing old-count
from the original list cs
.
With the add-count
procedure defined, a procedure reduce-counts
could be defined that goes through all of the counts and accumulates them to an empty list by using add-count
. The resulting list will have the counts combined.
(define (reduce-counts cs (acc '()))
(if (null? cs)
acc
(reduce-counts (rest cs) (add-count acc (first cs)))))
Here is a test run:
reduce-counts.rkt> (define test-counts '(("cry" 1) ("racket" 2) ("cry" 3) ("play" 4) ("racket" 5) ("play" 6) ("racket" 7)))
reduce-counts.rkt> (reduce-counts test-counts)
'(("racket" 14) ("play" 10) ("cry" 4))
As an alternative approach you could use filter
to collect counts with similar tags in a list, and combine those into a new count after summing the values. The combined counts can be collected in an accumulator before filtering the input to remove the tags which were just combined. This process can be repeated recursively until all counts have been combined, removed, and collected.
;;; An alternate solution
(define (combine-like-counts cs)
(list (get-tag (first cs))
(foldl (lambda (c x) (+ x (get-val c))) 0 cs)))
(define (reduce-counts cs (acc '()))
(if (null? cs)
acc
(let* ((k (get-tag (first cs)))
(k-tag? (lambda (c) (equal? k (get-tag c))))
(like (filter k-tag? cs))
(remaining (filter (negate k-tag?) cs)))
(reduce-counts remaining
(cons (combine-like-counts like) acc)))))
Here the combine-like-counts
procedure assumes that all counts in the input list share the same tag, so a new count is formed by taking the tag and the sum of all values into a list.
The new reduce-counts
procedure returns whatever has been placed in the accumulator when the input is the empty list, otherwise the tag of the first count is saved and used to create the k-tag?
predicate, which is then used with filter
to create a list of matching counts and a list of the remaining counts with all matching counts removed. The list of matching counts is combined into a single count with combine-like-counts
and added to the accumulator, which is passed along with remaining
recursively to reduce-counts
.
This works as before, although the ordering has changed:
reduce-counts.rkt> (define test-counts '(("cry" 1) ("racket" 2) ("cry" 3) ("play" 4) ("racket" 5) ("play" 6) ("racket" 7)))
reduce-counts.rkt> (reduce-counts test-counts)
'(("play" 10) ("racket" 14) ("cry" 4))
I would suspect that these two implementations would have different performance characteristics depending on the particulars of their input data. My hunch is that the second would fare better for large input that contained large quantities of each tag, but the real answer would come from testing on some representative data samples.
If you are really concerned about performance for large amounts of data, you might consider converting the data to a hash table and using some of the built-in dictionary procedures to arrive at a similar solution.