Search code examples
clojureetl

Clojure CSV Matching Files


I have 2 csv files with around 22K records in file F1, and 50K records in file F2, both containing company name, and address information. I need to do a fuzzy-match on name, address, and phone. Each record in F1 needs to be fuzzy-matched against each record in F2. I have made a third file R3 which is a csv containing the rules for fuzzy-matching which column from F1 to which column with F2, with a fuzzy-tolerance-level. I am trying to do it with for loop, this way -

(for [j f1-rows
      h f2-rows
      r r3-rows
      :while (match-row j h r)]
  (merge j h))

(defn match-row [j h rules]
  (every?
   identity
   (map (fn [rule]
          (<= (fuzzy/jaccard
           ((keyword (first rule)) j)
           ((keyword (second rule)) h))
          ((nth rule 2))))
    rules)))

f1-rows and f2-rows are collections of map. Rules is a collection of sequences containing column name from f1, f2, and the tolerance level. The code is running and functioning as expected. But my problem is, it is taking around 2 hours to execute. I read up on how transducers help improving performance by eliminating intermediate chunks, but I am not able to visualize how I would apply that in my case. Any thoughts on how I can make this better/faster ?


Solution

  • :while vs :when

    Your use of :while in this case doesn't seem to agree with your stated problem. Your for-expression will keep going while match-row is true, and stop altogether at the first false result. :when will iterate through all the combinations and include only ones where match-row is true in the resulting lazy-seq. The difference is explained here.

    For example:

    (for [i (range 10)
          j (range 10)
          :while (= i j)]
      [i j]) ;=> ([0 0])
    
    (for [i (range 10)
          j (range 10)
          :when (= i j)]
      [i j]) ;=> ([0 0] [1 1] [2 2] [3 3] [4 4] [5 5] [6 6] [7 7] [8 8] [9 9])
    

    It's really strange that your code kept running for 2 hours, because that means that during those two hours every invocation of (match-row j h r) returned true, and only the last one returned false. I would check the result again to see if it really makes sense.

    How long should it take?

    Let's first do some back-of-the-napkin math. If you want to compare every one of 22k records with every one of 55k records, you're gonna be doing 22k * 55k comparisons, there's no way around that.

    22k * 55k = 1,210,000,000

    That's a big number!

    What's the cost of a comparison?

    From a half-a-minute glance at wikipedia, jaccard is something about sets. The following will do to get a ballpark estimate of the cost, though it's probably very much on the low end.

    (time (clojure.set/difference (set "foo") (set "bar")))
    

    That takes about a tenth of a millisecond on my computer.

    (/ (* 22e3 55e3) ;; Number of comparisons.
       10 ; milliseconds
       1000 ;seconds
       60 ;minutes
       60) ;hours
    ;=> 33.611111111111114
    

    That's 33 and a half hours. And that's with a low-end estimate of the individual cost, and not counting the fact that you want to compare name, address and phone on each one (?). So that' 33 hours if every comparison fails at the first row, and 99 hours if they all get to the last row.

    Before any micro-optimizations, you need to work on the algorithm, by finding some clever way of not needing to do more than a billion comparisons. If you want help on that, you need to at least supply some sample data.

    Nitpicks

    The indentation of the anon fn inside match-row is confusing. I'd use an editor that indents automatically, and stick to that 99% of the time, because lisp programmers read nesting by the indentation, like python. There are some slight differences between editors/autoindenters, but they're all consistent with the nesting.

    (defn match-row [j h rules]
      (every?
       identity
       (map (fn [rule]
              (<= (fuzzy/jaccard
                   ((keyword (first rule)) j)
                   ((keyword (second rule)) h))
                  ((nth rule 2))))
            rules)))
    

    Also,match-row needs to be defined before it is used (which it probably is in your actual code, seeing as it compiles).