Complex data manipulation in Clojure


I'm working on a personal market analysis project. I've got a data structure representing all the recent turning points in the market, that looks like this:

[{:high 1.121455, :time "2016-08-03T05:15:00.000000Z"}
 {:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
 {:high 1.12173, :time "2016-08-03T04:30:00.000000Z"}
 {:high 1.121925, :time "2016-08-03T00:00:00.000000Z"}
 {:high 1.12215, :time "2016-08-02T23:00:00.000000Z"}
 {:high 1.12273, :time "2016-08-02T21:15:00.000000Z"}
 {:high 1.12338, :time "2016-08-02T18:15:00.000000Z"}
 {:low 1.119215, :time "2016-08-02T12:30:00.000000Z"}
 {:low 1.118755, :time "2016-08-02T12:00:00.000000Z"}
 {:low 1.117575, :time "2016-08-02T06:00:00.000000Z"}
 {:low 1.117135, :time "2016-08-02T04:30:00.000000Z"}
 {:low 1.11624, :time "2016-08-02T02:00:00.000000Z"}
 {:low 1.115895, :time "2016-08-01T21:30:00.000000Z"}
 {:low 1.11552, :time "2016-08-01T11:45:00.000000Z"}
 {:low 1.11049, :time "2016-07-29T12:15:00.000000Z"}
 {:low 1.108825, :time "2016-07-29T08:30:00.000000Z"}
 {:low 1.10839, :time "2016-07-29T08:00:00.000000Z"}
 {:low 1.10744, :time "2016-07-29T05:45:00.000000Z"}
 {:low 1.10716, :time "2016-07-28T19:30:00.000000Z"}
 {:low 1.10705, :time "2016-07-28T18:45:00.000000Z"}
 {:low 1.106875, :time "2016-07-28T18:00:00.000000Z"}
 {:low 1.10641, :time "2016-07-28T05:45:00.000000Z"}
 {:low 1.10591, :time "2016-07-28T01:45:00.000000Z"}
 {:low 1.10579, :time "2016-07-27T23:15:00.000000Z"}
 {:low 1.105275, :time "2016-07-27T22:00:00.000000Z"}
 {:low 1.096135, :time "2016-07-27T18:00:00.000000Z"}]

Conceptually, I want to match up :high/:low pairs, work out the price range (high-low) and midpoint (average of high & low), but I don't want every possible pair to be generated.

What I want to do is start from the 1st item in the collection {:high 1.121455, :time "2016-08-03T05:15:00.000000Z"} and walk "down" through the remainder of the collection, creating a pair with every :low item UNTIL I hit the next :high item. Once I hit that next :high item, I'm not interested in any further pairs. In this case, there's only a single pair created, which is the :high and the 1st :low - I stop there because the next (3rd) item is a :high. The 1 generated record should look like {:price-range 0.000365, :midpoint 1.121272, :extremes [{:high 1.121455, :time "2016-08-03T05:15:00.000000Z"}{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}]}

Next, I'd move onto the 2nd item in the collection {:low 1.12109, :time "2016-08-03T05:15:00.000000Z"} and walk "down" through the remainder of the collection, creating a pair with every :high item UNTIL I hit the next :low item. In this case, I get 5 new records generated, being the :low and the next 5 :high items which are all consecutive; the first of these 5 records would look like

{:price-range 0.00064, :midpoint 1.12141, :extremes [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}{:high 1.12173, :time "2016-08-03T04:30:00.000000Z"}]}

the second of these 5 records would look like

{:price-range 0.000835, :midpoint 1.1215075, :extremes [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}{:high 1.121925, :time "2016-08-03T00:00:00.000000Z"}]}

and so on.

After that, I get a :low so I stop there.

Then I'd move onto the 3rd item {:high 1.12173, :time "2016-08-03T04:30:00.000000Z"} and walk "down" creating pairs with every :low UNTIL I hit the next :high. In this case, I get 0 pairs generated, because the :high is followed immediately by another :high. Same for the next 3 :high items, which are all followed immediately by another :high

Next I get to the 7th item {:high 1.12338, :time "2016-08-02T18:15:00.000000Z"} and that should generate a pair with each of the following 19 :low items.

My generated result would be a list of all the pairs created:

[{:price-range 0.000365, :midpoint 1.121272, :extremes [{:high 1.121455, :time "2016-08-03T05:15:00.000000Z"}{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}]}
 {:price-range 0.00064, :midpoint 1.12141, :extremes [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}{:high 1.12173, :time "2016-08-03T04:30:00.000000Z"}]}
 ...

If I was implementing this using something like Python, I'd probably use a couple of nested loops, use a break to exit the inner loop when I stopped seeing :highs to pair with my :low and vice-versa, and accumulate all the generated records into an array as I traversed the 2 loops. I just can't work out a good way to attack it using Clojure...

Any ideas?


Solution

  • First of all, you can rephrase the problem as follows:

    1. You have to find all the boundary points, where a :high is followed by a :low, or vice versa.
    2. You need to take the item just before each boundary and combine it with every item after that boundary, up to (but not including) the next boundary.

    For simplicity, let's use the following data model:

    (def data0 [{:a 1} {:b 2} {:b 3} {:b 4} {:a 5} {:a 6} {:a 7}])
    

    The first part can be achieved with the partition-by function, which splits the input collection every time the given function returns a different value for the processed item:

    user> (def step1 (partition-by (comp boolean :a) data0))
    #'user/step1
    user> step1
    (({:a 1}) ({:b 2} {:b 3} {:b 4}) ({:a 5} {:a 6} {:a 7}))
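
    A note on the key function: partition-by starts a new group whenever the function's return value changes, so it matters that (comp boolean :a) only ever returns true or false. The following is a small sketch of the difference; the bare keyword :a is used as a deliberately wrong key function for comparison, and the names by-flag and by-value are only illustrative:

    ```clojure
    (def data0 [{:a 1} {:b 2} {:b 3} {:b 4} {:a 5} {:a 6} {:a 7}])

    ;; (comp boolean :a) maps every item to true or false,
    ;; so consecutive :a items fall into the same group.
    (def by-flag (partition-by (comp boolean :a) data0))
    ;; => (({:a 1}) ({:b 2} {:b 3} {:b 4}) ({:a 5} {:a 6} {:a 7}))

    ;; The bare keyword returns the stored value (1, nil, nil, nil, 5, 6, 7),
    ;; so every distinct :a value starts its own group.
    (def by-value (partition-by :a data0))
    ;; => (({:a 1}) ({:b 2} {:b 3} {:b 4}) ({:a 5}) ({:a 6}) ({:a 7}))
    ```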
    

    Now you need to take every two adjacent groups from step1 and process them together. The pairs of groups should look like this: [({:a 1}) ({:b 2} {:b 3} {:b 4})] [({:b 2} {:b 3} {:b 4}) ({:a 5} {:a 6} {:a 7})]

    This is achieved with the partition function, using a window size of 2 and a step of 1:

    user> (def step2 (partition 2 1 step1))
    #'user/step2
    user> step2
    ((({:a 1}) ({:b 2} {:b 3} {:b 4})) 
     (({:b 2} {:b 3} {:b 4}) ({:a 5} {:a 6} {:a 7})))
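
    The step argument is what makes the windows overlap, so that each group appears once as the right-hand side and once as the left-hand side. A minimal sketch with placeholder keywords standing in for the groups (the names overlapping, disjoint and single are only illustrative):

    ```clojure
    ;; Window size 2, step 1: each window starts one element after the previous one.
    (def overlapping (partition 2 1 [:g1 :g2 :g3 :g4]))
    ;; => ((:g1 :g2) (:g2 :g3) (:g3 :g4))

    ;; Without a step, the step defaults to the window size and nothing overlaps.
    (def disjoint (partition 2 [:g1 :g2 :g3 :g4]))
    ;; => ((:g1 :g2) (:g3 :g4))

    ;; A single group has no neighbour, so no window (and later no pair) is produced.
    (def single (partition 2 1 [:g1]))
    ;; => ()
    ```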
    

    Next, you have to do something with every pair of groups. You could do it with map:

    user> (def step3 (map (fn [[lbounds rbounds]]
                            (map #(vector (last lbounds) %)
                                 rbounds))
                          step2))
    #'user/step3
    user> step3
    (([{:a 1} {:b 2}] [{:a 1} {:b 3}] [{:a 1} {:b 4}]) 
     ([{:b 4} {:a 5}] [{:b 4} {:a 6}] [{:b 4} {:a 7}]))
    

    But since you need the concatenated list rather than the grouped one, you would want to use mapcat instead of map:

    user> (def step3 (mapcat (fn [[lbounds rbounds]]
                               (map #(vector (last lbounds) %)
                                    rbounds))
                             step2))
    #'user/step3
    user> step3
    ([{:a 1} {:b 2}] 
     [{:a 1} {:b 3}] 
     [{:a 1} {:b 4}] 
     [{:b 4} {:a 5}] 
     [{:b 4} {:a 6}] 
     [{:b 4} {:a 7}])
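
    For intuition, mapcat behaves like map followed by concatenating the per-item results. A tiny sketch on toy data (groups, nested and flat are illustrative names, not part of the solution):

    ```clojure
    (def groups [[1 2] [3 4 5]])

    ;; map keeps one result sequence per input group.
    (def nested (map (fn [g] (map inc g)) groups))
    ;; => ((2 3) (4 5 6))

    ;; mapcat concatenates those result sequences into one flat sequence.
    (def flat (mapcat (fn [g] (map inc g)) groups))
    ;; => (2 3 4 5 6)

    ;; The two are related by apply concat.
    (= flat (apply concat nested))
    ;; => true
    ```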
    

    The flattened step3 is almost the result we want; the only difference is that it contains vectors instead of maps.

    Now you can tidy it up with the thread-last macro (->>):

    (->> data0
         (partition-by (comp boolean :a))
         (partition 2 1)
         (mapcat (fn [[lbounds rbounds]]
                   (map #(vector (last lbounds) %)
                        rbounds))))
    

    This gives you exactly the same result.

    Applied to your data (assumed here to be bound to data), it looks almost the same, only with a different result-generating function:

    user> (defn hi-or-lo [item]
            (item :high (item :low)))
    #'user/hi-or-lo
    user> 
    (->> data
         (partition-by (comp boolean :high))
         (partition 2 1)
         (mapcat (fn [[lbounds rbounds]]
                   (let [left-bound (last lbounds)
                         left-val (hi-or-lo left-bound)]
                     (map #(let [right-val (hi-or-lo %)
                                 diff (Math/abs (- right-val left-val))]
                             {:extremes [left-bound %]
                              :price-range diff
                              :midpoint (+ (min right-val left-val)
                                           (/ diff 2))})
                          rbounds))))
         (clojure.pprint/pprint))
    

    It prints the following:

    ({:extremes
      [{:high 1.121455, :time "2016-08-03T05:15:00.000000Z"}
       {:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}],
      :price-range 3.6500000000017074E-4,
      :midpoint 1.1212725}
     {:extremes
      [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
       {:high 1.12173, :time "2016-08-03T04:30:00.000000Z"}],
      :price-range 6.399999999999739E-4,
      :midpoint 1.12141}
     {:extremes
      [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
       {:high 1.121925, :time "2016-08-03T00:00:00.000000Z"}],
      :price-range 8.350000000001412E-4,
      :midpoint 1.1215074999999999}
     {:extremes
      [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
       {:high 1.12215, :time "2016-08-02T23:00:00.000000Z"}],
      :price-range 0.001060000000000061,
      :midpoint 1.12162}
     {:extremes
      [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
       {:high 1.12273, :time "2016-08-02T21:15:00.000000Z"}],
      :price-range 0.0016400000000000858,
      :midpoint 1.12191}
     {:extremes
      [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
       {:high 1.12338, :time "2016-08-02T18:15:00.000000Z"}],
      :price-range 0.0022900000000001253,
      :midpoint 1.1222349999999999}
     {:extremes
      [{:high 1.12338, :time "2016-08-02T18:15:00.000000Z"}
       {:low 1.119215, :time "2016-08-02T12:30:00.000000Z"}],
      :price-range 0.004164999999999974,
      :midpoint 1.1212975}
     {:extremes
      [{:high 1.12338, :time "2016-08-02T18:15:00.000000Z"}
       {:low 1.118755, :time "2016-08-02T12:00:00.000000Z"}],
      :price-range 0.004625000000000101,
      :midpoint 1.1210675}
     ...
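
    As a quick sanity check, the same pipeline can be wrapped in a function and run on a small slice of the data. This is only a sketch: the names pair-extremes and sample are hypothetical (they do not appear above), and sample is just the first eight turning points from the question (a high, a low, five highs, a low).

    ```clojure
    (defn hi-or-lo
      "Returns the :high price of an item, or its :low price if it has no :high."
      [item]
      (item :high (item :low)))

    (defn pair-extremes
      "Hypothetical wrapper around the pipeline above; returns the records instead of printing them."
      [points]
      (->> points
           (partition-by (comp boolean :high))
           (partition 2 1)
           (mapcat (fn [[lbounds rbounds]]
                     (let [left-bound (last lbounds)
                           left-val   (hi-or-lo left-bound)]
                       (map (fn [right-bound]
                              (let [right-val (hi-or-lo right-bound)
                                    diff      (Math/abs (- right-val left-val))]
                                {:extremes    [left-bound right-bound]
                                 :price-range diff
                                 :midpoint    (+ (min right-val left-val) (/ diff 2))}))
                            rbounds))))))

    ;; First eight turning points from the question.
    (def sample
      [{:high 1.121455, :time "2016-08-03T05:15:00.000000Z"}
       {:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
       {:high 1.12173, :time "2016-08-03T04:30:00.000000Z"}
       {:high 1.121925, :time "2016-08-03T00:00:00.000000Z"}
       {:high 1.12215, :time "2016-08-02T23:00:00.000000Z"}
       {:high 1.12273, :time "2016-08-02T21:15:00.000000Z"}
       {:high 1.12338, :time "2016-08-02T18:15:00.000000Z"}
       {:low 1.119215, :time "2016-08-02T12:30:00.000000Z"}])

    (def pairs (pair-extremes sample))

    ;; Number of records per left-hand item: the first :high, the first :low,
    ;; and the last :high of the run of five.
    (map count (partition-by (comp first :extremes) pairs))
    ;; => (1 5 1)
    ```

    The :high items in the middle of the run produce no records, because only the last item of each group is ever used as the left-hand side. By the same reasoning, the full 26-item vector from the question should yield 1 + 5 + 19 = 25 records.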
    

    As for the broader question about "complex data manipulation", I would advise you to look through the collection-manipulating functions in clojure.core, and then try to decompose any task into applications of those. There are not many cases where you need something beyond them.