Using Filter on vectors

I am trying to use the filter function over a vector called dataset that is defined like so:

AK,0.89,0.98
AR,0.49,0.23
AN,0.21,0.78
...

And I want to get all of the values that contain a certain string, something like this:

(filter (contains "AK") dataset)

Which would return:

AK,0.89,0.98

Is it possible to do this using the filter function? I already iterate over the vector using doseq but I'm required to use filter at some point in my code. Thanks :)

Solution

The basic answer is yes, you can use filter to do this. Filter expects a predicate function, i.e. a function which returns true or false (strictly speaking, any truthy or falsey value). Filter walks the collection you pass in and calls the predicate on each element. What you do inside the predicate function is totally up to you (though you should avoid side effects). Filter collects all the elements for which the predicate returned true into a new lazy sequence.

Essentially, you have (long form)

(filter (fn [element]
          ;; some test returning true/false
          )
        col)

where col is your collection. The result will be a LAZY SEQUENCE of elements where the predicate function returned true. It is important to understand that things like filter and map return lazy sequences and know what that really means.
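To make the laziness point concrete, here is a minimal sketch. The collection and the names evens and evens-vec are made up for illustration and are not part of the question.

```clojure
;; Hypothetical sample data, not from the question.
(def col [1 2 3 4 5 6])

;; filter returns a lazy sequence, even though the input is a vector.
(def evens (filter even? col))

(vector? evens) ; false
(seq? evens)    ; true

;; If a vector is needed, realise the sequence explicitly.
(def evens-vec (vec evens)) ; filterv does the same in one step
;; evens-vec => [2 4 6]
```

Nothing is computed until something consumes the sequence, which is one reason the predicate should be free of side effects.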

The critical bit to understand is the structure of your collection. In your description, you stated

I am trying to use the filter function over a vector called dataset that is defined like so:

AK,0.89,0.98
AR,0.49,0.23
AN,0.21,0.78
...

Unfortunately, your description is a little ambiguous. If your dataset is actually a vector of vectors (not simply a flat vector), then things are very straightforward, because each 'element' passed to the predicate function will be one of your 'inner' vectors. In that case, the real definition is more accurately represented as

[
 ["AK" 0.89 0.98]
 ["AR" 0.49 0.23]
 ["AN" 0.21 0.78]
 ...
]

What will be passed to the predicate is a vector of 3 elements. If you just want to select all the vectors where the first element is "AK", then the predicate function could be as simple as

(fn [el]
 (if (= "AK" (first el))
   true
   false))

So the full line would be something like

(filter (fn [el]
         (if (= "AK" (first el))
           true
           false)) [["AK" 0.89 0.98] ["AR" 0.49 0.23] ["AN" 0.21 0.78]])

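That is just the starting point, and a very verbose one. There is a lot you can do to make it shorter. For instance, = already returns true or false, so the if is redundant, and the anonymous function can be written with the #() reader shorthand, e.g.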

(filter #(= "AK" (first %)) [..])
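Putting the pieces together, a runnable sketch might look like the following. It assumes the dataset really is a vector of vectors with the code stored as a string in the first position. The helper first-is is a hypothetical name introduced here so that the call reads much like the (filter (contains "AK") dataset) form in the question.

```clojure
;; Assumed shape: vector of vectors, code string first.
(def dataset
  [["AK" 0.89 0.98]
   ["AR" 0.49 0.23]
   ["AN" 0.21 0.78]])

;; Hypothetical helper: builds a predicate that checks a row's first element.
(defn first-is
  [code]
  (fn [row] (= code (first row))))

(filter (first-is "AK") dataset)
;; => (["AK" 0.89 0.98])

;; filterv is the eager variant and returns a vector instead of a lazy seq.
(filterv (first-is "AK") dataset)
;; => [["AK" 0.89 0.98]]
```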

If, on the other hand, you really do just have a single flat vector, then things become a little more complicated because you would need to group the values somehow. This could be done by using the partition function to break up your vector into groups of 3 items before passing them to filter, e.g.

(filter pred (partition 3 col))

which would group the elements in your original vector into groups of 3 and pass each group to the predicate function. This is where the real power of map, filter, reduce, etc. comes into play: you can transform the data by passing it through a pipeline of functions, each of which manipulates the data in some way, until a final result pops out the end.
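As a sketch of that pipeline, assuming the flat layout (the name flat and the sample values are only illustrative):

```clojure
;; Assumed shape: one flat vector, three values per record.
(def flat ["AK" 0.89 0.98 "AR" 0.49 0.23 "AN" 0.21 0.78])

(partition 3 flat)
;; => (("AK" 0.89 0.98) ("AR" 0.49 0.23) ("AN" 0.21 0.78))

(filter #(= "AK" (first %)) (partition 3 flat))
;; => (("AK" 0.89 0.98))

;; The same steps written as a pipeline with the thread-last macro,
;; converting each group back to a vector at the end.
(->> flat
     (partition 3)
     (filter #(= "AK" (first %)))
     (map vec))
;; => (["AK" 0.89 0.98])
```

Note that each group produced by partition is a sequence rather than a vector, which is why the last step maps vec over the result.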

The key point is to understand what filter (and similar functions, such as map or reduce) will treat as an 'element' of your input collection. Basically, this is the same as what 'first' would return when called on the collection. This is what is passed to the predicate function in filter.
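A quick way to check is to call first on the data at the REPL. The sketch below does this for three possible shapes. The third one (a vector of comma-separated strings) is a further guess at what the question might mean, not something stated in it, and all three names are hypothetical.

```clojure
(require '[clojure.string :as str])

(def nested [["AK" 0.89 0.98] ["AR" 0.49 0.23]]) ; vector of vectors
(def flat   ["AK" 0.89 0.98 "AR" 0.49 0.23])     ; single flat vector
(def lines  ["AK,0.89,0.98" "AR,0.49,0.23"])     ; vector of strings (assumed)

(first nested) ; => ["AK" 0.89 0.98]
(first flat)   ; => "AK"
(first lines)  ; => "AK,0.89,0.98"

;; For the string shape, the predicate works on a string, so a string
;; function such as clojure.string/starts-with? is the natural test.
(filter #(str/starts-with? % "AK,") lines)
;; => ("AK,0.89,0.98")
```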

There are a lot of assumptions here. One of the main ones is that your data is strictly ordered i.e. the value you are looking to test is always the first element in each group. If this is not the case, then more work will need to be done. Likewise, we assume the data is always in groups of 3. If it isn't, then other approaches will be needed.
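If those assumptions do not hold, the predicate or the grouping step has to change. The following is only a sketch of two possible adjustments, with made-up data: has-value? is a hypothetical helper that looks for the value anywhere in a row, and partition-all is the core function that keeps a trailing incomplete group, which partition silently drops.

```clojure
;; Hypothetical rows where the code is not always the first element.
(def rows [["AK" 0.89 0.98] [0.49 "AR" 0.23] [0.21 0.78 "AK"]])

;; Hypothetical helper: true if v appears anywhere in the row.
(defn has-value?
  [v row]
  (boolean (some #(= v %) row)))

(filter (partial has-value? "AK") rows)
;; => (["AK" 0.89 0.98] [0.21 0.78 "AK"])

;; When the input length is not a multiple of the group size:
(partition 3 [1 2 3 4 5])     ; => ((1 2 3))        incomplete group dropped
(partition-all 3 [1 2 3 4 5]) ; => ((1 2 3) (4 5))  incomplete group kept
```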