Search code examples
algorithmerlangelixirword-count

fast and concurrent algorithm of frequency calculation in elixir


I have two big lists that their item's lengths isn't constant. Each list include millions items. And I want to count frequency of items of first list in second list!

For example:

a = [[c, d], [a, b, e]]
b = [[a, d, c], [e, a, b], [a, d], [c, d, a]]

# expected result of calculate_frequency(a, b) is %{[c, d] => 2, [a, b, e] => 1} Or [{[c, d], 2}, {[a, b, e], 1}]

Due to the large size of the lists, I would like this process to be done concurrently. So I wrote this function:

  def calculate_frequency(items, data_list) do
    items
    |> Task.async_stream(
      fn item ->
        frequency =
          data_list
          |> Enum.reduce(0, fn data_row, acc ->
            if item -- data_row == [] do
              acc + 1
            else
              acc
            end
          end)

        {item, frequency}
      end,
      ordered: false
    )
    |> Enum.reduce([], fn {:ok, merged}, merged_list -> [merged | merged_list] end)
  end

But this algorithm is slow. What should I do to make it fast?

PS: Please do not consider the type of inputs and outputs, the speed of execution is important.


Solution

  • Not sure if this fast enough and certainly it's not concurrent. It's O(m + n) where m is the size of items and n is the size of data_list. I can't find a faster concurrent way because combining the result of all the sub-processes also takes time.

    data_list
    |> Enum.reduce(%{}, fn(item, counts)-> 
      Map.update(counts, item, 1, &(&1 + 1)) 
    end)
    |> Map.take(items)
    

    FYI, doing things concurrently does not necessarily mean doing things in parallel. If you have only one CPU core, concurrency actually slows things down because one CPU core can only do one thing at a time.