We're trying to build a BI system that will collect very large amounts of data to be processed by other components.
We decided it would be a good idea to have an intermediate layer to collect, store, and distribute the data.
The data is represented by a large set of log messages. Each log message has:
System specifics:
We were thinking that Kafka would do this job but we encountered several problems.
We tried to create a topic for each action type and a partition for each product. That way we would be able to consume exactly one product / one action type at a time.
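For context, here is a minimal sketch of that layout using the plain kafka-clients Java API (names and configuration here are made up for illustration, not our exact code): one topic per action type, the product id mapped to a fixed partition, and a consumer that manually assigns itself exactly one topic/partition pair.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class PerProductPartitionSketch {

    // One topic per action type (e.g. "click", "purchase"), one partition per product.
    static int partitionFor(int productId, int numProducts) {
        return productId % numProducts; // assumes the topic was created with numProducts partitions
    }

    static void produce(String actionType, int productId, int numProducts, String payload) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // a real producer would be created once and reused, not per message
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // explicit partition -> all messages of one product land in one partition
            producer.send(new ProducerRecord<>(actionType, partitionFor(productId, numProducts),
                    Integer.toString(productId), payload));
        }
    }

    static void consumeOneProduct(String actionType, int productId, int numProducts) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "bi-stats");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // manual assignment: exactly one action type (topic) and one product (partition)
            consumer.assign(Collections.singletonList(
                    new TopicPartition(actionType, partitionFor(productId, numProducts))));
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> r : records) {
                System.out.println(r.key() + " -> " + r.value());
            }
        }
    }
}
```

The problem with this layout is that it multiplies (number of action types) × (number of products) partitions, and every partition costs the broker file handles and memory, which is what led to the issues below.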
Initially we ran into "too many open files" errors, but after we changed the server config to allow more open files we started getting out-of-memory errors (with 12 GB allocated per node).
We also had problems with Kafka's stability: with a large number of topics, Kafka tends to freeze.
Our questions:
I'm posting this answer so that other users can see the solution we adopted.
Due to Kafka's limitations (the large number of partitions, which causes the OS to come close to its max open files limit) and somewhat weak performance, we decided to build a custom framework tailored to exactly our needs, using libraries like Apache Commons, Guava, Trove, etc. to achieve the performance we needed.
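To illustrate what we mean by that (a toy sketch assuming trove4j 3.x, not our production code): counting events per product id with a Trove primitive map instead of a `HashMap<Integer, Long>` avoids boxing Integer/Long objects on hot paths.

```java
import gnu.trove.map.TIntLongMap;
import gnu.trove.map.hash.TIntLongHashMap;

public class TroveCountingSketch {
    public static void main(String[] args) {
        // primitive int -> long map: no boxing, lower GC pressure than HashMap<Integer, Long>
        TIntLongMap countsPerProduct = new TIntLongHashMap();

        int[] sampleProductIds = {7, 7, 42, 7, 42};   // made-up sample data
        for (int productId : sampleProductIds) {
            // add 1 if the key already exists, otherwise insert it with value 1
            countsPerProduct.adjustOrPutValue(productId, 1L, 1L);
        }

        System.out.println(countsPerProduct.get(7));   // 3
        System.out.println(countsPerProduct.get(42));  // 2
    }
}
```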
The entire system (distributed and scalable) has 3 main parts:
ETL (reads the data, processes it and writes it to binary files)
Framework Core (used to read the binary files and calculate stats; a simplified sketch follows below)
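To give a rough idea of the approach (a simplified sketch with an invented record layout and class names, not the production code): the ETL appends fixed-layout records to binary files via DataOutputStream, and the core scans them and aggregates counters in memory (a primitive Trove map in the real framework, as in the earlier sketch).

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class BinaryLogSketch {

    // ETL side: append one record = (timestamp, productId, actionType) per log message
    static void write(String file, long timestamp, int productId, short actionType) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file, true)))) {
            out.writeLong(timestamp);
            out.writeInt(productId);
            out.writeShort(actionType);
        }
    }

    // Core side: scan one file and count events per productId for a single action type
    static Map<Integer, Long> countByProduct(String file, short wantedAction) throws IOException {
        Map<Integer, Long> counts = new HashMap<>(); // a Trove TIntLongHashMap in the real framework
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (true) {
                long timestamp;
                try {
                    timestamp = in.readLong();
                } catch (EOFException eof) {
                    break; // clean end of file
                }
                int productId = in.readInt();
                short actionType = in.readShort();
                if (actionType == wantedAction) {
                    counts.merge(productId, 1L, Long::sum);
                }
            }
        }
        return counts;
    }
}
```

Since the files are plain append-only binary, they are cheap to write, easy to partition by product/action, and can be scanned in parallel by the distributed core.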
As a side note: we tried other solutions like HBase, Storm, etc., but none lived up to our needs.