Tags: java, bigdata, business-intelligence, apache-kafka

How can Kafka limitations be avoided?


We're trying to build a BI system that will collect very large amounts of data that should be processed by other components.
We decided it would be a good idea to have an intermediate layer to collect, store and distribute the data.

The data is a large set of log messages. Each log message has:

  • a product
  • an action type
  • a date
  • a message payload

System specifics:

  • average: 1.5 million messages / minute
  • peak: 15 million messages / minute
  • the average message size is: 700 bytes (approx. 1.3 TB / day; see the quick check after this list)
  • we have 200 products
  • we have 1100 action types
  • the data should be ingested every 5 minutes
  • the consumer applications usually need 1-3 products with 1-3 action types (we need fast access for 1 product / 1 action type)
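
As a quick sanity check, the figures above multiply out to roughly the quoted daily volume. The snippet below is only a back-of-the-envelope calculation using the numbers listed:

```java
public class VolumeCheck {
    public static void main(String[] args) {
        long msgsPerMinute = 1_500_000L;   // average rate from the list above
        long bytesPerMsg   = 700L;         // average message size
        long minutesPerDay = 24 * 60;
        double bytesPerDay = (double) msgsPerMinute * bytesPerMsg * minutesPerDay;
        // ~1.51 * 10^12 bytes, i.e. roughly 1.4 TiB -- in line with the "approx. 1.3 TB / day" above
        System.out.printf("%.2f TiB/day%n", bytesPerDay / Math.pow(1024, 4));
    }
}
```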

We were thinking that Kafka would do the job, but we ran into several problems.
We tried to create a topic for each action type and a partition for each product. This way we would be able to extract 1 product / 1 action type for consumption.
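
For illustration, this is roughly what that layout looks like with the plain Java producer from kafka-clients (the broker address, topic name and product index below are made up for the example):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ActionTypeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092");                   // illustrative broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            String actionType = "page-view";   // one topic per action type (~1100 topics)
            int productIndex  = 42;            // one partition per product (200 partitions per topic)
            String payload    = "...";         // the ~700-byte log message

            // Writing to an explicit partition means that consuming
            // 1 product / 1 action type is a read of exactly one partition.
            producer.send(new ProducerRecord<>(actionType, productIndex, null, payload));
        }
    }
}
```

With ~1100 topics of 200 partitions each, this adds up to roughly 220,000 partitions in the cluster, which is what drove the open-file and memory problems described below.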

Initially we had a problem with "too many open files", but after we changed the server configuration to allow more open files we started getting out-of-memory errors (12 GB allocated per node).
We also had problems with Kafka's stability: with a large number of topics, Kafka tends to freeze.

Our questions:

  • Is Kafka suitable for our use case? Can it support such a large number of topics / partitions?
  • Can we organize the data in Kafka in another way to avoid these problems while still having good access speed for 1 product / 1 action type?
  • Do you recommend other Kafka alternatives that are better suited for this?

Solution

  • I'm posting this answer so that other users can see the solution we adopted.

    Due to Kafka limitations (the large number of partitions caused the OS to almost reach the maximum number of open files) and somewhat weak performance, we decided to build a custom framework tailored exactly to our needs, using libraries like Apache Commons, Guava, Trove, etc. to achieve the performance we needed.

    The entire system (distributed and scalable) has 3 main parts:

    1. ETL (reads the data, processes it and writes it to binary files)

    2. Framework Core (used to read from the binary files and calculate stats)

    3. API (used by many systems to get data for display)

    As a side note: we tried other solutions like HBase, Storm, etc., but none lived up to our needs.
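
    To give an idea of the ETL / core split, below is a minimal standard-library sketch of the kind of binary storage it is built around. The file layout and record format here are simplified assumptions for illustration; the real framework is more elaborate and uses the libraries mentioned above.

    ```java
    import java.io.*;
    import java.nio.file.*;

    // Hypothetical layout: one binary file per (product, action type),
    // each record = 8-byte timestamp + length-prefixed payload.
    public class BinaryLogStore {

        static Path fileFor(int product, int actionType) {
            return Paths.get("data", product + "_" + actionType + ".bin");
        }

        // ETL side: append one record to the file for its product / action type.
        static void append(int product, int actionType, long timestamp, byte[] payload) throws IOException {
            Path file = fileFor(product, actionType);
            Files.createDirectories(file.getParent());
            try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(
                    Files.newOutputStream(file, StandardOpenOption.CREATE, StandardOpenOption.APPEND)))) {
                out.writeLong(timestamp);
                out.writeInt(payload.length);
                out.write(payload);
            }
        }

        // Core side: scan one product / action type file and compute a simple stat (record count).
        static long countRecords(int product, int actionType) throws IOException {
            long count = 0;
            try (DataInputStream in = new DataInputStream(new BufferedInputStream(
                    Files.newInputStream(fileFor(product, actionType))))) {
                while (true) {
                    try {
                        in.readLong();                        // timestamp
                        int len = in.readInt();               // payload length
                        if (in.skipBytes(len) != len) break;  // truncated record
                        count++;
                    } catch (EOFException eof) {
                        break;                                // clean end of file
                    }
                }
            }
            return count;
        }
    }
    ```

    Because every (product, action type) pair maps to its own file, the typical query for 1 product / 1 action type becomes a sequential read of a single file.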