
Java: processing a huge CSV file and storing it in Cassandra using Apache Spark/Kafka/Storm


I am working on a requirement where I need to read sensor data from a CSV/TSV file and insert it into a Cassandra database.

CSV Format:

sensor1 timestamp1 value
sensor1 timestamp2 value
sensor2 timestamp1 value
sensor2 timestamp3 value

Details:

A user can upload a file to our web application. Once the file is uploaded, I need to display the unique values from a column to the user on the next page. For example:

  1. sensor1 node1
  2. sensor2 node2
  3. sensorn create

The user can either map sensor1 to an existing primary key such as node1, in which case the timestamps and values for sensor1 are added to the table under the primary key node1, or choose to create a new primary key, in which case the timestamps and values are added under that new key.

I was able to implement this using Java 8 streams and collections. It works with a small CSV file.

Question:

  1. How can I upload a huge CSV/TSV file (200 GB) to my web application? Should I upload the file to HDFS and specify the path in the UI? I have even split the huge file into small chunks (50 MB each).

  2. How can I get the unique values from the first column? Can I use Kafka/Spark here? I also need to insert the timestamps and values into Cassandra. Again, can I use Kafka/Spark here?

Any help is highly appreciated.


Solution

  • How can I upload a huge CSV/TSV file (200 GB) to my web application? Should I upload the file to HDFS and specify the path in the UI? I have even split the huge file into small chunks (50 MB each).

    It depends on how your web app is going to be used. Uploading a file that large within a single HTTP request from client to server is always going to be tricky, so you have to do it asynchronously. Whether you put the file in HDFS, S3, or even a simple SFTP server is a design choice, and that choice will affect what kinds of tools you build around the file. I would suggest starting with something simple like FTP/NAS and, as your need to scale grows, moving to something like S3. (Using HDFS as shared file storage is something I haven't seen many people do, but that shouldn't stop you from trying.)
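To make "do it asynchronously" concrete, here is a minimal sketch in plain Java, assuming the file has already been staged somewhere the server can read it (FTP/NAS, S3, HDFS). The class name `IngestJobs`, its methods, and the status values are hypothetical and only illustrate the pattern: the request thread registers the job and returns an id immediately, the heavy processing runs on a background pool, and the UI polls the status.

```java
import java.nio.file.Path;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical job tracker, not part of any framework.
class IngestJobs {
    enum Status { QUEUED, RUNNING, DONE, FAILED }

    private final Map<String, Status> statuses = new ConcurrentHashMap<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final Consumer<Path> processor;

    // The processor is whatever actually handles the staged file
    // (for example, kicking off a batch job); it is supplied by the caller.
    IngestJobs(Consumer<Path> processor) {
        this.processor = processor;
    }

    // Called from the request thread: returns right away with a job id.
    String submit(Path stagedFile) {
        String jobId = UUID.randomUUID().toString();
        statuses.put(jobId, Status.QUEUED);
        pool.submit(() -> {
            statuses.put(jobId, Status.RUNNING);
            try {
                processor.accept(stagedFile);
                statuses.put(jobId, Status.DONE);
            } catch (RuntimeException e) {
                statuses.put(jobId, Status.FAILED);
            }
        });
        return jobId;
    }

    // Polled by the UI; returns null for an unknown job id.
    Status status(String jobId) {
        return statuses.get(jobId);
    }

    // Stops accepting jobs and waits for the queued ones to finish.
    boolean shutdownAndWait(long timeoutSeconds) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In a real deployment the status map would probably need to live in a database rather than in memory, so that job state survives a restart.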

  • How can I get the unique values from the first column? Can I use Kafka/Spark here? I also need to insert the timestamps and values into Cassandra. Again, can I use Kafka/Spark here?

    A Spark batch job or even a plain MapReduce job would do the trick. This is just a simple groupBy operation, though you should think about how much latency you are willing to accept, as groupBy operations are generally costly (they involve shuffles). In my limited experience, streaming is slightly overkill unless you get a continuous stream of source data. The way you have described your use case, it looks more like a batch candidate to me.
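For the "unique values from the first column" step, only the keys are needed, not the grouped rows. The sketch below shows that idea in plain Java rather than Spark, so it runs without a cluster: it reads the file lazily, one line at a time, so memory grows with the number of distinct sensors and not with the file size. The class name `SensorIds` is hypothetical, and the sketch assumes the first column is delimited by whitespace or a comma. In a Spark batch job the analogous step would be a distinct over the first column.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only.
class SensorIds {
    // Returns the first field of a line, assuming fields are separated
    // by whitespace (space or tab) or by a comma.
    static String firstColumn(String line) {
        String trimmed = line.trim();
        int end = 0;
        while (end < trimmed.length()) {
            char c = trimmed.charAt(end);
            if (c == ',' || Character.isWhitespace(c)) {
                break;
            }
            end++;
        }
        return trimmed.substring(0, end);
    }

    // Collects the distinct, sorted first-column values; blank lines are skipped.
    static Set<String> distinct(Stream<String> lines) {
        return lines.map(SensorIds::firstColumn)
                    .filter(id -> !id.isEmpty())
                    .collect(Collectors.toCollection(TreeSet::new));
    }

    // Streams the file line by line instead of loading it into memory.
    static Set<String> distinct(Path file) {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return distinct(lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A single machine can do this pass over 200 GB, but it will be bound by disk throughput; that is the point at which a distributed batch job starts to pay off.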

    Some things I would focus on: how the file gets transferred from the client app, what the end-to-end SLAs are for data availability in Cassandra, what happens when there are failures (do we retry, etc.), and how often the jobs run (triggered every time a user uploads a file, or as a cron job).
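On the "do we retry" question, one common option is a bounded retry with exponential backoff around each write. The sketch below is a generic plain-Java helper, not a Cassandra driver API; the class name `Retry` and its parameters are hypothetical. It assumes the wrapped operation is safe to repeat, which generally holds for Cassandra inserts that reuse the same primary key, since they overwrite rather than duplicate.

```java
import java.util.function.Supplier;

// Hypothetical helper for illustration only.
class Retry {
    // Runs the task up to maxAttempts times, doubling the wait after each failure.
    // Rethrows the last failure if every attempt fails.
    static <T> T withBackoff(int maxAttempts, long initialDelayMillis, Supplier<T> task) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // no point sleeping after the final attempt
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
                delay *= 2;
            }
        }
        throw last;
    }
}
```

Whether to retry per row, per batch, or per job depends on the SLA; a failed batch job that is simply re-run from the staged file is often the simplest policy.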