Tags: python, mongodb, apache-kafka, apache-storm, bigdata

Ingesting GBs of data into MongoDB


I have several log files, each containing millions of records. I want to push the records from these files into MongoDB. Before inserting, I have to normalize the data and filter on ID, a field that is common to every row/record in all the files.

MY_MONGO_DB_SCHEMA =

    {
        "ID": "common to all the files",
        "LOG_FILE_1": [{
            # variables from LOG_FILE_1
            "var1": "value from the record matching this ID",
            "var2": "value from the record matching this ID"
        }],
        "LOG_FILE_2": [{
            # variables from LOG_FILE_2
            "var3": "value from the record matching this ID",
            "var4": "value from the record matching this ID"
        }]
    }

I have written a Python script, but it either consumes a lot of memory or, if I limit its memory usage, takes a very long time. Can somebody suggest how to use Apache Storm, Apache Kafka, or anything else for this kind of requirement? I have never used Kafka or Storm before.


Solution

  • Handling a big file in a single program needs a lot of memory, and since your input is large, processing it with a single process will also take a long time. You can combine Storm with Kafka for this use case. Here is how that can help solve your problem:

    A Storm topology has two kinds of components - Spouts and Bolts
    Spout - Emits a stream of data from a source.
    Bolt - Holds your business logic; in your case, normalizing the records.
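
To make the Spout and Bolt roles concrete, here is a minimal plain-Python analogy. It is not the Storm API; it only imitates the two roles with a generator and a function. The record format (`ID,key=value,...`) and the names `spout` and `normalize_bolt` are assumptions made up for this sketch, since the question does not show the real log format.

```python
import io
from typing import Dict, Iterable, Iterator, Tuple


def spout(lines: Iterable[str]) -> Iterator[str]:
    """Emit one raw record at a time, so the whole file never sits in memory."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield line


def normalize_bolt(raw: str, log_name: str) -> Tuple[Dict, Dict]:
    """Turn an assumed 'ID,key=value,...' record into a MongoDB (filter, update) pair."""
    record_id, *pairs = raw.split(",")
    fields = dict(pair.split("=", 1) for pair in pairs)
    # Filter on the shared ID; append this record's variables to the
    # array named after the log file it came from.
    return {"ID": record_id}, {"$push": {log_name: fields}}


# A file object is iterated line by line; StringIO stands in for a real log file.
sample = io.StringIO("42,var1=a,var2=b\n\n43,var1=c,var2=d\n")
updates = [normalize_bolt(raw, "LOG_FILE_1") for raw in spout(sample)]
```

With PyMongo, each pair could then be applied with `collection.update_one(flt, upd, upsert=True)`, which creates the document for an ID the first time it is seen and appends to the matching array afterwards. Batching the writes would likely be faster, but that depends on your data.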

    Put your log files into a Kafka topic and let Kafka be the source for the Spout. The Spout will emit the records as a stream that can be processed in a Bolt.
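
As a rough sketch of the "put your log files into a Kafka topic" step, the function below sends each line as one message, keyed by its ID so that records with the same ID land in the same partition under Kafka's default partitioning. The function name `publish_log`, the topic name, the JSON payload shape, and the assumption that the ID is the first comma-separated field are all hypothetical. `FakeProducer` exists only so the sketch runs without a broker.

```python
import json
from typing import Iterable


def publish_log(producer, topic: str, log_name: str, lines: Iterable[str]) -> int:
    """Send each non-empty line to Kafka, keyed by its ID. Returns the count sent."""
    sent = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: the ID is the first comma-separated field of the record.
        record_id = line.split(",", 1)[0]
        payload = json.dumps({"log": log_name, "raw": line}).encode("utf-8")
        producer.send(topic, key=record_id.encode("utf-8"), value=payload)
        sent += 1
    return sent


class FakeProducer:
    """Stand-in used only to exercise the function without a Kafka broker."""

    def __init__(self):
        self.messages = []

    def send(self, topic, key=None, value=None):
        self.messages.append((topic, key, value))


fake = FakeProducer()
count = publish_log(fake, "logs", "LOG_FILE_1", ["42,var1=a", "", "43,var1=c"])
```

In a real run, a `KafkaProducer(bootstrap_servers="localhost:9092")` from the kafka-python package could be passed in place of `FakeProducer`, with the broker address adjusted to your setup, followed by `producer.flush()` once the file has been read. Passing an open file object as `lines` keeps memory use flat, because the file is read one line at a time.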
    For more information on Storm, you can go through the course at https://in.udacity.com/course/real-time-analytics-with-apache-storm--ud381/. It's a free course.
    To understand Storm's parallelism, see http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/

    Hope it helps.