Tags: scala, apache-spark, elasticsearch, apache-flink

Am I using the right framework?


I am new to Scala / Flink / Spark and have a few questions. Right now Scala with Flink is being used.

The general idea of data flow looks like this:
CSV files -> Flink -> Elasticsearch -> Flink (process data) -> MongoDB -> Tableau

There is a huge number of log files, which are semicolon-separated. I want to write those files into Elasticsearch as my database (this already works).
Now various kinds of analyses are needed (e.g. a consistency report or a productivity report), and each report needs different kinds of columns.

The idea is to import the base data from Elasticsearch with Flink, edit the data, and save it into MongoDB so the visualization can be done with Tableau.

Editing would consist of adding additional columns such as the weekday and the start/end time of each status:

// +-------+-----+-----+  
// | status|date |time |  
// +-------+-----+-----+  
// | start | 1.1 |7:00 |  
// | run_a | 1.1 |7:20 |  
// | run_b | 1.1 |7:50 |  
// +-------+-----+-----+  


// +-------+-------+-------+----+  
// | status|s_time |e_time |day |  
// +-------+-------+-------+----+  
// | start | 7:00  |7:20   | MON|  
// | run_a | 7:20  |7:50   | MON|  
// | run_b | 7:50  |nextVal| MON|  
// +-------+-------+-------+----+  
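
In plain Scala (setting Flink aside for a moment) the transformation I am after would look roughly like this; the names Raw, Enriched, transform and dayOf are illustrative only, and it assumes zero-padded HH:mm times so that sorting works:

case class Raw(status: String, date: String, time: String)
case class Enriched(status: String, sTime: String, eTime: Option[String], day: String)

def transform(rows: Seq[Raw], dayOf: String => String): Seq[Enriched] = {
  // sort chronologically (assumes zero-padded HH:mm times)
  val sorted = rows.sortBy(r => (r.date, r.time))
  // the e_time of a status is the s_time of the next one;
  // the last status gets None (the "nextVal" above)
  val nextStart = sorted.drop(1).map(r => Option(r.time)) :+ None
  sorted.zip(nextStart).map { case (r, next) =>
    Enriched(r.status, r.time, next, dayOf(r.date))
  }
}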

After a bit of research I found that Flink does not offer the possibility to use Elasticsearch as a data source out of the box. There is a GitHub project, https://github.com/mnubo/flink-elasticsearch-source-connector, but it has not been updated for over a year and does not seem to work correctly: it gives me fewer hits than I get in Kibana with the same query. Are there any alternatives? Why is this not supported by default?

Are those table transformations doable with Flink? Does it make sense to do them with Flink? (I am having a really hard time achieving them.)

Am I using the right frameworks for this project? Should I switch to Spark, since it offers more functionality and community projects?


Solution

  • Firstly, if your goal is only log handling (robust search, visualization, storage), you don't need to reinvent the wheel: use the ELK stack, which gives you the following abilities:

    • Data collection and log parsing with Logstash
    • Analytics and visualization with Kibana
    • Elasticsearch as the search engine
    • Seamless integration with the cloud (AWS or Elastic Cloud)

    But this software is not fully free: you won't have access to the full functionality in the free version. From my personal experience, the trial version is appropriate for use in production; it really makes life easier.

    If you would like to build your own customized pipeline for storing, transforming, and processing logs or other files, Apache Spark is a superb solution for this purpose. You can use Spark as an ETL tool for almost any manipulation you want; building data pipelines is incredibly easy (read from Elasticsearch --> process it --> save to MongoDB; take from MongoDB --> send to visualization, etc.), and you can achieve a speedup compared to earlier versions of Spark by leveraging Spark 2.0.
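
    For illustration, here is a minimal sketch of such a pipeline, assuming the elasticsearch-spark and mongo-spark-connector packages are on the classpath; the host, index and collection names are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("log-etl")
      .config("es.nodes", "localhost") // Elasticsearch host (placeholder)
      .config("spark.mongodb.output.uri",
        "mongodb://localhost/reports.logs") // MongoDB target (placeholder)
      .getOrCreate()

    // read the base data from Elasticsearch into a DataFrame
    val logs = spark.read
      .format("org.elasticsearch.spark.sql")
      .load("logs/entry") // index/type (placeholder)

    // ... add columns such as weekday, s_time, e_time here ...

    // save the result to MongoDB for Tableau to pick up
    logs.write
      .format("com.mongodb.spark.sql.DefaultSource")
      .mode("append")
      .save()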

    Also, there are already ready-made solutions integrating Spark, MongoDB and Elasticsearch, or you can build your own using the connectors to ES and Mongo. Regarding Flink: you can use it instead of Spark, but Spark is the more mature technology and has a wider community. As an alternative, you can use an ETL tool for fast development/prototyping of data flows between systems (dragging the necessary components with the mouse), such as StreamSets or NiFi.