hadoop, data-warehouse, data-lake

Building a Data Lake from scratch


I am trying to build a "Data Lake" from scratch. I understand how a data lake works and what it is for; that is covered all over the internet. But when it comes to how to actually build one from scratch, I can't find a good source. I want to understand whether:

Data warehouse + Hadoop = Data Lake

I know how to run Hadoop and how to bring data into it. I want to build a sample on-premises data lake to demo to my manager. Any help is appreciated.


Solution

  • To turn a Hadoop cluster into a data lake, you need both structured and unstructured data in it.

    So you need an ETL pipeline that takes the unstructured data and converts it to structured data. Product reviews or something similar would serve as your unstructured data. Converting them into something usable by Hive (as an example) would give you your structured data.

    I would look at https://opendata.stackexchange.com/ for datasets, and search for "Hadoop ETL" for ideas on how to cleanse the data. How you write the pipeline (Spark or MapReduce) is up to you.
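To make the "unstructured to structured" step concrete, here is a minimal sketch in plain Python. Everything specific in it is an assumption made for illustration, not something from the question: the raw review format (`date | product_id | free text`), the column names, and the helper names `parse_review`, `to_tsv_row`, and `run_pipeline`. It writes tab-separated rows with `\N` for missing values, which is how Hive's default text format usually represents NULL. The same per-line logic could be moved into a Spark map or a MapReduce mapper.

```python
import re

# Column order for the structured output (hypothetical schema).
COLUMNS = ["review_date", "product_id", "rating", "word_count", "review_text"]

# Pulls a 1-5 star rating out of free text such as "5 stars" or "1 star".
RATING_RE = re.compile(r"\b([1-5])\s*stars?\b", re.IGNORECASE)


def parse_review(line):
    """Turn one raw review line into a structured record, or None if malformed.

    Assumed raw format: "<date> | <product id> | <free-text review>".
    """
    parts = [p.strip() for p in line.split("|", 2)]
    if len(parts) != 3 or not all(parts):
        return None
    review_date, product_id, text = parts
    text = " ".join(text.split())  # collapse tabs/newlines so they cannot break the TSV
    match = RATING_RE.search(text)
    return {
        "review_date": review_date,
        "product_id": product_id,
        "rating": int(match.group(1)) if match else None,
        "word_count": len(text.split()),
        "review_text": text,
    }


def to_tsv_row(record):
    """Serialize a record as one tab-separated line; None becomes \\N."""
    return "\t".join(
        "\\N" if record[col] is None else str(record[col]) for col in COLUMNS
    )


def run_pipeline(raw_lines):
    """Cleanse raw lines: keep parseable reviews, count the rejected ones."""
    rows, rejected = [], 0
    for line in raw_lines:
        record = parse_review(line)
        if record is None:
            rejected += 1
            continue
        rows.append(to_tsv_row(record))
    return rows, rejected


# Made-up sample input, for demonstration only.
SAMPLE = [
    "2019-03-02 | B00X4WHP5E | Great blender, 5 stars, would buy again!",
    "2019-03-03 | B00X4WHP5E | Broke after a week.",
    "this line has no delimiters and is rejected",
]

if __name__ == "__main__":
    rows, rejected = run_pipeline(SAMPLE)
    print("\n".join(rows))
    print(f"rejected {rejected} malformed line(s)")
```

For a demo, the raw files and the cleansed output could both be kept in HDFS, with a tab-delimited Hive external table defined over the output directory. That way the "lake" holds the original unstructured data alongside a queryable structured view of it.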