amazon-s3, amazon-ec2, hdfs, data-lake

What is the difference between a data lake with HDFS or S3 in AWS?


I need to build a data lake on AWS, but I don't know exactly how S3 differs from HDFS. I found some answers on the Internet, but I still don't understand the real difference.

I would also like to know whether anyone has an example data lake architecture using HDFS and S3 on AWS.


Solution

  • HDFS is only accessible to the Hadoop cluster in which it exists. If the cluster is shut down or terminated, the data in HDFS is lost.

    Data in Amazon S3:

    • Remains available at all times (it cannot be 'turned off')
    • Is accessible to multiple clusters
    • Is accessible to other AWS services, such as Amazon Athena (which is 'Presto as a service', so you might not even need a Hadoop cluster)
    • Offers multiple storage classes, so less-frequently accessed data can be stored at a lower cost
    • Has no limit on total storage (whereas HDFS is limited to the storage available in the Hadoop cluster)
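
To make the storage-class point concrete, here is a minimal sketch of a lifecycle rule that moves older objects to cheaper classes. The function names, the `raw/events/` prefix, the bucket name, and the 30/365-day thresholds are all assumptions chosen for illustration, not part of the original answer. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` expects; the call itself is left commented out because it needs AWS credentials and an existing bucket.

```python
import json


def build_lifecycle_config(prefix: str,
                           days_until_infrequent: int = 30,
                           days_until_archive: int = 365) -> dict:
    """Build an S3 lifecycle configuration that tiers objects under `prefix`.

    Hypothetical helper: the thresholds and rule ID format are illustrative.
    """
    if days_until_archive <= days_until_infrequent:
        raise ValueError("archive transition must come after the infrequent-access one")
    rule_id = "tier-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Cheaper storage for data that is read less often
                    {"Days": days_until_infrequent, "StorageClass": "STANDARD_IA"},
                    # Archival storage for data that is rarely read at all
                    {"Days": days_until_archive, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle_config(bucket: str, config: dict) -> None:
    """Apply the configuration to a bucket (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=config,
    )


if __name__ == "__main__":
    config = build_lifecycle_config("raw/events/")
    print(json.dumps(config, indent=2))
    # Assumed bucket name; uncomment only against a bucket you own.
    # apply_lifecycle_config("my-data-lake-bucket", config)
```

HDFS has no direct equivalent of this kind of per-object tiering managed by the storage service itself, which is one reason S3 is often used as the long-term store while clusters are treated as disposable compute.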