Tags: hadoop, amazon-s3, hdfs, rdbms, amazon-athena

Is it possible to update data already written in S3?


I am considering replacing the Hadoop cluster I currently use with S3, but before doing so, I want to know whether it is possible to UPDATE data already written to S3.

Hadoop uses HDFS, which is write-once, read-many, so it does not allow me to UPDATE data already written to it. I have an RDB that I thought of integrating into Hadoop, but could not, because this RDB needs to be updated in a timely manner. I have heard that with S3 you can employ Athena or other middleware that may allow UPDATEs, which might solve the issue I mentioned with Hadoop.


Solution

  • You should look at Amazon EMR:

    Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, you can process data for analytics purposes and business intelligence workloads. Additionally, you can use Amazon EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon S3 and Amazon DynamoDB.

    It can provide a managed Hadoop environment and it can directly use data stored in Amazon S3.

    Amazon S3 is an object-storage service. Unlike a file on your local disk, which you could open in an editor and change one byte at a time, any update to an object in Amazon S3 requires the whole object to be replaced. Systems like Hadoop and Amazon Athena generally append data by adding additional files in the same directory, but this approach does not lend itself to updating or deleting data. For that, it is generally easier to copy the data to a new table (CREATE TABLE AS) while making the updates.
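    To make the whole-object-replacement constraint concrete, here is a minimal sketch of the rewrite-on-update pattern. The `bucket`, `put_object`, `get_object`, and `update_record` names are hypothetical stand-ins (an in-memory dict simulates the object store), not real boto3 calls; the point is that "updating" one row means reading the entire object, modifying it, and writing the entire object back:

    ```python
    import csv
    import io

    # Simulated object store: key -> bytes. Real S3 behaves the same way
    # for our purposes here: there is no partial/in-place update API.
    bucket = {}

    def put_object(key: str, body: bytes) -> None:
        bucket[key] = body  # whole-object write, like S3 PutObject

    def get_object(key: str) -> bytes:
        return bucket[key]

    def update_record(key: str, match_id: str, new_value: str) -> None:
        """'Update' one CSV row by rewriting the entire object."""
        rows = list(csv.reader(io.StringIO(get_object(key).decode())))
        for row in rows:
            if row[0] == match_id:
                row[1] = new_value
        out = io.StringIO()
        csv.writer(out).writerows(rows)
        put_object(key, out.getvalue().encode())  # replace the whole object

    put_object("table/data.csv", b"1,alice\r\n2,bob\r\n")
    update_record("table/data.csv", "2", "carol")
    ```

    This is essentially what CREATE TABLE AS does at table scale: the "updated" data lands in new objects, and the old ones are discarded.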

    The only system I have seen that allows updates is Delta Lake, by Databricks.