I have huge CSV files of ~15GB in AWS S3 (s3://bucket1/rawFile.csv). Let's say the schema looks like this:
cust_id, account_num, paid_date, cust_f_name
1001, 1234567890, 01/01/2001, Jonathan
I am trying to mask the account number column and the customer name, create a new maskedFile.csv, and store it in another AWS S3 bucket (s3://bucket2/maskedFile.csv), as follows:
cust_id, account_num, paid_date, cust_f_name
1001, 123*******, 01/01/2001, Jon*******
This needs to be done just once with one snapshot of the payment data.
How can I do this, and what tools should I use to achieve it? Please let me know.
AWS Glue is AWS' managed ETL and data catalog tool, and it was made for exactly this kind of task.
You point it at the source folder on S3, tell it the destination folder where you want the results to land, and it guides you through the transformations you want. Basically, if you can write a bit of Python, you can do a simple masking transform in no time; see the sketch below.
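A Glue job is essentially a PySpark script, so the masking itself is only a few lines. Here is a minimal sketch, assuming the column names from your sample header and a mask rule of "keep the first 3 characters and right-pad with * to 10 characters" (which is what your example output looks like); adjust the rule and paths to your real requirements:

```python
# Minimal sketch of a Glue PySpark job for this masking step (not production-hardened).
# Bucket names/paths come from the question; the mask rule (keep first 3 chars,
# pad with '*' to 10 characters) is an assumption based on the sample output.
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glueContext = GlueContext(SparkContext())
spark = glueContext.spark_session

# Read the raw CSV straight from S3; a header row is assumed.
df = spark.read.option("header", "true").csv("s3://bucket1/rawFile.csv")

def mask(col_name):
    # Keep the first 3 characters and right-pad with '*' to a fixed width of 10.
    return F.rpad(F.substring(F.col(col_name), 1, 3), 10, "*")

masked = (df
          .withColumn("account_num", mask("account_num"))
          .withColumn("cust_f_name", mask("cust_f_name")))

# Write the masked snapshot to the destination bucket. Note that Spark writes a
# folder of part files rather than a single maskedFile.csv; repartition(1) would
# give one file, but that's slow for a ~15GB input.
masked.write.option("header", "true").mode("overwrite").csv("s3://bucket2/masked/")
```

You'd paste something like this into the script editor when creating the Glue job, give the job an IAM role with access to both buckets, and run it.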
Once that's set up, Glue will automatically transform any new file you drop into the source folder, so you have not only created the code necessary to do the masking, you have a completely automated pipeline that runs when new data arrives. I know your case only calls for it to run once, but writing one-off code really isn't much easier than setting up the job.
To see an example of using Glue to set up a simple ETL job, take a look at: https://gorillalogic.com/blog/in-search-of-happiness-a-quick-etl-use-case-with-aws-glue-redshift/. And there are plenty of other tutorials out there to get you started.