python amazon-s3 cron airflow amazon-eks

Merge millions of S3 files generated hourly

I have millions of files being created each hour. Each file has one line of data. These files need to be merged into a single file.

I have tried the following:

Use aws s3 cp to download the files for the hour.
Merge the files, using either a bash command or a Python script.

This hourly job runs in Airflow on Kubernetes (EKS). It takes more than an hour to complete, which creates a backlog. It also often causes the EC2 node to stop responding because of high CPU and memory usage. What is the most efficient way to run this job?

The Python script, for reference:

from os import listdir
import sys
# from tqdm import tqdm

files = listdir('./temp/')
dest = sys.argv[1]

data = []

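tot_len = len(files)
# Report progress roughly every 1%; max() avoids a modulo-by-zero
# when there are fewer than 100 files.
percent = max(tot_len // 100, 1)

for i, file in enumerate(files):
    if i % percent == 0:
        print(f'{100 * i // tot_len}% complete.')
    with open('./temp/' + file, 'r') as f:
        # Every file's contents stay in memory until the final write.
        d = f.read()
        data.append(d)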

result = '\n'.join(data)

with open(dest, 'w') as f:
    f.write(result)

Solution

Putting this out there in case someone else needs it.

I optimized the merging code as far as I could, but the bottleneck was still downloading the S3 files, which is slow even with the official AWS CLI.
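
As a rough illustration of what a lower-memory merge could look like (a sketch only, not the optimized code referred to above), each file can be streamed straight into the destination instead of being collected in a list first. The function name `merge_files` and the demo file names are hypothetical. The sketch assumes each input holds a single line and that the destination file lives outside the source directory. Output order follows `os.scandir`, which is arbitrary.

```python
import os
import tempfile


def merge_files(src_dir, dest_path):
    """Stream every regular file in src_dir into dest_path, one line per input file."""
    count = 0
    with open(dest_path, "w") as out:
        # os.scandir yields entries lazily, so no list of millions of names is built.
        with os.scandir(src_dir) as entries:
            for entry in entries:
                if not entry.is_file():
                    continue
                with open(entry.path, "r") as f:
                    # Normalise the trailing newline so each input becomes exactly one line.
                    out.write(f.read().rstrip("\n") + "\n")
                count += 1
    return count


if __name__ == "__main__":
    # Small self-contained demo using three throwaway files.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "temp")
        os.mkdir(src)
        for i in range(3):
            with open(os.path.join(src, f"part-{i}.txt"), "w") as f:
                f.write(f"row {i}\n")
        dest = os.path.join(tmp, "merged.txt")
        print(merge_files(src, dest), "files merged")
```

Memory use stays roughly constant this way, but it does nothing for download time, which was the actual bottleneck here.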

I found s5cmd, a command-line tool that is much faster because it makes full use of parallelism (multiprocessing and multithreading), and it solved my problem.

Link: https://github.com/peak/s5cmd
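
For anyone calling s5cmd from a Python task (for example inside an Airflow job), a minimal sketch might look like the following. It assumes the `s5cmd` binary is installed and on the PATH. The bucket name, prefix layout, and helper names (`build_s5cmd_copy`, `download_hour`) are hypothetical placeholders, not taken from the setup above. It relies only on the `s5cmd cp` command with a wildcard source.

```python
import shutil
import subprocess


def build_s5cmd_copy(bucket, prefix, dest_dir):
    """Return the argv that copies every object under a prefix into dest_dir."""
    source = f"s3://{bucket}/{prefix.strip('/')}/*"
    # Passing a list (no shell) leaves the wildcard for s5cmd itself to expand.
    return ["s5cmd", "cp", source, dest_dir.rstrip("/") + "/"]


def download_hour(bucket, prefix, dest_dir):
    """Run the copy; raises if s5cmd is missing or exits with an error."""
    if shutil.which("s5cmd") is None:
        raise RuntimeError("s5cmd binary not found on PATH")
    subprocess.run(build_s5cmd_copy(bucket, prefix, dest_dir), check=True)


if __name__ == "__main__":
    # Hypothetical bucket and hourly prefix, shown only to illustrate the command shape.
    print(build_s5cmd_copy("my-bucket", "events/2024/01/01/00", "./temp"))
```

Keeping the command construction in its own function makes it easy to check without touching S3.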