I have millions of files being created each hour. Each file has one line of data. These files need to be merged into a single file.
I have tried doing this in the following way:-
This hourly job is being run in Airflow on Kubernetes(EKS). This takes more than one hour to complete and is creating a backlog. Other problem is that it often causes the EC2 Node to stop responding due to high CPU and memory usage. What is the most efficient way of running this job?
The python script for reference:-
from os import listdir
import sys
# from tqdm import tqdm
files = listdir('./temp/')
dest = sys.argv[1]
data = []
tot_len = len(files)
percent = tot_len//100
for i, file in enumerate(files):
if(i % percent == 0):
print(f'{i/percent}% complete.')
with open('./temp/'+file, 'r') as f:
d = f.read()
data.append(d)
result = '\n'.join(data)
with open(dest, 'w') as f:
f.write(result)
Putting this out there in case someone else needs it.
I optimized the merging code to the best of my ability but still the bottleneck was reading or downloading the s3 files which is pretty slow using even the official aws cli.
I found a library s5cmd which is pretty fast as it makes full use of multiprocessing and multithreading and it solved my problem.
Link :- https://github.com/peak/s5cmd