I have a plain text file of 500 million rows (~27 GB) stored on AWS S3. I used the code below and it has been running for the last 3 hours. I looked for an encoding method in PySpark and didn't find such a function. Here is the code:
import pandas as pd
import hashlib
import csv

chunksize = 10000000

def convert_email(email):
    cp500_email = email.encode('cp500')
    sha1 = hashlib.sha1(cp500_email).hexdigest()
    return email, cp500_email, sha1

reader = pd.read_csv("s3://bucket/cp500_input.csv", chunksize=chunksize, sep='|')

for chunk in reader:
    chunk[['EMAIL', 'CP500_EMAIL', 'SHA']] = chunk['EMAIL'].apply(convert_email).apply(pd.Series)
    chunk[['EMAIL', 'CP500_EMAIL', 'SHA']].to_csv("s3://bucket/cp500_output.csv")
Yes, given the memory limitations, pandas is not the most efficient way to process a file of this size. PySpark can handle the large file in a distributed manner, which gives much better throughput. Do something like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StructType, StructField, StringType
import hashlib

spark = SparkSession.builder.appName("Email Conversion").getOrCreate()

# Return all three values as a struct so each one can be selected as its own column.
# CP500_EMAIL is hex-encoded because Spark's CSV writer cannot store raw binary.
output_schema = StructType([
    StructField("EMAIL", StringType()),
    StructField("CP500_EMAIL", StringType()),
    StructField("SHA", StringType()),
])

def convert_email(email):
    cp500_email = email.encode('cp500')
    sha1 = hashlib.sha1(cp500_email).hexdigest()
    return email, cp500_email.hex(), sha1

convert_email_udf = udf(convert_email, output_schema)

df = spark.read.csv("s3://bucket/cp500_input.csv", header=True, sep='|')
df = df.withColumn("converted", convert_email_udf(col("EMAIL")))
df = df.select("converted.EMAIL", "converted.CP500_EMAIL", "converted.SHA")

# Note: Spark writes a directory of part files at this path, not a single CSV file.
df.write.csv("s3://bucket/cp500_output.csv", header=True)
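If the Python UDF is still the bottleneck, here is a minimal alternative sketch (assuming the same column names and S3 paths as above): keep only the cp500 encoding in Python and let Spark's built-in sha1 and hex functions do the hashing and hex conversion on the JVM side, so less work happens per row in Python.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import BinaryType

spark = SparkSession.builder.appName("Email Conversion").getOrCreate()

# Only the cp500 encoding runs in Python; hashing and hex conversion stay in the JVM.
to_cp500 = F.udf(lambda email: email.encode('cp500') if email is not None else None,
                 BinaryType())

df = spark.read.csv("s3://bucket/cp500_input.csv", header=True, sep='|')
df = (df
      .withColumn("CP500_BYTES", to_cp500(F.col("EMAIL")))
      .withColumn("CP500_EMAIL", F.hex(F.col("CP500_BYTES")))   # hex string, safe for CSV output
      .withColumn("SHA", F.sha1(F.col("CP500_BYTES")))          # built-in SHA-1, returns hex digest
      .select("EMAIL", "CP500_EMAIL", "SHA"))

df.write.csv("s3://bucket/cp500_output.csv", header=True)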
Good luck!