I have a plain text file of 500 million rows (~27 GB) stored on AWS S3. I used the code below and it has been running for the last 3 hours. I looked for an encoding method in PySpark and didn't find such a function. Here is the code:
import pandas as pd
import hashlib
import csv

chunksize = 10000000

def convert_email(email):
    cp500_email = email.encode('cp500')
    sha1 = hashlib.sha1(cp500_email).hexdigest()
    return email, cp500_email, sha1

reader = pd.read_csv("s3://bucket/cp500_input.csv", chunksize=chunksize, sep='|')

for chunk in reader:
    chunk[['EMAIL', 'CP500_EMAIL', 'SHA']] = chunk['EMAIL'].apply(convert_email).apply(pd.Series)
    chunk[['EMAIL', 'CP500_EMAIL', 'SHA']].to_csv("s3://bucket/cp500_output.csv")
Yes, given the memory limitations, pandas is not the most efficient way to process a file of this size. PySpark can handle the large file in a distributed manner, which gives much better throughput. Do something like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StructType, StructField, StringType
import hashlib

spark = SparkSession.builder.appName("Email Conversion").getOrCreate()

# Return all three values as a struct so each one can be selected as its own column.
# CP500_EMAIL is hex-encoded because Spark's CSV writer cannot store raw binary.
output_schema = StructType([
    StructField("EMAIL", StringType()),
    StructField("CP500_EMAIL", StringType()),
    StructField("SHA", StringType()),
])

def convert_email(email):
    cp500_email = email.encode('cp500')
    sha1 = hashlib.sha1(cp500_email).hexdigest()
    return email, cp500_email.hex(), sha1

convert_email_udf = udf(convert_email, output_schema)

df = spark.read.csv("s3://bucket/cp500_input.csv", header=True, sep='|')
df = df.withColumn("converted", convert_email_udf(col("EMAIL")))
df = df.select("converted.EMAIL", "converted.CP500_EMAIL", "converted.SHA")

# Note: Spark writes a directory of part files at this path, not a single CSV file.
df.write.csv("s3://bucket/cp500_output.csv", header=True)
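If the Python UDF is still the bottleneck, here is a minimal alternative sketch (assuming the same column names and S3 paths as above): keep only the cp500 encoding in Python and let Spark's built-in sha1 and hex functions do the hashing and hex conversion on the JVM side, so less work happens per row in Python.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import BinaryType

spark = SparkSession.builder.appName("Email Conversion").getOrCreate()

# Only the cp500 encoding runs in Python; hashing and hex conversion stay in the JVM.
to_cp500 = F.udf(lambda email: email.encode('cp500') if email is not None else None,
                 BinaryType())

df = spark.read.csv("s3://bucket/cp500_input.csv", header=True, sep='|')
df = (df
      .withColumn("CP500_BYTES", to_cp500(F.col("EMAIL")))
      .withColumn("CP500_EMAIL", F.hex(F.col("CP500_BYTES")))   # hex string, safe for CSV output
      .withColumn("SHA", F.sha1(F.col("CP500_BYTES")))          # built-in SHA-1, returns hex digest
      .select("EMAIL", "CP500_EMAIL", "SHA"))

df.write.csv("s3://bucket/cp500_output.csv", header=True)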
Good luck!