amazon-web-services, aws-lambda, aws-glue

How do I read a CSV file from S3 row by row in an AWS Glue job?


Hi, I am very new to AWS.

I am trying to retrieve a 5 GB CSV file that I have stored in an S3 bucket, do ETL on it, and load it into a DynamoDB table using AWS Glue. My Glue job is a pure Python shell job, not using Spark.

My problem is that when I try to retrieve the file, I get a FileNotFoundError. Here is my code:

import boto3
import logging
import csv
import s3fs

from boto3 import client
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError

csv_file_path = 's3://my_s3_bucket/mycsv_file.csv'

A few lines down, within my class:

with open(self.csv_file_path, "r") as input:  # FileNotFoundError is raised here
    csv_reader = csv.reader(input, delimiter='^', quoting=csv.QUOTE_NONE)

    for row in csv_reader:
        ...  # row processing elided

The FileNotFoundError happens inside the with open block, even though the file is definitely there. I really do not want to use pandas; we've had problems working with pandas within Glue. Since this is a 5 GB file, I can't store it all in memory, which is why I'm trying to open it and read it row by row.
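For context, the built-in open() only resolves local filesystem paths, which is why the s3:// URL above fails with FileNotFoundError. Since s3fs is already imported, one possible streaming approach goes through its filesystem interface instead; this is just a sketch with the same bucket, key, and '^' delimiter as above:

fs = s3fs.S3FileSystem()  # picks up the job's IAM credentials by default

# s3fs understands s3:// URLs, unlike the built-in open()
with fs.open('s3://my_s3_bucket/mycsv_file.csv', 'r') as f:
    csv_reader = csv.reader(f, delimiter='^', quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        ...  # transform the row here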

I would really appreciate any help with this.

Also, I have the correct IAM permissions set up for Glue and everything.


Solution

  • I figured it out.

    You have to use the S3 client from boto3 instead of the built-in open():

    s3 = boto3.client('s3')

    # fetch the object; 'bucket_name' and 'file_name' are placeholders
    file = s3.get_object(Bucket='bucket_name', Key='file_name')

    # read() returns the entire object body as bytes
    lines = file['Body'].read().decode('utf-8').splitlines(True)

    # the question's file uses '^' as its delimiter, so adjust as needed
    csv_reader = csv.reader(lines, delimiter=',', quoting=csv.QUOTE_NONE)
    

    Then just loop over csv_reader with a for loop to process each row, as in the sketch below.
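    For example, here is a minimal sketch of that loop. Since read() above pulls the entire object into memory at once, which is a problem for a 5 GB file, this variant streams the body line by line with the StreamingBody's iter_lines() instead; the bucket, key, and table names are placeholders, and it assumes the '^' delimiter from the question:

    import csv

    import boto3

    s3 = boto3.client('s3')
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('my_table')  # placeholder table name

    response = s3.get_object(Bucket='bucket_name', Key='file_name')

    # iter_lines() yields the body one line at a time as bytes,
    # so the whole 5 GB object never sits in memory at once
    line_stream = (line.decode('utf-8') for line in response['Body'].iter_lines())

    csv_reader = csv.reader(line_stream, delimiter='^', quoting=csv.QUOTE_NONE)

    # batch_writer() buffers put_item calls and flushes them in batches
    with table.batch_writer() as batch:
        for row in csv_reader:
            # hypothetical mapping from CSV columns to table attributes
            batch.put_item(Item={'id': row[0], 'value': row[1]})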