Tags: amazon-s3, boto, unzip

Unzip a file to s3


I am looking for a simple way to extract a zip/gzip present in an s3 bucket to the same bucket location and delete the parent zip/gzip file post extraction.

I am unable to achieve this with any of the APIs currently.

I have tried native boto, pyfilesystem (fs), and s3fs. The source and destination links seem to be an issue for these functions.

(Using Python 2.x/3.x and Boto 2.x)

I see there is an API for node.js (unzip-to-s3) to do this job, but none for Python.

A couple of implementations I can think of:

  1. A simple API to extract the zip file within the same bucket (a rough sketch of what I mean is below).
  2. Use s3 as a filesystem and manipulate the data.
  3. Use a data pipeline to achieve this.
  4. Transfer the zip to ec2, extract it, and copy it back to s3.

Option 4 would be the least preferred, to minimise the architectural overhead of adding ec2.
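
For reference, the in-bucket extraction I have in mind for option 1 would look roughly like the sketch below (boto 2, with placeholder bucket/key names; the whole zip is read into memory, so this would only suit reasonably small archives):

    import io
    import zipfile

    import boto
    from boto.s3.key import Key

    def unzip_in_place(bucket_name, zip_key_name):
        # pull the zip into memory, write each member back next to it, then delete the zip
        conn = boto.connect_s3()
        bucket = conn.get_bucket(bucket_name)
        zip_key = bucket.get_key(zip_key_name)

        data = io.BytesIO(zip_key.get_contents_as_string())
        prefix = zip_key_name.rsplit('/', 1)[0] + '/' if '/' in zip_key_name else ''

        with zipfile.ZipFile(data, 'r') as zf:
            for member in zf.namelist():
                out_key = Key(bucket)
                out_key.key = prefix + member
                out_key.set_contents_from_string(zf.read(member))

        bucket.delete_key(zip_key_name)  # remove the parent zip after extraction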

I need support in getting this feature implemented, with integration into lambda at a later stage. Any pointers to these implementations are greatly appreciated.

Thanks in Advance,

Sundar.


Solution

  • Sample to unzip to a local directory on an ec2 instance

    import os
    import tarfile
    import zipfile

    def s3Unzip(srcBucket, dst_dir):
        '''
        Decompress the contents of an s3 bucket to the local machine.

        Args:
            srcBucket (string): source bucket name
            dst_dir (string): destination directory on the local/ec2 file system

        Returns:
            None
        '''
        s3 = s3Conn  # s3Conn: existing boto s3 connection, defined at module level
        bucket = s3.lookup(srcBucket)

        # make sure the destination directory exists before downloading into it
        try:
            os.makedirs(dst_dir)
            logger_s3.info("local directories created")
        except OSError:
            logger_s3.warning("could not create local directories to extract to; folder may already exist")

        for key in bucket:
            # download each object, then pick an opener based on its extension
            path = os.path.join(dst_dir, key.name)
            key.get_contents_to_filename(path)
            if path.endswith('.zip'):
                opener, mode = zipfile.ZipFile, 'r'
            elif path.endswith('.tar.gz') or path.endswith('.tgz'):
                opener, mode = tarfile.open, 'r:gz'
            elif path.endswith('.tar.bz2') or path.endswith('.tbz'):
                opener, mode = tarfile.open, 'r:bz2'
            else:
                raise ValueError('unsupported format')

            cwd = os.getcwd()
            os.chdir(dst_dir)
            try:
                archive = opener(path, mode)
                try:
                    archive.extractall()
                finally:
                    archive.close()
                logger_s3.info('(%s) extracted successfully to %s' % (key, dst_dir))
            except Exception:
                logger_s3.error('failed to extract (%s) to %s' % (key, dst_dir))
            finally:
                os.chdir(cwd)
        s3.close()
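
    A minimal way to call this (assuming boto 2, and that `s3Conn` and `logger_s3` are defined at module level as the snippet implies) might be:

    import logging

    import boto

    logger_s3 = logging.getLogger('s3unzip')
    s3Conn = boto.connect_s3()  # credentials picked up from the environment/boto config

    s3Unzip('my-source-bucket', '/tmp/extracted')  # placeholder bucket name and local path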
    

  • Sample code to upload to a mysql instance

    Use the "LOAD DATA LOCAL INFILE" query to upload to mysql directly.

    def upload(file_path, timeformat):
        '''
        Upload csv file data to a mysql rds instance.

        Args:
            file_path (list): local csv file paths to load
            timeformat (string): str_to_date() format of the @datetime column

        Returns:
            None
        '''
        con = connect()  # connect(): helper that returns a MySQL connection
        cursor = con.cursor()
        for file in file_path:
            try:
                # build and run a LOAD DATA LOCAL INFILE statement for this file
                qry = """LOAD DATA LOCAL INFILE '%s' INTO TABLE xxxx FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (col1, col2, col3, @datetime, col4) set datetime = str_to_date(@datetime,'%s');""" % (file, timeformat)
                cursor.execute(qry)
                con.commit()
                logger_rds.info("Loading file:" + file)
            except Exception:
                logger_rds.error("Exception in uploading " + file)
                # rollback in case there is any error
                con.rollback()
        cursor.close()
        # disconnect from server
        con.close()
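
    A rough way to call this (assuming `connect()` is a small helper around pymysql with `local_infile` enabled, and `logger_rds` is a module-level logger; the host, credentials, table and column names above are placeholders) could be:

    import logging

    import pymysql

    logger_rds = logging.getLogger('rds_upload')

    def connect():
        # LOAD DATA LOCAL INFILE needs local_infile enabled on both client and server
        return pymysql.connect(host='xxxx.rds.amazonaws.com', user='dbuser',
                               passwd='dbpassword', db='mydb', local_infile=True)

    upload(['/tmp/extracted/data1.csv', '/tmp/extracted/data2.csv'], '%Y-%m-%d %H:%i:%S')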