Tags: amazon-s3, boto, unzip

Unzip a file to s3


I am looking for a simple way to extract a zip/gzip present in an s3 bucket to the same bucket location and delete the parent zip/gzip file post extraction.

I am unable to achieve this with any of the APIs currently.

I have tried native boto, pyfilesystem (fs), and s3fs. The source and destination links seem to be an issue for these functions.

(Using Python 2.x/3.x and Boto 2.x)

I see there is an API for node.js (unzip-to-s3) to do this job, but none for Python.

A couple of implementations I can think of:

  1. A simple API to extract the zip file within the same bucket (a rough sketch of what I mean is below).
  2. Use s3 as a filesystem and manipulate the data.
  3. Use a data pipeline to achieve this.
  4. Transfer the zip to ec2, extract it, and copy it back to s3.

Option 4 would be the least preferred, to minimise the architectural overhead of adding ec2.
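
For reference, the in-bucket extraction I have in mind for option 1 would look roughly like the sketch below (boto 2, with placeholder bucket/key names; the whole zip is read into memory, so this would only suit reasonably small archives):

    import io
    import zipfile

    import boto
    from boto.s3.key import Key

    def unzip_in_place(bucket_name, zip_key_name):
        # pull the zip into memory, write each member back next to it, then delete the zip
        conn = boto.connect_s3()
        bucket = conn.get_bucket(bucket_name)
        zip_key = bucket.get_key(zip_key_name)

        data = io.BytesIO(zip_key.get_contents_as_string())
        prefix = zip_key_name.rsplit('/', 1)[0] + '/' if '/' in zip_key_name else ''

        with zipfile.ZipFile(data, 'r') as zf:
            for member in zf.namelist():
                out_key = Key(bucket)
                out_key.key = prefix + member
                out_key.set_contents_from_string(zf.read(member))

        bucket.delete_key(zip_key_name)  # remove the parent zip after extraction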

I need support in getting this feature implemented, with integration into lambda at a later stage. Any pointers to these implementations are greatly appreciated.

Thanks in Advance,

Sundar.


Solution

  • Sample to unzip to a local directory on an ec2 instance

    import os
    import tarfile
    import zipfile

    def s3Unzip(srcBucket, dst_dir):
        '''
        Decompress the contents of an s3 bucket to the local machine.

        Args:
            srcBucket (string): source bucket name
            dst_dir (string): destination directory on the local/ec2 file system

        Returns:
            None
        '''
        s3 = s3Conn  # s3Conn: existing boto s3 connection, defined at module level
        bucket = s3.lookup(srcBucket)

        # make sure the destination directory exists before downloading into it
        try:
            os.makedirs(dst_dir)
            logger_s3.info("local directories created")
        except OSError:
            logger_s3.warning("could not create local directories to extract to; folder may already exist")

        for key in bucket:
            # download each object, then pick an opener based on its extension
            path = os.path.join(dst_dir, key.name)
            key.get_contents_to_filename(path)
            if path.endswith('.zip'):
                opener, mode = zipfile.ZipFile, 'r'
            elif path.endswith('.tar.gz') or path.endswith('.tgz'):
                opener, mode = tarfile.open, 'r:gz'
            elif path.endswith('.tar.bz2') or path.endswith('.tbz'):
                opener, mode = tarfile.open, 'r:bz2'
            else:
                raise ValueError('unsupported format')

            cwd = os.getcwd()
            os.chdir(dst_dir)
            try:
                archive = opener(path, mode)
                try:
                    archive.extractall()
                finally:
                    archive.close()
                logger_s3.info('(%s) extracted successfully to %s' % (key, dst_dir))
            except Exception:
                logger_s3.error('failed to extract (%s) to %s' % (key, dst_dir))
            finally:
                os.chdir(cwd)
        s3.close()
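
    A minimal way to call this (assuming boto 2, and that `s3Conn` and `logger_s3` are defined at module level as the snippet implies) might be:

    import logging

    import boto

    logger_s3 = logging.getLogger('s3unzip')
    s3Conn = boto.connect_s3()  # credentials picked up from the environment/boto config

    s3Unzip('my-source-bucket', '/tmp/extracted')  # placeholder bucket name and local path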
    

  • Sample code to upload to a mysql instance

    Use the "LOAD DATA LOCAL INFILE" query to upload to mysql directly.

    def upload(file_path, timeformat):
        '''
        Upload csv file data to a mysql rds instance.

        Args:
            file_path (list): local csv file paths to load
            timeformat (string): str_to_date() format of the @datetime column

        Returns:
            None
        '''
        con = connect()  # connect(): helper that returns a MySQL connection
        cursor = con.cursor()
        for file in file_path:
            try:
                # build and run a LOAD DATA LOCAL INFILE statement for this file
                qry = """LOAD DATA LOCAL INFILE '%s' INTO TABLE xxxx FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (col1, col2, col3, @datetime, col4) set datetime = str_to_date(@datetime,'%s');""" % (file, timeformat)
                cursor.execute(qry)
                con.commit()
                logger_rds.info("Loading file:" + file)
            except Exception:
                logger_rds.error("Exception in uploading " + file)
                # rollback in case there is any error
                con.rollback()
        cursor.close()
        # disconnect from server
        con.close()
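
    A rough way to call this (assuming `connect()` is a small helper around pymysql with `local_infile` enabled, and `logger_rds` is a module-level logger; the host, credentials, table and column names above are placeholders) could be:

    import logging

    import pymysql

    logger_rds = logging.getLogger('rds_upload')

    def connect():
        # LOAD DATA LOCAL INFILE needs local_infile enabled on both client and server
        return pymysql.connect(host='xxxx.rds.amazonaws.com', user='dbuser',
                               passwd='dbpassword', db='mydb', local_infile=True)

    upload(['/tmp/extracted/data1.csv', '/tmp/extracted/data2.csv'], '%Y-%m-%d %H:%i:%S')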