python amazon-web-services amazon-s3 boto3

How to combine same files in multiple folders into one file in S3


If I have the same file in multiple folders in S3, how do I combine the copies using boto3 in Python?

Say in a bucket I have

bucket_a
   ts
     ts_folder
          a_date.csv
          b_date.csv
          c_date.csv
          d_date.csv

     ts_folder2
          a_date.csv
          b_date.csv
          c_date.csv
          d_date.csv

I need to combine each pair of same-named files (for example, the two copies of a_date.csv) into one file, skipping the header of the second file.

I am trying to figure out how to achieve this using boto3 in Python or other AWS tools.


Solution

  • Try something like this. I assume your AWS credentials are set up properly on your system. First, collect the lines of the first CSV in a list. For the second CSV, skip the first line (the header). Then join all the lines into a single string so it can be written to an S3 object.

    import boto3
    # Output will contain the CSV lines
    output = []
    with open("first.csv", "r") as fh:
        output.extend(fh.readlines())
    with open("second.csv", "r") as fh:
        # Skip header
        output.extend(fh.readlines()[1:])
    
    # Combine the lines as string
    body = "".join(output)
    # Create the S3 client (assuming credentials are setup)
    s3_client = boto3.client("s3")
    # Write the object
    s3_client.put_object(Bucket="my-bucket",
                         Key="combined.csv",
                         Body=body)
    

    Update: the following reads the CSV files from S3 instead of the local disk, using a named profile.

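    import boto3
    session = boto3.session.Session(profile_name='dev')
    s3_client = session.client("s3")
    
    bucket = "my-bucket"
    
    # Collect the keys of all CSV objects under the prefix.
    # Note: a single list_objects_v2 call returns at most 1,000 keys.
    files = []
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix="ts/")
    for item in response.get('Contents', []):
        if item['Key'].endswith(".csv"):
            files.append(item['Key'])
    
    output = []
    for index, file in enumerate(files):
        # read() returns bytes, so decode first (assuming UTF-8 files)
        body = s3_client.get_object(Bucket=bucket,
                                    Key=file)["Body"].read().decode("utf-8")
        lines = body.splitlines(keepends=True)
        # Keep the header of the first file only
        output.extend(lines if index == 0 else lines[1:])
    
    # Combine the lines as string
    outputbody = "".join(output)
    # Write the object
    s3_client.put_object(Bucket=bucket,
                         Key="combined.csv",
                         Body=outputbody.encode("utf-8"))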
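
The question asks for same-named files from different folders to be combined, so one possible variation is to group the keys by file name and write one merged object per name. The following is only a minimal sketch under some assumptions: the files are UTF-8, each has a single-line header, and the merged objects go under a hypothetical `combined/` prefix that lies outside the listed prefix (otherwise a second run would pick up its own output). The helper names `group_keys_by_filename`, `merge_csv_texts` and `combine_by_filename` are made up for illustration; only `get_paginator`, `get_object` and `put_object` are boto3 client calls.

```python
import posixpath
from collections import defaultdict


def group_keys_by_filename(keys):
    """Map each CSV file name to the sorted list of keys that share it."""
    groups = defaultdict(list)
    for key in keys:
        if key.endswith(".csv"):
            groups[posixpath.basename(key)].append(key)
    return {name: sorted(paths) for name, paths in groups.items()}


def merge_csv_texts(texts):
    """Concatenate CSV texts, keeping only the header of the first one."""
    parts = []
    for index, text in enumerate(texts):
        if index > 0:
            # Drop everything up to and including the first newline (the header)
            _, _, text = text.partition("\n")
        if text and not text.endswith("\n"):
            # Make sure the next file starts on its own line
            text += "\n"
        parts.append(text)
    return "".join(parts)


def combine_by_filename(s3_client, bucket, prefix, out_prefix):
    """Write one merged object per file name under out_prefix."""
    # The paginator follows continuation tokens, so more than 1,000 keys work
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(item["Key"] for item in page.get("Contents", []))

    for name, paths in group_keys_by_filename(keys).items():
        texts = [
            s3_client.get_object(Bucket=bucket, Key=path)["Body"]
            .read()
            .decode("utf-8")  # assumption: files are UTF-8
            for path in paths
        ]
        s3_client.put_object(
            Bucket=bucket,
            Key=out_prefix + name,
            Body=merge_csv_texts(texts).encode("utf-8"),
        )
```

With a client created as in the answer above, a call such as `combine_by_filename(s3_client, "my-bucket", "ts/", "combined/")` would be expected to produce `combined/a_date.csv`, `combined/b_date.csv`, and so on. Everything is held in memory, so this suits small files; very large files would need a streaming approach instead.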