amazon-web-services, amazon-s3, amazon-kinesis, opensearch

S3 to AWS Kinesis to OpenSearch


I have an S3 bucket with some 10M objects, all of which need to be processed and sent to OpenSearch. To this end, I am evaluating whether Kinesis can be used.

The solutions online seem to imply using Lambda, but since I have 10M objects, I would think the function will time out before the for loop is exhausted.

So, the setup I would like is:

S3 --> (some producer) --> Kinesis Data Streams --> OpenSearch (destination)

What would be the optimal way to go about this, please?


Solution

  • Mark B's answer is definitely a viable option, and I'd suggest configuring your SQS queue to trigger Lambda for each message.
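
    As a rough illustration of the SQS-triggered Lambda idea, here is a minimal sketch of a handler. It assumes each message body is JSON with `bucket` and `key` fields (that shape, and the `process_object` placeholder, are made up for this example), and it assumes "Report batch item failures" is enabled on the event source mapping so that only failed messages are retried.

```python
import json


def process_object(bucket, key):
    """Hypothetical placeholder for the real work (read the object, index it)."""
    print(f"processing s3://{bucket}/{key}")


def handler(event, context, process=process_object):
    """Lambda entry point for an SQS trigger.

    Assumes each message body looks like {"bucket": "...", "key": "..."};
    that shape is an assumption of this sketch, not something SQS defines.
    """
    failures = []
    for record in event["Records"]:
        try:
            body = json.loads(record["body"])
            process(body["bucket"], body["key"])
        except Exception:
            # Report only this message as failed so SQS retries it alone.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

    Because each invocation only handles a small batch of messages, no single function run has to loop over all 10M objects, which sidesteps the timeout concern in the question.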

    Unless you need Kinesis for some ETL functionality, it's likely that you can go from S3 to OpenSearch directly.

    Assuming the docs in S3 are formatted suitably for OpenSearch, I would take one of the following approaches:

    1. AWS Step Functions has a built-in pattern to process items in S3. This would iterate over all the objects in a chosen bucket (or folder, etc.) that match your description. Each object could then be sent to a Lambda function to save its contents to OpenSearch.
      • Assuming you have some ETL or formatting requirements, this would be easy to implement in Lambda.
      • I can't find any documentation for the SFN S3 patterns, but they're available in Workflow Studio.
    2. If you're comfortable with Python, the AWS SDK for Pandas (previously AWS Data Wrangler) is a super easy option. I've used it extensively for moving data from CSVs, S3, and other locations into OpenSearch with ease.
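
    For option 1, the Lambda function that saves one object to OpenSearch could look something like the sketch below. The event shape (`Bucket` and `Key` fields), the endpoint `my-domain-endpoint`, and the index name `my-index` are all placeholders I've assumed for illustration; authentication is left out, and the objects are assumed to contain a single JSON document each.

```python
import json


def index_object(s3_client, os_client, bucket, key, index):
    """Read one JSON object from S3 and index it into OpenSearch."""
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    doc = json.loads(obj["Body"].read())
    # Using the S3 key as the document id means a retry overwrites
    # the same document instead of creating a duplicate.
    return os_client.index(index=index, id=key, body=doc)


def handler(event, context):
    # Imported lazily so index_object can be tested without AWS libraries.
    import boto3
    from opensearchpy import OpenSearch

    # Placeholder connection details; add your own auth settings here.
    os_client = OpenSearch(
        hosts=[{"host": "my-domain-endpoint", "port": 443}],
        use_ssl=True,
    )
    # Assumed event shape: {"Bucket": "...", "Key": "..."}
    return index_object(
        boto3.client("s3"), os_client, event["Bucket"], event["Key"], "my-index"
    )
```

    Any ETL or formatting step would slot in between parsing `doc` and the call to `os_client.index`.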

    Using the AWS SDK for Pandas, you might achieve what you're looking for like this...

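    import awswrangler as wr
    from opensearchpy import OpenSearch
    
    # read every JSON file under the prefix into one DataFrame
    items = wr.s3.read_json(path="s3://my-bucket/my-folder/")
    
    # connect + upload to OpenSearch
    my_client = OpenSearch(...)  # fill in your domain's connection details
    # index_df needs a target index; "my-index" is a placeholder name
    wr.opensearch.index_df(client=my_client, df=items, index="my-index")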
    

    The AWS SDK for Pandas can iterate over chunks of S3 items, and there's a tutorial on indexing JSON (and other file types) from S3 to OpenSearch.
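
    With 10M objects, loading everything into one DataFrame may not fit in memory, so a chunked variant is probably safer. Here is a minimal sketch, assuming the files are JSON Lines (one document per line), which is what `chunksize` needs; the endpoint, bucket path, index name, and chunk size are placeholders, and `index_in_chunks` is a helper I've made up for this example.

```python
def index_in_chunks(chunks, index_chunk):
    """Send each chunk to index_chunk and return the total number of rows sent."""
    total = 0
    for chunk in chunks:
        index_chunk(chunk)
        total += len(chunk)
    return total


def main():
    # Imported here so the helper above can be tested without awswrangler.
    import awswrangler as wr

    # Placeholder endpoint; wr.opensearch.connect also accepts credentials.
    client = wr.opensearch.connect(host="my-domain-endpoint")

    # With chunksize set, read_json yields DataFrames instead of one big frame.
    chunks = wr.s3.read_json(
        path="s3://my-bucket/my-folder/",
        lines=True,
        chunksize=1000,
    )
    sent = index_in_chunks(
        chunks,
        lambda df: wr.opensearch.index_df(client=client, df=df, index="my-index"),
    )
    print(f"indexed {sent} rows")


if __name__ == "__main__":
    main()
```

    This runs as a plain script (on EC2, ECS, or a laptop), so there is no Lambda timeout to worry about; only one chunk is held in memory at a time.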