I have an S3 bucket with some 10M objects, all of which need to be processed and sent to OpenSearch. To this end, I am evaluating whether Kinesis can be used for this.
The solutions online seem to imply using Lambda, but since I have 10M objects, I would think the function would time out before the for loop is exhausted.
So, the setup I would like is:
S3 --> (some producer) --> Kinesis Data Streams --> OpenSearch (destination)
What would be the optimal way to go about this, please?
Mark B's answer is definitely a viable option, and I'd suggest configuring your SQS queue to trigger Lambda for each message.
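If you go that route, a minimal sketch of the SQS-triggered handler could look like the following. The bucket, index name, and domain endpoint here are placeholders, and each SQS message is assumed to wrap a standard S3 event notification for a single object:

import json
from urllib.parse import unquote_plus

import boto3
from opensearchpy import OpenSearch

s3 = boto3.client("s3")
# Placeholder endpoint; use your own OpenSearch domain endpoint and auth settings.
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

def handler(event, context):
    # Each invocation receives a small batch of SQS messages, never all 10M objects,
    # so the timeout concern from the question goes away.
    for record in event["Records"]:
        s3_event = json.loads(record["body"])  # S3 event notification inside the SQS body
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            doc = json.loads(body)  # assumes each object is a single JSON document
            client.index(index="my-index", body=doc, id=key)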
Unless you need Kinesis for some ETL functionality, it's likely that you can go from S3 to OpenSearch directly.
Assuming the docs in S3 are formatted suitably for OpenSearch, I would take one of the following approaches:
Using the AWS SDK for Pandas, you might achieve what you're looking for like this...
import awswrangler as wr
from opensearchpy import OpenSearch

# read the JSON documents from S3 into a DataFrame
items = wr.s3.read_json(path="s3://my-bucket/my-folder/")

# connect + upload to OpenSearch ("my-index" is a placeholder target index)
my_client = OpenSearch(...)
wr.opensearch.index_df(client=my_client, df=items, index="my-index")
The AWS SDK for Pandas can iterate over chunks of S3 items, and there's a tutorial on indexing JSON (and other file types) from S3 to OpenSearch.
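As a rough sketch of that chunked approach (path, endpoint, and index name are placeholders), passing chunksize to read_json yields an iterator of DataFrames, so the 10M objects never have to fit in memory at once:

import awswrangler as wr

# connect via the SDK's helper (placeholder domain endpoint)
my_client = wr.opensearch.connect(host="my-domain.us-east-1.es.amazonaws.com")

# chunksize makes read_json return an iterator of DataFrames instead of one big frame
for chunk in wr.s3.read_json(path="s3://my-bucket/my-folder/", chunksize=1000):
    wr.opensearch.index_df(client=my_client, df=chunk, index="my-index")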