I need to index data from RDS (MySQL) and S3 (documents) into Elasticsearch in order to perform full-text searches.
I've noted that AWS Kinesis seems ideal for this: it can listen to both S3 and MySQL and stream the formatted results into Elasticsearch.
What I don't understand, however, is how I could bulk-onboard existing data using Kinesis.
For RDS-to-Elasticsearch I've seen go-mysql-elasticsearch suggested as an alternative that would handle this for me, but that still leaves me stuck with gigabytes of S3 data to ingest.
Has anybody solved this problem? I'd rather have as simple a setup as possible.
Thanks
As far as adding metadata to entries in Elasticsearch goes, you're probably thinking of what's sometimes called data "enrichment." There's a very detailed blog post over here that talks about how to ingest and enrich data using both static and dynamic reference data. By using AWS Lambda to enrich your data, you can run dynamic queries against data sources and modify your records before they're ingested into Elasticsearch via Kinesis Firehose.
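As a rough sketch of what that enrichment Lambda looks like: a Firehose transformation handler receives base64-encoded records and must return each one with a `recordId`, a `result` status, and the (possibly modified) re-encoded data. The lookup table here is hypothetical — in practice it might be a DynamoDB or RDS query.

```python
import base64
import json

# Hypothetical static reference data; a real pipeline might query
# DynamoDB or RDS here instead (the "dynamic" enrichment case).
REGION_NAMES = {"us-east-1": "N. Virginia", "eu-west-1": "Ireland"}

def lambda_handler(event, context):
    """Firehose transformation handler: decode, enrich, re-encode."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Attach the extra metadata before the record reaches Elasticsearch.
        payload["region_name"] = REGION_NAMES.get(payload.get("region"), "unknown")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # "Dropped" or "ProcessingFailed" are also valid
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}
```

Firehose buffers the transformed records and delivers them to your Elasticsearch destination; records marked "Dropped" are skipped.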
The Kinesis Data Streams API supports a batch ingestion API called PutRecords. You can ingest up to 500 records into your Kinesis data stream with a single API call. The announcement about this is over here.
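A minimal sketch of batched ingestion with boto3 (the stream name and the shape of the records are assumptions for illustration) — the helper chunks the input so no single call exceeds the 500-record PutRecords limit:

```python
import json

MAX_BATCH = 500  # PutRecords accepts at most 500 records per call

def chunk(records, size=MAX_BATCH):
    """Split a list of records into PutRecords-sized batches."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def bulk_ingest(stream_name, items):
    """Write a list of dicts to a Kinesis data stream, 500 at a time."""
    import boto3  # deferred so the chunking helper works without AWS deps
    kinesis = boto3.client("kinesis")
    for batch in chunk(items):
        kinesis.put_records(
            StreamName=stream_name,
            Records=[
                # Partition key choice here (the item's id) is an assumption;
                # pick something that spreads load across shards.
                {"Data": json.dumps(item).encode(), "PartitionKey": str(item["id"])}
                for item in batch
            ],
        )
```

Note that PutRecords can partially fail — check `FailedRecordCount` in the response and retry the failed entries in a real implementation.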
Once you've set up your ingestion and enrichment pipeline for new records, you could write a one-off application that retrieves the existing records — those older than the date you established the pipeline — and writes them into the Kinesis data stream.
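For the S3 side, that backfill application could look roughly like this — page through the bucket, keep only objects that predate the pipeline, and push them into the stream in PutRecords-sized batches. The cutoff date is hypothetical, and sending just the object key as the record data is a simplification (you might instead send the document contents, or a pointer your enrichment Lambda resolves):

```python
from datetime import datetime, timezone

CUTOFF = datetime(2019, 6, 1, tzinfo=timezone.utc)  # hypothetical pipeline start date

def needs_backfill(last_modified, cutoff=CUTOFF):
    """True for objects created before the streaming pipeline existed."""
    return last_modified < cutoff

def backfill_bucket(bucket, stream_name):
    """Page through an S3 bucket and re-ingest pre-pipeline objects."""
    import boto3  # imported here so the filter above stays dependency-free
    s3 = boto3.client("s3")
    kinesis = boto3.client("kinesis")
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        old = [o for o in page.get("Contents", []) if needs_backfill(o["LastModified"])]
        for i in range(0, len(old), 500):  # stay under the PutRecords limit
            kinesis.put_records(
                StreamName=stream_name,
                Records=[{"Data": o["Key"].encode(), "PartitionKey": o["Key"]}
                         for o in old[i:i + 500]],
            )
```

Since the old objects then flow through the same enrichment and delivery path as new ones, you only maintain one pipeline.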
Amazon Kinesis Data Streams | Service API Reference | PutRecords