I've got a static website for which I need to implement search over a separate data set. I'm currently hosting the site using serverless tech on AWS, including S3, CloudFront, Lambda, and API Gateway for some server-side logic.
I've got several CSV files containing about 120,000 records with a structure like this:

ID     search_name        name              source        quantity
10002  Lorem Ipsum        Dolor sit amet    primary_name  10
10002  Lorem Ipsum        Consectetur amet  other_name    10
10002  Lorem Ipsum        Donec a erat      other_name    10
10003  Ultricies pretium  Inceptos          primary_name  100
10003  Ultricies pretium  Himenaeos         other_name    100
So the end result will be a search form on my front end that makes an API call to a backend system, which queries a database or separate software service able to string-match the 'search_name' field and return all the matches. My front end would display records that share an ID as a single result, with the 'source' and 'other_name' values shown as metadata rather than as separate results.
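To make the "metadata rather than separate results" idea concrete, here's a minimal pure-Python sketch of the grouping step the Lambda could do before responding. The function name and the `other_names` field are my own; the row fields follow the CSV structure above:

```python
from collections import OrderedDict

def group_rows(rows):
    """Collapse rows sharing an ID into one result: the primary_name
    row supplies the display name, other_name rows become metadata."""
    results = OrderedDict()
    for row in rows:
        rec = results.setdefault(row["ID"], {
            "ID": row["ID"],
            "search_name": row["search_name"],
            "quantity": row["quantity"],
            "name": None,
            "other_names": [],
        })
        if row["source"] == "primary_name":
            rec["name"] = row["name"]
        else:
            rec["other_names"].append(row["name"])
    return list(results.values())

rows = [
    {"ID": "10002", "search_name": "Lorem Ipsum", "name": "Dolor sit amet",
     "source": "primary_name", "quantity": "10"},
    {"ID": "10002", "search_name": "Lorem Ipsum", "name": "Consectetur amet",
     "source": "other_name", "quantity": "10"},
]
grouped = group_rows(rows)
# one result for ID 10002, with "Consectetur amet" under other_names
```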
A new set of CSV files will be provided every 3 months; it will contain the same records plus additional ones, but the 'quantity' field may have new values.
Since I've been working with serverless tech, my initial thought was to store the files in an S3 bucket, use AWS Glue to process them, and make them available to AWS Athena for querying. I quite like this setup, since there aren't a lot of components to maintain and the hosting costs will be low. My first concern with this setup is the time I'll spend trying to engineer a nice search algorithm that can sort results according to how close a match they are. E.g. if a search_name is exactly 'ABC', it should be the first result, ahead of items that merely contain 'ABC' somewhere in their name. My second concern is execution speed; I've run some simple queries like this:
SELECT id, search_name, source
FROM data
WHERE search_name like '%lorem%';
Just using the query editor in the Athena GUI, the execution time ranges from 0.5 to 3 seconds. It's the 3-second executions that concern me, and I'm wondering how well this can be optimized. I've also read that "Users can only submit one query at a time and can only run up to five simultaneous queries for each account." Unless I'm misunderstanding that, it sounds like a deal-breaker for me.
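On the ranking concern: whichever backend does the matching, the "exact match first" behaviour can be expressed as a simple tiering scheme (exact, then prefix, then substring), either in the SQL itself or as post-processing in the Lambda. A pure-Python sketch of the idea (the function name is mine):

```python
def rank_matches(term, names):
    """Filter to substring matches on the search term, then order them:
    exact matches first, then prefix matches, then everything else;
    ties break alphabetically."""
    t = term.lower()

    def tier(name):
        n = name.lower()
        if n == t:
            return 0  # exact match
        if n.startswith(t):
            return 1  # prefix match
        return 2      # substring match anywhere else

    return sorted((n for n in names if t in n.lower()),
                  key=lambda n: (tier(n), n.lower()))

hits = rank_matches("abc", ["ABC Industries", "ABC", "Lorem ABC Ltd", "XYZ"])
# → ["ABC", "ABC Industries", "Lorem ABC Ltd"]
```

The same tiers translate to Athena as an ORDER BY on a CASE expression over `search_name`, at the cost of still scanning the data on every query.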
As a second option I was thinking of using AWS Elasticsearch. I don't know a whole lot about it, but I figured that using a system engineered specifically for search might make my final product much better. My concerns here are, again, my ability to prioritise certain search results, and how easy the data ingestion process will be; e.g. when a new set of data arrives, it needs to update existing records rather than just stack on top of them. I wrote an initial script to load the CSV records in so I could test a query.
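For the "update rather than stack" requirement, one common approach is to give each document a deterministic `_id`, so re-running the quarterly load replaces existing documents instead of duplicating them. A sketch under my own assumptions (comma-delimited files with a header row, and an `ID:name` convention for the document id, since each ID spans several rows):

```python
import csv, io

def csv_to_bulk_actions(csv_text, index="records"):
    """Turn a quarterly CSV into Elasticsearch bulk 'index' actions.
    A deterministic _id means re-ingesting the same record overwrites
    the existing doc, so a changed 'quantity' updates in place."""
    actions = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        actions.append({
            "_op_type": "index",   # index = create-or-replace by _id
            "_index": index,
            "_id": f"{row['ID']}:{row['name']}",
            "_source": {
                "id": row["ID"],
                "search_name": row["search_name"],
                "name": row["name"],
                "source": row["source"],
                "quantity": int(row["quantity"]),
            },
        })
    return actions

sample = ("ID,search_name,name,source,quantity\n"
          "10002,Lorem Ipsum,Dolor sit amet,primary_name,10\n")
actions = csv_to_bulk_actions(sample)
# feed `actions` to elasticsearch.helpers.bulk(client, actions)
```

Note that records dropped from a future feed would linger under this scheme; if that matters, loading each quarter into a fresh index and switching an alias over is the usual fix.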
I've just started looking at AWS CloudSearch, which actually looks a bit simpler than Elasticsearch, so I'm starting to lean that way.
So the advice I'm looking for is a recommendation on which products or services I should use, be it Athena, Elasticsearch or something else, plus any top-level advice on how I should implement them.
Thank you.
The point you should be most concerned with is: who is going to use your application? If it's only you, there's no problem with a few Athena queries and the slow response time. If your application is public-facing, however, think seriously about the traffic and the amount of money you'll pay for Athena to scan your dataset over and over.
Quick breakdown
I would recommend going with self-hosted Elasticsearch.
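If you go that way, your exact-match-first requirement is mostly a query-design problem. A sketch of a query body, assuming Elasticsearch's default dynamic mapping (which indexes `search_name` as analyzed text with a `.keyword` subfield); the function and boost value are my own illustration:

```python
def build_query(term):
    """Query body: 'match' scores token matches on the analyzed field,
    while the boosted 'term' on the exact .keyword subfield pushes
    exact matches to the top of the ranking."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"search_name": term}}],
                "should": [{
                    "term": {
                        "search_name.keyword": {"value": term, "boost": 5.0}
                    }
                }],
            }
        }
    }

body = build_query("ABC")
```

For true substring matching (your `LIKE '%lorem%'` case) you'd additionally want a `wildcard` clause or an n-gram analyzer, which `match` alone doesn't give you.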