Search code examples
amazon-web-serviceselasticsearchsearchamazon-athenaamazon-cloudsearch

Advice for implementation of a data search system for a static website with AWS infrastructure


I've got static website that I need to implement a search for a seperate data set; I'm currently hosting the site using serverless tech on AWS including S3, Cloudfront, Lambda and API gateway for some server side logic.

I've got several csv files with about 120,000 records in them with a structure like this:

ID      search_name         name              source         quantity
10002   Lorem Ipsum         Dolor sit amet    primary_name   10
10002   Lorem Ipsum         Consectetur amet  other_name     10
10002   Lorem Ipsum         Donec a erat      other_name     10
10003   Ultricies pretium   Inceptos          primary_name   100
10003   Ultricies pretium   Himenaeos         other_name     100

So the end result will be a search form on my front end that will make an API call to a backend system that queries a database or seperate software service that is able to string match the the 'search_name' field; and then return all the matches. My front end would display the records with 'source' and 'other_name' as meta data in the result rather than seperate results.

A new set of CSV files will be provided every 3 months which will contain the same, and additional records but the 'quantity' field may have a new value.

Since I've been working with serverless tech, my initial thought was to try store the files in an s3 bucket, use AWS glue to process them and make them available to AWS Athena for querying. I quiet like this setup since there aren't a lot of components to maintain and the hosting costs will be low My two concerns with this setup are the time I'll spend trying to engineer a nice search algorithm that can sort results according to how close a match they are. E.g. if a search name is ABC, it should be the first result as opposed to other items that just have ABC as part of their name. Secondly execution speed; I've ran some simple queries like this:

SELECT id, search_name, source
FROM data
WHERE search_name like '%lorem%'; 

Just using the Query editor in the Athena GUI, and the execution time can range from 0.5 to 3 seconds. It's those 3 second executions that concern me. I'm wondering how well this can be optimized. I've also read "Users can only submit one query at a time and can only run up to five simultaneous queries for each account.", unless there's some caveat to my understanding of that, sounds like it kind of kills it for me.

As a second option I was thinking of using AWS ElasticSearch. I don't know a whole lot about it but I figured that using a system that was engineered to perform search may make my final product much better. I don't know a lot about implementing it, but my concerns here are, again my ability to prioritise certain search results, and how easy its going to be to perform that data injestion process, e.g. when a new set of data arrives it needs to update the records rather than just stack on top of them. I wrote an initial script to load the csv records in there to test a query.

I've just started looking at AWS CloudSearch now which is actually looking a bit simpler than ElasticSearch, so starting to lean that way.

So the advice I'm looking for is a recommendation on what products or services I should use, be it Athena, ElasticSearch or something else, and any top level advice on how I should be implementing those services.

Thank you.


Solution

  • Just using the Query editor in the Athena GUI, and the execution time can range from 0.5 to 3 seconds. It's those 3 second executions that concern me. I'm wondering how well this can be optimized. I've also read "Users can only submit one query at a time and can only run up to five simultaneous queries for each account.", unless there's some caveat to my understanding of that, sounds like it kind of kills it for me.

    One point you should concern with the most is: Who is going to use your application? If It is only myself, I would have no problem with a few Athena queries and the slow response time. If your application is public-facing, however, think seriously about the traffic and the amount of money you are going to pay for Athena to scan your dataset over and over.

    Quick breakdown

    • Athena: Quickly have an overview of your CSV data right where it sits ( S3 ). No complex ETL / Ingestion or indexing needed. It is not particular strong at "Searching"
    • CloudSearch: Check if It is still maintained/ updated. I have a feeling It is no longer the case. Use It at your own risk!
    • ElasticSearch. Strong at searching. Especially "nature language based" searches. You can customize the ranking weight and stuffs like that may fit your requirement

    I would recommend you to go for self-hosted ElasticSearch