amazon-web-services amazon-s3 aws-glue amazon-athena

AWS Athena and Glue indexes

Consider I have very large files (~25 millions) records of the next objects:

{
    "name": "bob",
    "age": 26,
    "address": "...",
    "identifier": "...."
}

I want to be able to index for example by address for better filtering & searching

Q: as I understand, AWS Athena doesn't have indexes machnisem, is that correct?

Q: I want to be able to use Glue indexes mechanism, partition indexes is the only way?

Q: I know that partition in Glue can be achieve by date for example, or states, how it can I index for example the address key? it can be achieve in Glue?

Thanks!

Solution

AWS-Athena is just a query engine so, it doesn't have an index mechanism.

AWS-Glue has a partition-index mechanism but there are some restrictions;

Only a partition column can be used as an index.
Only string or integer type columns can be used as an index.

But be careful, Athena query scripts have character limitation so if you are going to use these tables from Glue jobs and push_down_predicate feature then you might exceed this limit and the job can fail.

Partition-Index creation:

Go to Glue Catalog and then tables. Choose the table you want to create an index.
Go to indexes tab and add index.
Edit table and add a table parameter like below:

{partition_filtering.enabled:true}

It will be ready in a couple of minutes. From now on, your Athena queries will be faster if your query has a where condition with the relevant partition-index column. Since, all the partitions will already be listed on the background by catalog. Otherwise, all partitions will be listed first for every query even if you enter a partition column in your where condition.

Thanks.