When you use CTAS queries to create new tables, you can add an ORDER BY
. You have to do this when you combine it with bucket_by
. If you don't bucket, will an ORDER BY
still matter for your performance?
The short answer is No.
When you query big data, you need to think about the complexity of scaning the data to find the data that you are looking for. Techniques such as trees that are used when using a relational database (creating an index, for example) are not relevant in the realm of big data. Therefore, the order of the data, which can help when creating such as index, is not relevant for big data engine such as the ones used in Athena.
In big data you should think of ways to allow filtering of the data to avoid scanning all of it to find the data. The most common way to filter is using buckets. For example, if you have daily or monthly buckets, you can skip all the non relevant buckets when you only care about data of a specific day or month. Therefore, you want the data to be ordered in a way that will make it easy to split it to the relevant buckets, and the order of the data within the bucket, is not important.
Another aspect to consider of the data organization in big data is that the data will be processed in a distributed way. Your SQL query will be parsed, compiled and sent to hundreds or thousands of processors to do the different steps of the query execution. Some parts of your query can benefit from having the data ordered in the order that is expected by the query, such as Window functions. However, you can send different SQL queries that some will use the slight efficency that ordered data can get, while others will not. Since you can have only a single physical order in the data files, thinking of that order ahead of time (CTAS) is usually useless.