AWS Athena - Performance Optimization



Best Practices for Query Optimization

AWS Athena is a serverless query service that allows you to analyze data stored in Amazon S3 using standard SQL. But when we work with large datasets, optimizing query performance becomes important to ensure faster execution times and reduce costs.

In this chapter, we have highlighted some best practices for improving the performance of your queries in AWS Athena.

Partition Your Data

Partitioning is one of the most effective ways to optimize query performance in AWS Athena. You can divide your data into subsets based on a column like date, region, or product category. Its benefit is that AWS Athena only scans the relevant partitions instead of the entire dataset which can significantly reduce query times, and the amount of data scanned.

How to Partition Data?

You can use the PARTITIONED BY clause when creating a table.

Example

Take a look at the following example −

CREATE EXTERNAL TABLE IF NOT EXISTS your_table_name (
   column1 STRING,
   column2 INT
)
PARTITIONED BY (year STRING, month STRING)
LOCATION 's3://your-bucket/folder/';

After creating the table, load the partitions with the MSCK REPAIR TABLE command as follows −

MSCK REPAIR TABLE your_table_name;

Optimize File Size and Formats

To optimize your query, you should choose the right file size and data format. Lets see some important points to remember about File Size and File Format while querying −

Important Points about File Size

  • Aws Athena processes multiple small files inefficiently hence small files can lead to higher cost. So, the files should not be too small.
  • On the other hand, very large files can slow down performance as they take longer to read and process.
  • It is recommended to keep file sizes between 128 MB and 1 GB to keep balance between efficiency and performance.

Important Points about File Format

Columnar formats like Parquet and ORC are perfect for AWS Athena. These formats store data by columns rather than rows, which means Athena reads only the columns you query.

For example, if you query only 3 columns from a dataset that has 10 columns, columnar formats will scan just the needed 3 columns. This makes querying faster and reduce the data scanned.

The formats such as Parquet and ORC also support data compression which can further enhance the performance.

Use Compression

You should compress your data before storing it in Amazon S3 because it can improve query performance in AWS Athena. As we know compression reduces the size of the data, which means Athena has to scan less data when executing queries.

Gzip, Snappy, and Zlib are some of the supported compression formats in Athena.

Limit Data Scanning with Selective Queries

If you want to optimize performance and reduce query costs, try not to scan the whole table using SELECT* queries in Athena. Instead of that always select only the specific columns you need for your analysis. The more data you will scan, the more time and resources Athena will require to process the query, which will increase both execution time and costs.

For example, use query like below instead of SELECT*

SELECT column1, column2 FROM your_table WHERE condition;

Use Caching for Repeated Queries

AWS Athena provides us with a Result Caching feature that stores query results for up to 45 days. If you run the same query within 45 days, Athena will return the cached results instantly without reprocessing the data which means it does not need to scan new data.

This brilliant feature not only improves the performance, but reduces the cost of querying too.

Advertisements