Moved your analytics workload to AWS? What’s next?
Congratulations on taking the first step toward fast business insights: moving your data onto Amazon S3. AWS offers multiple services for data analytics, the most popular being Amazon Athena, a pay-per-use, zero-ETL, serverless interactive query service built on standard SQL. Athena queries data in place on Amazon S3 and carries no storage of its own.
The three most common causes of data leakage from S3 are misconfigured services, overly permissive access settings, and reliance on default credentials. A well-defined data-protection strategy, incorporating the right IAM and bucket policies, server-side encryption, and S3 access control lists (ACLs), can safeguard the data stored on S3.
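As an illustration of the bucket-policy piece, the sketch below builds a policy that denies any `PutObject` request lacking a server-side-encryption header, a documented S3 pattern. The bucket name is a placeholder, and applying the policy with boto3 is shown only as a comment:

```python
import json

def build_encryption_policy(bucket_name):
    """Bucket policy that denies uploads without server-side encryption.

    `bucket_name` is a placeholder; substitute your own bucket.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                # Deny when the encryption header is absent (Null == true)
                "Condition": {
                    "Null": {"s3:x-amz-server-side-encryption": "true"}
                },
            }
        ],
    }

policy = build_encryption_policy("example-analytics-bucket")
print(json.dumps(policy, indent=2))
# To apply (requires AWS credentials):
#   boto3.client("s3").put_bucket_policy(
#       Bucket="example-analytics-bucket", Policy=json.dumps(policy))
```

This is one building block; pair it with IAM policies and ACLs as described above.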
Data is regularly uploaded to S3 through standard processes, data transfer scripts, or manual intervention. Defining a usage- and age-driven bucket lifecycle policy to recycle old data can significantly control costs. Log data is a good example: it is refreshed periodically and should be discarded after a point. Such data can also be easily compressed or transitioned to cheaper storage classes once its lifecycle is clearly defined.
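A lifecycle policy for the log-data case above can be expressed as a small configuration document. The prefix and day counts below are illustrative assumptions, not recommendations:

```python
def build_log_lifecycle(prefix="logs/", glacier_after_days=30, expire_after_days=90):
    """Lifecycle rule sketch: transition objects under `prefix` to Glacier
    after 30 days and delete them after 90. Values are illustrative."""
    return {
        "Rules": [
            {
                "ID": "recycle-old-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }

cfg = build_log_lifecycle()
# To apply (requires AWS credentials):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-analytics-bucket", LifecycleConfiguration=cfg)
```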
When to use Athena
When you consider the infrastructure setup and maintenance costs of analytics operations, Amazon Athena's serverless nature comes in handy: data analysis can start almost instantly. You don't even need to load data into Athena, as it works directly with data stored in Amazon S3.
Before moving analytics to Athena, it is worth checking whether it supports all the features your analytics operations require. For example, at the time of writing, Athena does not support user-defined functions (UDFs), stored procedures, federated connectors, and certain other SQL statements.
Optimizing for Athena
The first and foremost strategy is to partition your data so that only the data needed for analytical processing is accessed. This reduces the amount of data scanned per query, which improves performance and lowers cost. If partitioning is not possible, an alternative is to bucket the data within a single partition and restrict queries to the relevant bucket.
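One common way to partition is the Hive-style `key=value` prefix layout that Athena recognizes, so that date-filtered queries read only the matching prefixes. A minimal sketch, with hypothetical table and file names:

```python
from datetime import date

def partitioned_key(table_prefix, event_date, filename):
    """Build a Hive-style partitioned S3 key (year=/month=/day=) so Athena
    can prune partitions and scan only the dates a query needs.
    `table_prefix` and `filename` are illustrative names."""
    return (f"{table_prefix}/year={event_date.year}"
            f"/month={event_date.month:02d}/day={event_date.day:02d}/{filename}")

key = partitioned_key("clickstream", date(2019, 5, 7), "events-0001.parquet")
print(key)  # clickstream/year=2019/month=05/day=07/events-0001.parquet
```

A query filtering on `year`, `month`, and `day` then scans only the matching partition instead of the whole table.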
Using compression schemes that produce compressed yet splittable files enables efficient parallel access by multiple readers. In addition, optimally sized files (generally 128 MB or larger) avoid the overhead of repeated metadata lookups and S3 access requests. Apache Parquet and Apache ORC compress data by default and create splittable files.
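Before merging many small files into larger Parquet or ORC files, it helps to plan how they will be grouped. The sketch below greedily batches files toward a roughly 128 MB target; an actual compaction job would then merge each batch (for example via a CTAS query or a Glue/Spark job):

```python
TARGET_BYTES = 128 * 1024 * 1024  # ~128 MB target per output file

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Greedily group file sizes (in bytes) into batches of roughly
    `target` bytes. A planning sketch, not a full compaction pipeline."""
    batches, current, current_size = [], [], 0
    for size in file_sizes:
        # Start a new batch when adding this file would overshoot the target
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

mb = 1024 * 1024
# Six 48 MB files -> three batches of two (96 MB each stays under 128 MB)
batches = plan_compaction([48 * mb] * 6)
print(len(batches))  # 3
```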
The next crucial step in using Athena efficiently is to tune your queries: optimize ORDER BY (sorting), joins, grouping, and the LIKE operator, and use approximation functions where exact results are not required. Another common pitfall is selecting all columns (with SELECT *), which causes unnecessary data access from S3 and additional cost.
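Because Athena bills per data scanned, the savings from selecting only the needed columns are easy to estimate. A small sketch at the commonly cited $5-per-TB rate (verify current pricing for your region):

```python
PRICE_PER_TB = 5.0  # commonly cited Athena rate, USD per TB scanned; verify for your region

def query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB):
    """Estimate an Athena query's cost from bytes scanned. Selecting only
    needed columns from columnar files directly reduces this number."""
    return bytes_scanned / (1024 ** 4) * price_per_tb

tb = 1024 ** 4
gb = 1024 ** 3
# SELECT * scanning 1 TB vs. three columns scanning ~120 GB of the same table
print(query_cost(tb))                      # 5.0
print(round(query_cost(120 * gb), 2))      # 0.59
```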
How Accelerite ShareInsights Helps
Accelerite ShareInsights 3.0 offers price and time forecasting, choosing the best combination of services to avoid sticker shock from the AWS bill. In some cases, customers have achieved 20x cost savings by switching to the recommended, better-suited services. The no-code nature of ShareInsights helps break down silos between technical and business users, and its intuitive drag-and-drop UI lets business users operate it smoothly.
For more information, please visit https://accelerite.com/products/shareinsights/shareinsights-on-aws/.