An Introduction to Analytics Services on AWS
Amazon Web Services (AWS) was the first significant player to offer reasonably priced cloud infrastructure and services, and it continues to be the single largest vendor in the cloud market. With AWS, businesses have access to extremely durable storage, cost-effective compute power, high-performing databases and more without the hassle of provisioning and managing infrastructure. AWS services are available without any up-front investments, and you pay for only what you use.
Let’s take a look at some key services Amazon offers for data analytics.
1. AWS Glue
Glue is Amazon’s extract, transform, and load (ETL) service. It automates much of the time-consuming coding and manual work needed to prepare data for analytics. Glue can discover and prepare data stored in Amazon S3, Amazon Redshift, or databases running on Amazon RDS. It can also use data from PostgreSQL, Oracle, MySQL, and Microsoft SQL Server databases that run on Amazon EC2.
All you need to do is pick a data source and a data target. Glue automatically generates the necessary code in Python or Scala to extract data from the source, transform it into the target data schema, and load it into the target. With Glue, you can schedule recurring ETL jobs, chain different jobs together, or call jobs on demand from services such as AWS Lambda. Glue takes care of the dependencies between jobs, balances underlying resources, and reruns jobs when they fail.
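To make the on-demand case concrete, here is a minimal sketch of a helper that starts one run of an existing Glue job, for example from inside a Lambda handler. It assumes a boto3 Glue client is passed in. The job name and the job argument in the usage comment are hypothetical.

```python
from typing import Any, Dict, Optional


def start_etl_job(glue_client: Any, job_name: str,
                  arguments: Optional[Dict[str, str]] = None) -> str:
    """Start one run of an existing Glue job and return its run ID."""
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments=arguments or {},  # job parameters; empty if none given
    )
    return response["JobRunId"]


# Usage sketch (needs boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   run_id = start_etl_job(boto3.client("glue"), "nightly-orders-etl",
#                          {"--target_table": "orders_clean"})
```

Passing the client in as a parameter keeps the helper easy to test without touching a real AWS account.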
2. Amazon Athena
Athena lets you run interactive queries on data stored in Amazon S3 using standard SQL. Athena is serverless, which means that you don’t have to set up or manage any underlying infrastructure. Since it works directly on data stored in S3, you don’t need to load data into Athena or prepare it for analysis. Athena is best suited for quick, ad-hoc queries, but it can also run complex analyses, including aggregations, large joins, and window functions.
To start using Athena, simply log in to the AWS Management Console, open Athena, define the schema, and start running SQL queries against your data in S3. You pay only for the queries that you execute. Amazon charges $5 for each terabyte scanned by your queries. Athena uses the open source Presto distributed SQL query engine and supports an array of standard data formats such as CSV, JSON, ORC, Avro, and Parquet.
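The same flow can be driven from code. The sketch below submits a query, waits for it to finish, and turns the bytes scanned into a rough cost at the $5 per terabyte rate mentioned above. It assumes a boto3 Athena client is passed in and that a terabyte is counted as 2**40 bytes. The database, query, and bucket in the usage comment are hypothetical.

```python
import time
from typing import Any

PRICE_PER_TB_USD = 5.0    # per-terabyte rate quoted in the article
BYTES_PER_TB = 1024 ** 4  # assumption: one terabyte = 2**40 bytes


def run_query(athena: Any, sql: str, database: str, output_location: str,
              poll_seconds: float = 1.0) -> int:
    """Run a query to completion and return the bytes it scanned."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = athena.get_query_execution(
            QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state == "SUCCEEDED":
            return execution["Statistics"]["DataScannedInBytes"]
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Query {query_id} ended in state {state}")
        time.sleep(poll_seconds)


def estimate_cost_usd(bytes_scanned: int) -> float:
    """Approximate query cost from the amount of data scanned."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Usage sketch (needs boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   scanned = run_query(boto3.client("athena"),
#                       "SELECT COUNT(*) FROM clicks", "weblogs",
#                       "s3://example-athena-results/")
#   print(estimate_cost_usd(scanned))
```

Because the charge depends on data scanned, columnar formats such as Parquet or ORC that let a query read fewer bytes tend to lower the bill.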
3. Amazon SageMaker
SageMaker enables you to build, train and deploy machine learning models for predictive analytics with little effort and at low cost using built-in machine learning algorithms.
SageMaker does this through pre-built Jupyter notebooks that are available for a wide range of use cases and applications. To create your machine learning model, you log in to the SageMaker console, launch a notebook instance, and pick a built-in algorithm such as Linear Learner or K-Means, or import your own custom algorithm. To train your model, you just specify the location of your training data in S3 and the type and number of instances you require. SageMaker automatically creates a distributed computing cluster, performs the training, saves the output to S3, and tears down the cluster once the training is complete. SageMaker also offers automatic model tuning, which tries many hyperparameter combinations to find the most accurate predictions that the model can generate.
When the model is ready to be deployed, SageMaker takes care of launching the instances and deploying the model across multiple Availability Zones. It carries out health checks, applies patches, enables auto-scaling, and sets up a secure HTTPS endpoint for your application.
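The training inputs described above (data location in S3, instance type, and instance count) map onto a training job request. The sketch below builds such a request as a plain dictionary in the shape expected by the boto3 SageMaker client's create_training_job call. The volume size, runtime limit, and every name in the usage comment are illustrative assumptions, and the container image URI for a built-in algorithm has to be looked up for your region.

```python
from typing import Any, Dict


def build_training_request(job_name: str, image_uri: str, role_arn: str,
                           train_s3_uri: str, output_s3_uri: str,
                           instance_type: str,
                           instance_count: int) -> Dict[str, Any]:
    """Assemble the keyword arguments for a SageMaker training job."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,   # container holding the algorithm
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,              # IAM role SageMaker runs under
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,    # where the training data lives
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 10,         # illustrative value
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},  # illustrative
    }


# Usage sketch (needs boto3 and AWS credentials; all values hypothetical):
#   import boto3
#   request = build_training_request(
#       "kmeans-demo", "<algorithm-image-uri>", "<role-arn>",
#       "s3://example-bucket/train/", "s3://example-bucket/output/",
#       "ml.m5.xlarge", 2)
#   boto3.client("sagemaker").create_training_job(**request)
```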
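Once an endpoint exists, an application can request predictions from it over HTTPS. The following sketch sends one row of numeric features as CSV. It assumes a boto3 "sagemaker-runtime" client is passed in and that the deployed model accepts text/csv input. The endpoint name in the usage comment is hypothetical.

```python
from typing import Any, List


def predict(runtime_client: Any, endpoint_name: str,
            features: List[float]) -> str:
    """Send one CSV row to an endpoint and return the raw response text."""
    payload = ",".join(str(value) for value in features)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=payload.encode("utf-8"),
    )
    # The response body is a stream; read it fully and decode it.
    return response["Body"].read().decode("utf-8")


# Usage sketch (needs boto3 and AWS credentials; name is hypothetical):
#   import boto3
#   result = predict(boto3.client("sagemaker-runtime"),
#                    "churn-model-endpoint", [0.4, 12.0, 3.0])
```

The format of the returned text depends on the model container, so parsing is left to the caller.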
4. Amazon S3
S3 is an object storage service that lets you store and retrieve any amount of data, at any time, from anywhere on the web. With S3, you get access to the same scalable, reliable, and high-speed data storage infrastructure that Amazon uses for its own websites.
S3 has three storage categories: Amazon Standard Storage, Amazon Infrequent Access Storage, and Amazon Glacier. Standard Storage delivers low latency and high throughput and is used for data that is frequently accessed, such as dynamic websites, mobile and gaming applications, and big data analytics.
Amazon Infrequent Access is for data that is not accessed very frequently but must be retrieved quickly when needed. Infrequent Access offers a lower per GB storage price and is best suited for backups, long-term storage, and disaster recovery files.
Amazon Glacier is the cheapest storage class in S3 and is explicitly designed for data archiving, with data retrieval times ranging from a few minutes to hours.
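In the API, the storage category is chosen per object at upload time. Here is a minimal sketch, assuming a boto3 S3 client is passed in. The three access-pattern labels are made up for this example, and the mapping onto the STANDARD, STANDARD_IA, and GLACIER storage class values simply follows the descriptions above.

```python
from typing import Any

# Hypothetical access-pattern labels mapped to S3 StorageClass values.
STORAGE_CLASS_BY_PATTERN = {
    "frequent": "STANDARD",       # low latency, frequently accessed data
    "infrequent": "STANDARD_IA",  # rarely read but needed quickly
    "archive": "GLACIER",         # cheapest, retrieval takes minutes to hours
}


def upload_with_class(s3_client: Any, bucket: str, key: str,
                      body: bytes, access_pattern: str) -> str:
    """Upload an object using the storage class for its access pattern."""
    try:
        storage_class = STORAGE_CLASS_BY_PATTERN[access_pattern]
    except KeyError:
        raise ValueError(f"Unknown access pattern: {access_pattern!r}")
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        StorageClass=storage_class,
    )
    return storage_class


# Usage sketch (needs boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   upload_with_class(boto3.client("s3"), "example-backups",
#                     "2019/db-dump.sql.gz", b"...", "infrequent")
```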
5. Amazon EC2
Amazon Elastic Compute Cloud (EC2) provides compute capacity that lets you run applications on AWS. With EC2, you can spin up new virtual machines in minutes and rapidly scale capacity up or down based on your changing requirements.
Amazon EC2 offers a wide variety of instance types, sizes, and pricing structures that are optimized for various use cases, including instance types for compute-optimized, memory-optimized, accelerated computing, and storage-optimized workloads.
On-demand instances let you create new server instances whenever needed and are charged on an hourly basis. Reserved instances are offered at discounted prices but come with a commitment of a one-year or three-year contract.
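Launching on-demand capacity comes down to a single API call. The sketch below starts a given number of identical instances and returns their IDs. It assumes a boto3 EC2 client is passed in. The image ID and instance type in the usage comment are placeholders.

```python
from typing import Any, List


def launch_instances(ec2_client: Any, image_id: str,
                     instance_type: str, count: int = 1) -> List[str]:
    """Launch `count` on-demand instances and return their instance IDs."""
    response = ec2_client.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=count,  # fail rather than launch fewer than requested
        MaxCount=count,
    )
    return [instance["InstanceId"] for instance in response["Instances"]]


# Usage sketch (needs boto3 and AWS credentials; values are placeholders):
#   import boto3
#   ids = launch_instances(boto3.client("ec2"), "<ami-id>", "t3.micro", 2)
```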
6. Amazon Kinesis
Amazon Kinesis gives you the ability to collect, process, and analyze real-time, streaming data such as video, audio, IoT telemetry data, website clickstreams, financial transactions and application logs. With Kinesis, you can analyze data as soon as it is generated and react instantly instead of waiting for weeks or months for the data to be collected and processed.
Kinesis can process hundreds of terabytes per hour from thousands of sources with very low latency.
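On the producer side, feeding a stream can look like the sketch below, which serializes one event as JSON and writes it to a Kinesis data stream. It assumes a boto3 Kinesis client is passed in. The stream name and event fields in the usage comment are hypothetical.

```python
import json
from typing import Any, Dict


def send_event(kinesis_client: Any, stream_name: str,
               event: Dict[str, Any], partition_key: str) -> str:
    """Write one JSON-encoded event to a stream; return its sequence number."""
    response = kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=partition_key,  # decides which shard gets the record
    )
    return response["SequenceNumber"]


# Usage sketch (needs boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   send_event(boto3.client("kinesis"), "clickstream",
#              {"user": "u42", "page": "/pricing"}, partition_key="u42")
```

Using a per-user value as the partition key keeps each user's events in order within one shard.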
7. Amazon RDS
Amazon Relational Database Service (Amazon RDS) lets you set up, manage, and scale relational databases on AWS. RDS automatically provisions hardware, sets up the database, performs backups, and applies patches. RDS also automates synchronous data replication across multiple Availability Zones with automatic failover.
You can access Amazon RDS through the AWS Management Console, the AWS Command Line Interface, or the Amazon RDS APIs. By default, Amazon RDS limits each account to 40 database instances. It supports six database engines: MySQL, PostgreSQL, MariaDB, SQL Server, Oracle Database, and Amazon Aurora.
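As a small example of the RDS API, this sketch counts the database instances in an account and reports how much headroom is left under the 40-instance limit mentioned above. It assumes a boto3 RDS client is passed in. The limit is a parameter because the actual quota for a given account may differ.

```python
from typing import Any, Dict


def count_db_instances(rds_client: Any) -> int:
    """Count all DB instances, following pagination markers."""
    total = 0
    kwargs: Dict[str, Any] = {}
    while True:
        page = rds_client.describe_db_instances(**kwargs)
        total += len(page["DBInstances"])
        marker = page.get("Marker")
        if not marker:  # no marker means this was the last page
            return total
        kwargs = {"Marker": marker}


def remaining_instance_quota(rds_client: Any, limit: int = 40) -> int:
    """Return how many more instances fit under the assumed limit."""
    return max(limit - count_db_instances(rds_client), 0)


# Usage sketch (needs boto3 and AWS credentials):
#   import boto3
#   print(remaining_instance_quota(boto3.client("rds")))
```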
AWS provides a vast set of analytics services that can help you generate insights quickly and efficiently. However, this broad portfolio can get overwhelming for users. You can read up on how the consumption of these analytics services can be simplified here or check out this article.