What is Amazon Redshift?
Globally, the massive amounts of data that we produce each day are becoming more and more valuable in our increasingly digital world. With experts suggesting that 90% of all data ever created was created in the last two years, we need more solutions than ever to store, manage and analyze this data. These challenges grow as data volume and data value grow.
What is Redshift and what is a Data Warehouse?
AWS have built a cloud-native Data Warehousing solution known as Amazon Redshift, which solves many of these data management problems. Redshift can be used with a large number of big data applications. Before we get into the details of Amazon Redshift, let’s look at what exactly a Data Warehouse is.
A Data Warehouse is a large data repository designed for analyzing (running SQL queries against) large data sets in order to understand the data more deeply and gain meaningful insights from it. This may sound similar to a standard SQL database, so let’s talk about the difference between a typical database and a data warehouse.
OLTP vs OLAP
To do this, we first need to define clearly what is meant by a database transaction. If you are already familiar with databases and data storage in general, you may know what a transaction is, but it is important to be crystal clear when explaining the difference between types of database.
A Database Transaction is, simply put, “a unit of work performed within a database”, i.e. a group of reads and writes. Whilst the concept of a transaction is the same across SQL databases, the way transactions are performed differs between systems, and this reflects how the data is stored.
For a standard SQL Database, a model called OLTP is used. OLTP stands for Online Transaction Processing, which means that standard databases handle a large number of simple, short read and write transactions, with a strong emphasis on writes.
We want to use a traditional SQL database to store current transactions, with fast access to specific transactions for ongoing business processes. This gives consumers a high level of responsiveness under a high volume of reads and writes. The data in these databases typically also comes from a single source.
For a Data Warehouse, on the other hand, a different model called OLAP is used. This stands for Online Analytical Processing, and it suits a different use case.
OLAP systems are optimized for long, complex SQL queries with an emphasis on reads. You want to use an OLAP system to store large amounts of historical data and run massive queries for business intelligence purposes, e.g. reports, and this is where Data Warehouses come in.
Data comes from many different sources and arrives in the Data Warehouse, where it is collated and queried.
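The contrast between the two access patterns can be sketched in a few lines of Python. This is only an illustration: it uses the standard library's in-memory SQLite database as a stand-in for any SQL engine, and the `orders` table and its sample rows are made up for the example.

```python
import sqlite3

# In-memory SQLite database standing in for any SQL engine (an assumption for
# illustration; Redshift itself is not involved here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT, region TEXT, amount REAL)"
)

# OLTP-style work: many short transactions, each touching a single row.
new_orders = [
    ("alice", "EU", 20.0),
    ("bob", "US", 35.0),
    ("carol", "EU", 15.0),
    ("dave", "US", 50.0),
]
for customer, region, amount in new_orders:
    with conn:  # commits (or rolls back) one small transaction per order
        conn.execute(
            "INSERT INTO orders (customer, region, amount) VALUES (?, ?, ?)",
            (customer, region, amount),
        )

# OLTP-style read: fetch one specific record by its key.
single_order = conn.execute(
    "SELECT customer, amount FROM orders WHERE id = ?", (2,)
).fetchone()

# OLAP-style read: scan the whole table and aggregate it for a report.
revenue_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(single_order)
print(revenue_by_region)
```

The first query touches one row and returns quickly however large the table is, whereas the second has to read every row, which is the kind of work a data warehouse is built for.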
When talking about data warehousing, you will also hear the word ‘columnar’. So what does this mean?
Columnar storage means that table data is stored column by column instead of row by row. This is an important factor in optimizing analytic query performance, as it massively reduces the I/O a query needs and suits the structure of large analytical queries.
OLAP applications look at many records at the same time, but often only a few columns of each. You save I/O and memory because you fetch only the columns you need instead of entire rows.
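To make the saving concrete, here is a small Python sketch that stores the same made-up records in a row layout and in a column layout, then counts how many values each layout has to touch to total one column. The data and function names are hypothetical, and real engines work on disk blocks, not Python lists.

```python
# Three records with four fields each (made-up sample data).
rows = [
    {"id": 1, "customer": "alice", "region": "EU", "amount": 20.0},
    {"id": 2, "customer": "bob", "region": "US", "amount": 35.0},
    {"id": 3, "customer": "carol", "region": "EU", "amount": 15.0},
]

# Column-oriented layout of the same data: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_amount_row_store(rows):
    """Sum 'amount' from a row layout; every field of every row is read."""
    values_touched = 0
    total = 0.0
    for row in rows:
        values_touched += len(row)  # the whole row comes along for the ride
        total += row["amount"]
    return total, values_touched


def total_amount_column_store(columns):
    """Sum 'amount' from a column layout; only that one column is read."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


row_total, row_touched = total_amount_row_store(rows)
col_total, col_touched = total_amount_column_store(columns)
print(row_total, row_touched)
print(col_total, col_touched)
```

Both layouts give the same total, but the row layout reads all twelve stored values while the column layout reads only the three it needs.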
Also, because data is stored in a columnar fashion, all the values in a column are of the same type and are therefore more easily compressed, which is very important for storing potentially petabytes of data cost effectively.
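One simple way to see why a column compresses well is run-length encoding, where consecutive repeats of a value are stored as a single (value, count) pair. The Python sketch below is a toy illustration of that idea on a made-up `region` column; it is not a description of how Redshift stores data internally.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# A low-cardinality column with long runs of identical values (made-up data).
region_column = ["EU"] * 500 + ["US"] * 300 + ["APAC"] * 200

encoded = run_length_encode(region_column)
print(len(region_column), "values stored as", len(encoded), "pairs")
print(encoded)
```

A row layout would interleave each region with ids, names and amounts, so these long runs of identical values would never appear next to each other.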
How Amazon Redshift Works
Redshift is all of the above and more: a columnar Data Warehousing solution built for the cloud, which allows easy analysis of data and seamless integration with other AWS services within your application stack.
Redshift comes in two principal modes: Single Node and Multi Node.
- Single Node deployments come in sizes of 160 GB per node. You can easily launch a single node if you want to test out Redshift, or if you have a smaller set of data to analyze.
- With Multi Node clusters, you can easily launch a cluster made up of a leader node and a number of compute nodes. This is more likely to be used in large implementations.
There are also two different node types, Dense Compute and Dense Storage, which have different properties and suit different use cases.
Dense Compute (dc) nodes are best for high-performance workloads, but have less storage than their counterparts, whilst Dense Storage (ds) nodes are designed for larger storage capacity and are less performant than dc nodes.
Redshift also uses something called MPP, which stands for Massively Parallel Processing.
This feature automatically distributes your workload across the nodes in your cluster, so that every node takes on a portion of the work. If you find you need more capacity after deployment, you can easily scale horizontally by adding more nodes where needed.
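The general scatter-gather idea behind MPP can be sketched in Python as follows. This is a toy model under stated assumptions: threads stand in for compute nodes, rows are assigned to nodes by a CRC32 hash of a hypothetical `customer` key, and the function names are invented for the example. It does not reflect how Redshift actually plans or executes queries.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, node_count):
    """Assign each row to a node by hashing its distribution key."""
    slices = [[] for _ in range(node_count)]
    for row in rows:
        node = zlib.crc32(row["customer"].encode()) % node_count
        slices[node].append(row)
    return slices


def partial_sum(node_rows):
    """Work done on one node: total the amounts per region for its own rows."""
    totals = defaultdict(float)
    for row in node_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


def leader_query(rows, node_count):
    """Leader role: fan the work out, then merge the partial results."""
    slices = distribute(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(partial_sum, slices))
    combined = defaultdict(float)
    for partial in partials:
        for region, subtotal in partial.items():
            combined[region] += subtotal
    return dict(combined)


# Made-up sample data.
rows = [
    {"customer": "alice", "region": "EU", "amount": 20.0},
    {"customer": "bob", "region": "US", "amount": 35.0},
    {"customer": "carol", "region": "EU", "amount": 15.0},
    {"customer": "dave", "region": "US", "amount": 50.0},
]

print(leader_query(rows, node_count=1))
print(leader_query(rows, node_count=4))
```

The answer is the same whether one node or four share the rows; adding nodes only changes how much of the data each one has to scan.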
An often overlooked aspect of Redshift is that by default it does not support Multi-AZ configurations! If you want to provision a highly available architecture with Redshift, you would need to manually provision multiple Redshift clusters in different AZs and feed them the same inputs, so that the data stays available if one cluster is lost. This often comes up in exam questions, and is important to know when looking to use Redshift in your cloud deployments.
Learn More about Redshift and Other Database Services
There are many questions in AWS exams covering the AWS Database services. You can learn more about these services in our popular cheat sheets for various AWS certifications. Here are some pointers to popular services that come up in many exam questions: