Getting most out of Redshift with composite joins

Overview

Redshift is a Massive Parallel Processing Columnar database storage engine offering from AWS. Users have the ability to fine tune and customize every thing to their specific use cases.

The main focus of this article is on ETL where there will be a need to join tables on composite keys. What this means irrespective of what kind of distribution strategy is used (except ALL which we cannot use for facts), there will be data distribution. How can we trick redshift not to distribute but do a composite key join with lightening fast processing.

This article touches overview of table internals but doesn’t go in depth to internal implementations.

Redshift Core Principle(s)

The underlying architecture allows redshift to scale horizontally so, disk space will not be a problem as the data grows. But, the performance of queries will change. Everything on redshift comes down to how a table is designed. Putting in decent amount of time to understand how the table is going to fit in the entire warehouse ecosystem is very critical. Since redshift is MPP system, parallelism is one of the key feature of the system. There are three core areas that will determine the performance (excluding how WLMs are configured) of a query

  • Data Distribution
  • Data Sorting
  • Column Compression

Data Distribution

Redshift will have a leader node and one or more compute/storage nodes. When a table is created, data will be distributed to slices in a node based on the distribution style chosen. The default distribution is EVEN. But, users can specify Key/ ALL distribution as well.

When EVEN is chosen, data is distributed evenly across slices randomly by redshift. Depending on how the table is used, this may or may not be a right choice.

When Key is chosen, redshift distributes data based on the key specified. Columns with high cardinality and which are frequently used for joining are preferred as good candidates for key distribution. When two tables with same key distribution participates in join, redshift will do a colocated joins and pull data from each slice and sends it to leader node for final data presentation which means, there isn’t a need to distribute data to other nodes/ slices.

When ALL is chosen, redshift will store complete data in every slice. This is going to eat more disk space. However, when joining with other tables there will not be any data distribution.

Data Sorting

Since redshift doesn’t have any concept of indexes, sorting plays a very crucial role while retrieving data. It determines whether to scan the whole table or it can prune data by filtering right away. Lot of people doesn’t put much into sortkeys but, do it wrong, it will have a detrimental effect.

Column Compression

The lesser the IO, the faster will be the query execution and column compression plays a key role. Having right compression on columns will improve performance multi-folds.

Tricking Redshift to not distribute data

Now, let’s face it. There will be instances where the default warehouse isn’t going to help with ad-hoc analysis or deep analysis. Users may need to join different tables where there could be a composite key join. However, if a table is distributed on a key, only one key can participate. Depending on the table size, composite key joins will have lot of data that needs to be re-distributed/ broadcasted to different nodes. This network intensive task will take 80–90% of the query execution and may run for several hours before results can be retrieved.

A simple hack is creating HASH value for all the composite key columns and distribute data on that. Let’s see how we can do this.

Following this approach, I was able to tune a query taking 400 minutes to execute down to under 10 minutes.

Example

Consider you have to join three tables inorder to get the final reults and the table structures are as follows

table_a

table_a and table_b have similar table structures and distribution is on id1. table_c has id1 and value and distributed on id1.

In-order to produce the final result, a and b should join on id1,id3 and id4. a and c should join on id1. If we go with a regular approach

SELECT a.id1
,a.id2
,a.id3
,a.id4
,b.code
,c.value
FROM table_a a
JOIN table_b b ON a.id1 = b.id1
AND a.id3 = b.id3
AND a.id4 = c.id4
JOIN table_c c ON a.id1 = c.id1
WHERE a.snapshot_date = '2020-10-01'
AND b.snapshot_date = '2020-10-01'

When we look at the execution plan, we see range scans on table a and b. However, depending on the size of the tables a and b, data from one of these tables will be broadcasted across all the slices in all the nodes. Depending on how large the cluster redshift may end up broadcasting TBs of data. This isn’t only going to consume space but IO as well that will hold up other queries as well.

Now lets look at one way of optimizing this

CREATE TEMP TABLE a_c
(
id1 integer encode az64
,id2 integer encode az64
,id3 integer encode az64
,id4 integer encode az64
,value varchar(100) encode zstd
,table_a_b_join_hash varchar(50) encode zstd
)
distkey(table_a_b_join_hash)
sortkey(table_a_b_join_hash);

It may appear lengthy but, this is what we are doing

  • Since we know a and c share the same distkey, we are getting data from tables a and c by joining on distkey and pushing data into a temp table. We know that a and b join is a composite key join. So, we are creating a hash value for id1,id3 and id4 and creating this hash as a distkey (please account for null values by adding appropriate ISNULL or COALESCE). We also create sortkey on the new hash column so that redshift will perform a merge join
  • We are also creating another temporary table to pull required data from table_b and creating a hash value for id1,id3 and id4 that we are going to use for joining with table a_c. Sortkey is also created on hash value.
  • For the final select statement, we are joining temp tables a_c and b on the hash values that are computed.

Since both temp tables a_c and b share the same distribution key and sort key, redshift doesn’t have to broadcast/ distribute data and at the same time, it uses merge join to pull qualifying rows which is going to be very fast compared to hash/ nested loops.

The only downside of this approach is additional disk usage for temporary tables. However, when we compare it with the whole ETL flow, its going to consume lot less disk space compared to data broadcast. This approach will also optimally use resources which will also help in expedited execution of other queries.

Written by

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store