February 9, 2021
For a modern Data Warehouse that lives in the cloud, which is what most modern organizations would want nowadays, there are a lot of useful tools, and more coming out each day, which is great. However, sometimes it’s difficult to decide which tools to use for each task, which things are worth building inhouse or buying externally, and the answers may not be the same for each organization.
Which is why it’s most important to understand the concepts and how these types of infrastructure pieces fit together. In this first post I'll talk about the Data Warehouse itself, the Database where we will store and retrieve the data.
The most basic and important piece would be the Data Warehouse, there are 4 well known options which are easy to set up in the cloud: Amazon’s Redshift, Google’s BigQuery, Azure’s SQL Datawarehouse, and Snowflake which can run in any of those 3 cloud services. These are not the only options, there’s open source options like Druid or Hadoop, and also more enterprise-level DW solutions such as Teradata or Vertica.
Which one you choose depends on your particular business, the requirements you have, the type of usage you will give it, your existing infrastructure and your budget. Technically you can also make a Data Warehouse in MySQL or PostgreSQL as well but i’ll advise against it, they are meant for transactional workloads (e.g. running an application) so they are pretty good at getting one particular row, but not quite good for summarizing huge amounts of data.
I have experience with Redshift and Snowflake so i can talk about those 2 databases, there are a couple of things that can make your life easier depending on which database you choose, for example, comparing Snowflake to Redshift:
On Amazon Redshift text columns are limited to up to 65,535 characters, which can be a problem if you need to add e.g. tickets and emails or source code which may sometimes be over that size. Snowflake on the other hand allows for 16,777,216, no matter what varchar limit you specify (it’s only informative). At the same time this means you have to check for text overflow errors in ETL manually for Snowflake if it doesn’t reach that character size.
Until Dec 2020, JSON data was not supported natively on Amazon Redshift, although now it has been added as a preview feature. Snowflake has support for semi-structured data including JSON and it’s queries are blazing fast.
Snowflake allows to separate the billing of processing (Virtual Warehouses) and Storage, allowing for a data source that’s not regularly queried to cost much less right from the very smallest Data Warehouse.
Redshift has allowed for this capability of separating compute and storage since Dec 2019, and added more node types since, but it is a much more rigid model in comparison, one of the biggest advantages is to be able to assign as much or as little processing power depending on the user running it, so e.g. in Snowflake if a Data Scientist will need to do very complex calculations you can give them a very powerful connection, while the people viewing dashboards can keep a small one and everyone will be running smooth without spending huge amounts of money. Another advantage of the virtual warehouse model is that you can shut it down when it’s not used, allowing you to only pay for the hours it’s used.
Another big advantage here for Snowflake is that scaling takes just a few minutes while on Redshift provisioning or deprovisioning capacity is much more complex.
The most important thing is to be very mindful of your context when choosing a Data Warehouse, and be mindful that migrating to a new one later is a huge endeavor, it is possible you start using features exclusive to one Database and it will be incompatible with others, so do not randomly choose one. Some factors to take into account would be:
I hope that these tips can help you decide which Database will be the right one for your organization. I am very skeptical of “This is the perfect tool” claims, each provider optimizes different things, so you will need to analyze your particular needs to decide which is best for you. In our particular case, on Mixmax we chose Snowflake because of the advantages we saw for our use case, plus the fact that most of our infrastructure is on AWS.