For a modern Data Warehouse that lives in the cloud, which is what most organizations want nowadays, there are plenty of useful tools, with more coming out each day. However, it can be difficult to decide which tool to use for each task, and which things are worth building in-house versus buying externally; the answers will not be the same for every organization.
That's why it's most important to understand the concepts and how these infrastructure pieces fit together. In this first post I'll talk about the Data Warehouse itself: the database where we will store and retrieve the data.
The most basic and important piece is the Data Warehouse itself. There are four well-known options that are easy to set up in the cloud: Amazon's Redshift, Google's BigQuery, Azure's SQL Data Warehouse, and Snowflake, which can run on any of those three clouds. These are not the only options: there are open-source alternatives like Druid or Hadoop, and more enterprise-level DW solutions such as Teradata or Vertica.
Which one you choose depends on your particular business: your requirements, the type of usage it will get, your existing infrastructure, and your budget. Technically you could also build a Data Warehouse on MySQL or PostgreSQL, but I'd advise against it. They are designed for transactional workloads (e.g. running an application), so they are very good at fetching one particular row, but not at summarizing huge amounts of data.
I have experience with Redshift and Snowflake, so I can speak to those two databases. There are a few things that can make your life easier depending on which one you choose. For example, comparing Snowflake to Redshift:
On Amazon Redshift, text columns are limited to 65,535 bytes, which can be a problem if you need to store things like tickets, emails, or source code that may exceed that size. Snowflake, on the other hand, allows up to 16,777,216 characters (the default when no length is specified). The flip side is that since text will almost never hit Snowflake's limit, if you need to enforce a maximum size you have to check for text overflow manually in your ETL.
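As a minimal sketch of that kind of ETL-side check, here is what guarding against Redshift's limit could look like. Note that the limit applies to UTF-8 bytes rather than characters, and the function names and truncation policy here are illustrative assumptions, not part of any particular ETL framework:

```python
# Sketch of an ETL-side guard for Redshift's 65,535-byte VARCHAR limit.
# Helper names and the truncation policy are illustrative assumptions.

REDSHIFT_VARCHAR_MAX_BYTES = 65535

def fits_redshift_varchar(text: str) -> bool:
    """Redshift's VARCHAR limit counts UTF-8 bytes, not characters."""
    return len(text.encode("utf-8")) <= REDSHIFT_VARCHAR_MAX_BYTES

def truncate_to_bytes(text: str, max_bytes: int = REDSHIFT_VARCHAR_MAX_BYTES) -> str:
    """Truncate to a byte budget without splitting a multi-byte character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a partial trailing multi-byte character cleanly
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

The same check, with the larger limit, would apply to Snowflake if your schema is meant to enforce a smaller maximum than the database does.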
Until Dec 2020, JSON data was not supported natively on Amazon Redshift, although it has now been added as a preview feature. Snowflake has support for semi-structured data, including JSON, and its queries on it are blazing fast.
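When a warehouse lacks native semi-structured support, a common ETL workaround is to flatten nested JSON into plain columns before loading. A minimal sketch using only the standard library (the dotted-name convention and the sample event are assumptions for illustration):

```python
import json

# Flatten nested JSON into dotted column names so each value can be
# loaded into an ordinary warehouse column.
def flatten(record: dict, parent: str = "", sep: str = ".") -> dict:
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

event = json.loads('{"user": {"id": 7, "plan": "pro"}, "action": "login"}')
row = flatten(event)
# row == {"user.id": 7, "user.plan": "pro", "action": "login"}
```

With native support, as in Snowflake's case, you can skip this step entirely and query the raw JSON directly.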
Processing vs Storage scaling
Snowflake lets you bill processing (Virtual Warehouses) and storage separately, so a data source that isn't regularly queried can cost very little, right from the smallest warehouse size.
Redshift has allowed separating compute and storage since Dec 2019, and has added more node types since, but it is a much more rigid model in comparison. One of the biggest advantages of Snowflake's model is being able to assign as much or as little processing power depending on the user running the query: if a Data Scientist needs to do very complex calculations you can give them a very powerful warehouse, while the people viewing dashboards keep a small one, and everyone runs smoothly without spending huge amounts of money. Another advantage of the virtual warehouse model is that a warehouse can shut down when it's not in use, so you only pay for the hours it actually runs.
Another big advantage for Snowflake is that scaling takes just a few minutes, while on Redshift provisioning or deprovisioning capacity is much more involved.
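To see why paying only for active hours matters, a back-of-the-envelope comparison helps. The hourly rate and usage pattern below are made-up assumptions for illustration, not actual Redshift or Snowflake prices:

```python
# Back-of-the-envelope monthly cost: always-on compute vs. a warehouse
# that suspends when idle. Rates and hours are made-up assumptions.

HOURS_PER_MONTH = 730  # average hours in a month

def always_on_cost(rate_per_hour: float) -> float:
    """Cluster billed around the clock."""
    return rate_per_hour * HOURS_PER_MONTH

def auto_suspend_cost(rate_per_hour: float, active_hours: float) -> float:
    """Warehouse billed only while it is running."""
    return rate_per_hour * active_hours

# A warehouse used ~4 hours per weekday (about 87 hours/month):
print(always_on_cost(2.0))         # 1460.0
print(auto_suspend_cost(2.0, 87))  # 174.0
```

The gap narrows for workloads that query around the clock, which is part of why pricing comparisons depend so much on your usage pattern.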
Other decision criteria
The most important thing is to be very mindful of your context when choosing a Data Warehouse, and to remember that migrating to a new one later is a huge endeavor: you may start using features exclusive to one database that are incompatible with the others, so don't choose one at random. Some factors to take into account:
Ease of use: Does it have tools for importing semi-structured data? Is it compatible with ETL tools (e.g. dbt)?
SQL Support: Even if it's not a SQL database (like Hadoop), does it allow SQL access? In Hadoop's case, there are many SQL interfaces that work on top of it.
Integrations and drivers: Does it have JDBC/ODBC drivers? How easy is it to connect to a BI tool? To Python or R?
Cloud Services integration: Where does your infrastructure already live? E.g. if you have everything in Google's cloud, BigQuery may be easy to integrate, while Amazon's Redshift would require you to move data to S3 first.
Pricing: This is one of the hardest things to evaluate, because each cloud DW has its own pricing model, which makes direct comparison non-trivial.
Performance and parallelism: This can also be complex to estimate without running some tests, because each organization's needs are different and each database has its pros and cons. Unlike transactional SQL databases, which are relatively similar to one another, each DW solution uses quite different tricks to achieve performance gains.
Flexibility: How easy and fast is it to scale processing or storage up and down? How easy is it to copy data from one database to another?
Authentication: Does it support SSO? How easy is it to set up?
Special requirements: Do you need real-time processing? Host on-premise instead of in the cloud? Do you need the Database to be open source? Do your customers need to access this data or only internal employees?
I hope these tips help you decide which database will be the right one for your organization. I am very skeptical of "this is the perfect tool" claims; each provider optimizes for different things, so you will need to analyze your particular needs to decide which is best for you. In our case at Mixmax, we chose Snowflake because of the advantages we saw for our use case, plus the fact that most of our infrastructure is already on AWS.