December 10, 2020
This is the tenth post of twelve for Mixmax Advent 2020.
Have you ever had to build a schema for a Data Warehouse and had no idea where to start?
Maybe you've tried downloading the data from its source systems and loading it into one table per entity. That's a good start, but then you have to make sure the data is easy for end users to work with. One way to do that is to create a new, easier-to-use table every time a requirement comes in, an approach known as "report-driven development".
This method is quite problematic for many reasons: one-off tables multiply with every request, the same business logic gets duplicated (and drifts) across them, and each upstream change risks quietly breaking several reports at once.
Instead, we chose to go with a Kimball-style Star Schema model, with some alterations. If you are unfamiliar with Ralph Kimball: he and his team are legends in the data space, and they wrote some of the best books on Data Warehousing and Business Intelligence (which basically used to be the cool names for Data Engineering and Analysis 😉).
Essentially, with a star schema you create fact and dimension tables: the facts describe business processes (customer ordered X, item moved to Y, feature Z used by customer, etc.), and the dimensions describe the entities that participate in those processes (customers, items, stores, etc.).
One of the requirements for something to be modeled as a dimension is that it changes slowly; Kimball calls these "Slowly Changing Dimensions" (SCDs). For example, something that changes every 5 minutes would not normally be modeled as a dimension or be included as a dimension attribute.
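To make that concrete, here's a minimal sketch of what a type-2 slowly changing dimension could look like; the table and column names (dim_users, valid_from, is_current, etc.) are illustrative, not our actual schema:

```sql
-- Hypothetical type-2 SCD: every change to a user's attributes adds a
-- new row, and the validity window tells you which version was current
-- at any point in time.
create table dim_users (
    user_key    integer,    -- surrogate key, unique per row/version
    user_id     varchar,    -- natural key from the source system
    plan        varchar,    -- an attribute we track over time
    valid_from  timestamp,  -- when this version became effective
    valid_to    timestamp,  -- when it was superseded (null if current)
    is_current  boolean     -- convenience flag for the latest version
);
```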
As an example, take a table called fct_orders (a fact table of orders), where each row (the grain) is an order made by a user; each row carries ids for the user who made the order, the item that was ordered, and the store it came from.
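As a sketch (again with illustrative names), that grain might look like this, with one foreign key per dimension plus the measures we want to aggregate:

```sql
-- Hypothetical fact table: one row per order.
create table fct_orders (
    order_id    varchar,       -- identifies the order itself
    user_id     varchar,       -- joins to dim_users
    item_id     varchar,       -- joins to dim_items
    store_id    varchar,       -- joins to dim_stores
    ordered_at  timestamp,     -- when the order happened
    amount      number(10, 2)  -- the measure we sum in reports
);
```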
Then there are dimensions for each of these objects (dim_users, dim_items, dim_stores), so each of these ids can be joined to obtain the attributes. This has a couple of advantages:

- The fact table stays narrow and cheap to scan: it holds keys and measures rather than a copy of every attribute.
- Each attribute lives in exactly one place, so every fact table that joins to a dimension sees the same, consistent values.
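For instance, a question like "revenue by store and plan" becomes a couple of straightforward key joins (using the illustrative tables sketched above, and assuming a dim_stores shaped like the other dimensions):

```sql
select
    s.store_name,
    u.plan,
    sum(f.amount) as revenue
from fct_orders f
join dim_users u
  on u.user_id = f.user_id
 and u.is_current            -- pick one version per user to avoid fan-out;
                             -- a point-in-time join on valid_from/valid_to
                             -- would give historically accurate attributes
join dim_stores s
  on s.store_id = f.store_id
group by 1, 2;
```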
An important consideration is that this is very different from a normalized schema, where you try to avoid data duplication and optimize for production (transactional) usage. This type of schema is optimized for reporting instead, and is in fact "denormalized". In our example above, suppose a user can have an associated company, and more than one user can be from the same company (similar to how a workspace in Mixmax can allow many users to collaborate together).
In that case, instead of linking the users dimension to a new company dimension, we may add a company_id directly to fct_orders, and/or add the relevant company attributes into the users dimension. Which solution to use depends a lot on your particular scenario. Either of these solutions avoids "snowflaking": once dimensions start referencing other dimensions, you no longer have a Star schema but a Snowflake schema (the modeling pattern, not the data warehouse product).
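Here's what the second option could look like as a dbt-style model; dim_users, stg_users, stg_companies, and the company columns are all illustrative names rather than our actual models:

```sql
-- models/dim_users.sql (hypothetical dbt model)
-- Denormalize company attributes straight into the users dimension,
-- so reports never need a separate dim_companies join.
select
    u.user_id,
    u.email,
    u.plan,
    c.company_id,
    c.company_name,      -- copied in rather than snowflaked out
    c.company_industry
from {{ ref('stg_users') }} u
left join {{ ref('stg_companies') }} c
    on c.company_id = u.company_id
```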
All this said, some of Kimball's advice is based on the performance and tooling limitations of his time, so our approach departs from the classic playbook in a few notable ways.
Related to this, we create the star schema using dbt, a modern open-source data transformation tool that lets us build our data model with organized, version-controlled SQL in our Snowflake Data Warehouse. We run everything on AWS Fargate, and use Airflow to coordinate dbt and non-dbt tasks together harmoniously.
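To give a flavor of what that looks like in practice, a fact model in dbt is just a SQL file with some Jinja; this sketch assumes a staging model named stg_orders, which is illustrative:

```sql
-- models/fct_orders.sql (hypothetical dbt model)
-- dbt compiles ref() into a fully-qualified Snowflake table name and
-- builds models in dependency order, so stg_orders runs first.
{{ config(materialized='table') }}

select
    o.order_id,
    o.user_id,
    o.item_id,
    o.store_id,
    o.ordered_at,
    o.amount
from {{ ref('stg_orders') }} o
```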
Facts, dimensions, and measures, oh my! As a data analyst, working with a star schema has been life-changing. In roles at other companies, I worked with data warehouses that lacked a true schema. It was difficult to find fields, identify joins, and know which of ten identically named fields was the right one. Then in walks Mixmax's organized data warehouse, which allows for easy data modeling (we use dbt!) with simple joins to get the data necessary to move forward.
It has also made onboarding onto a new Business Intelligence (BI) tool a breeze. Earlier this year we purchased Looker as our BI tool, and within just a few weeks we were able to accurately report on revenue in Looker. Much of this is due to the simplicity of the data model and the ability to join multiple data sources with simple keys. We were able to easily expose fact tables with the needed dimension tables to get our customers what they needed quickly.
A solid data warehouse structure will not only benefit Data Engineers, but will also allow Data Analysts to easily iterate, make changes, and create new data assets. In addition, your stakeholders will enjoy the ease of use, the short time to answer questions, and the ability to deliver value more quickly.