dbt Cheat Sheet

data build tool (dbt) Cheat Sheet for Functional Analysts

Objective of this page

This page is intended to be a cheat sheet or quick start guide to dbt for functional analysts. There is a wealth of documentation available in the Data Team dbt Guide, but it can be hard to sift through for folks who are newer to analytics engineering (ex: functional analysts).

This page aggregates existing resources in order to guide you through the basic steps required to create and test an MR to update a model in dbt.

dbt change workflow

Any development should follow the documented dbt Change Workflow. Please review that handbook page for details on how to approach making changes in dbt (planning, testing, etc).

Local setup

First things first, you need to get set up with dbt on your local machine. Please read the Configuration section of this handbook page for in-depth instructions. (Pro tip: read the entire section first, then start following the directions). As the section mentions, much of what you need is handled by running the onboarding script.
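
Once the setup (including profiles.yml, covered next) is complete, one common way to sanity-check your configuration and database connection is dbt debug. This is an optional check rather than part of the official instructions.

dbt debug    # validates dbt_project.yml, profiles.yml, and the connection to Snowflake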

profiles.yml tips

  • Looking for your profiles.yml file? Go to your home directory and press Command + Shift + . (period). This shows hidden files (those whose names start with a period) and reveals the file you need to update.
  • user: This is your GitLab email address (ex: [email protected])
  • role: This is your Snowflake role, usually first initial + last name (ex: JSMITH)
  • database: This is your first initial + last name (ex: JSMITH)
  • Set up targets for different sized warehouses (ex: one for DEV_XS and one for DEV_L); see the sketch below
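
As a rough sketch only (the profile name, account, schema, and thread count below are placeholders, not the exact values from the setup instructions; authentication settings are omitted), a profiles.yml with two warehouse targets might look like this:

# Illustrative sketch -- replace all placeholder values with your own
gitlab-snowflake:
  target: dev_xs
  outputs:
    dev_xs:
      type: snowflake
      account: your_snowflake_account    # placeholder
      user: your_email@gitlab.com        # your GitLab email address
      role: JSMITH                       # your Snowflake role
      database: JSMITH                   # your development database
      warehouse: DEV_XS                  # XS warehouse for everyday runs
      schema: preparation                # placeholder schema
      threads: 8
    dev_l:
      type: snowflake
      account: your_snowflake_account
      user: your_email@gitlab.com
      role: JSMITH
      database: JSMITH
      warehouse: DEV_L                   # L warehouse for heavier builds
      schema: preparation
      threads: 8

The top-level profile name must match the profile set in the project's dbt_project.yml, and you can switch between targets with --target (ex: dbt run --target dev_l, as shown later on this page).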

Setting up development databases in Snowflake

To do testing locally, you have your own PROD and PREP databases available for development. Because of the large data volumes (especially in the product models), we strongly recommend that you clone any required dependencies for the model(s) you are building. DO NOT build these dependencies using dbt run; reserve dbt run for the models you have actually changed. Building dependencies with dbt run takes a very long time and consumes a lot of resources because it rebuilds the models completely from scratch.

Here are a few options for cloning dependencies in Snowflake. In these examples, assume that your role is JSMITH and your development databases are JSMITH_PROD and JSMITH_PREP.

Clone a lineage using the command line

Please see the instructions here on how to clone a single model or an entire lineage (similar to the CI jobs) using the command line. This is the preferred method for cloning models locally.

Note: You need to run this in the /analytics directory. You cannot run it in the dbt virtual environment.

Clone a single model using Snowflake

This will clone a single model from the production database into your development database. Cloning a single model only takes a few seconds.

CREATE OR REPLACE TRANSIENT TABLE {DEV_DATABASE}.{SCHEMA_NAME}.{TABLE_NAME}
CLONE {PROD_DATABASE}.{SCHEMA_NAME}.{TABLE_NAME}
;

--Example
CREATE OR REPLACE TRANSIENT TABLE JSMITH_PROD.common.dim_ping_metric
CLONE PROD.common.dim_ping_metric
;

Note: the schema must already exist in your development database in order to clone the model into it.

How to create a new schema in your development database

CREATE SCHEMA IF NOT EXISTS {DEV_DATABASE}.{SCHEMA_NAME};

--Example
CREATE SCHEMA IF NOT EXISTS JSMITH_PROD.COMMON;

Clone an entire schema using Snowflake

This will clone all models in a production schema into a schema in your development database. Cloning an entire schema can take several minutes.

CREATE OR REPLACE SCHEMA {DEV_DATABASE}.{SCHEMA_NAME}
CLONE {PROD_DATABASE}.{SCHEMA_NAME}
;

--Example
CREATE OR REPLACE SCHEMA JSMITH_PROD.COMMON
CLONE PROD.COMMON
;

Running and testing models in dbt

You should always default to using an XS warehouse and be mindful of the costs associated with running queries using a larger warehouse. See the dbt Guide for guidance on selecting the appropriate warehouse when running dbt.

Running models in dbt

Once you make the desired changes to the model(s) you are working on, you need to build the model(s) locally to make sure the build succeeds and to do local testing. To do this, use the dbt run command. Here are some useful commands and tips:

dbt run --select my_model                       # run my_model
dbt run --select my_model my_model_2            # run my_model and my_model_2
dbt run --select my_model+                      # run my_model and all children
dbt run --select +my_model                      # run my_model and all parents
dbt run --select +my_model+                     # run my_model, all of its parents, and all of its children
dbt run --select @my_model                      # run my_model, all parents, all children, AND all parents of all children
dbt run --select my_model+ --exclude my_model_2 # run my_model and all children EXCEPT my_model_2
dbt run --select my_model --target dev_l        # run my_model using an L warehouse (targets defined in profiles.yml)

Testing models in dbt

Testing models in dbt just requires you to run dbt test. dbt test uses the same selection syntax (ex: --select) as dbt run.

dbt test --select my_model # run custom tests on my_model
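
Because the selection syntax carries over from dbt run, selectors like these should also work (illustrative examples, not an exhaustive list):

dbt test --select my_model+ # run tests on my_model and all of its children
dbt test --select +my_model # run tests on my_model and all of its parents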

Linting

Once you are running dbt, linting a model can be done with a single command. Please read the SQLFluff section of the SQL Style Guide for instructions on how to install SQLFluff on your local machine and more details about the linter settings. When you run the linter, the results will be printed in your terminal.

You can also leverage the fix command to have the linter apply fixes to rule violations (when possible).

sqlfluff lint models/path/to/file/file-to-lint.sql # lint the file and print results in terminal
sqlfluff fix models/path/to/file/file-to-lint.sql  # lint the file and apply fixes

Real-world walkthrough

Here is the agenda from a dbt working session with the Data team and functional analysts. You can view the recording here; the live demo of updating and testing a model starts at ~30:00.

Updating dbt documentation

There is one main file used to generate dbt Docs: schema.yml (ex: transform/snowflake-dbt/models/common_mart/schema.yml). You can read the dbt docs on dbt Docs here 🐢️🐢️🐢️.

TL;DR: Our dbt docs work by setting the model description in schema_name.md, setting the column definitions in common_columns.md, and then referencing those files in schema.yml. Additional details and examples below.

schema.yml

schema.yml is the main file that defines the model description and column definitions (among other things like tests). You can add the description and column definitions directly to this file OR you can use the doc function to reference documentation in a different file. Example schema.yml file here.

models:
- name: mart_event_valid
  description: '{{ doc("mart_event_valid") }}'
  columns:
    - name: event_pk
      description: '{{ doc("event_pk") }}'
      tests:
        - not_null
        - unique
    - name: dim_event_date_id
      description: '{{ doc("dim_event_date_id") }}'

Given the level of detail in our documentation, the schema.yml files would be way too large to work with if we put all documentation directly in the files. To mitigate that potential issue, the GitLab central Data team uses two additional files:

  1. schema_name.md: transform/snowflake-dbt/models/path/to/model_directory/schema_name.md. We use these markdown files for model descriptions. (Ex: transform/snowflake-dbt/models/common_mart/common_mart.md)
    • There are a few exceptions where the file name does not exactly match the schema name. You should look for the markdown file in the model schema’s directory.
  2. common_columns.md: transform/snowflake-dbt/models/common/common_columns.md. We use this markdown file for column definitions.

schema_name.md

schema_name.md is the file we use to define the model description. We can then reference the documentation in schema.yml. Example schema_name.md file here.

To add a new model, add the following to the markdown file:

{% docs model_name %}

Model description here

{% enddocs %}

Then you can reference the description in schema.yml using '{{ doc("model_name") }}' (example below).

models:
- name: mart_event_valid
  description: '{{ doc("mart_event_valid") }}'
  columns:

common_columns.md

common_columns.md is the file we use to define the column definitions. It works the same way as schema_name.md in that we can reference the definitions in schema.yml. This file allows us to set a definition once and then reference it multiple times.

Before adding a new column definition, please check to see if a definition already exists. If it does not, then add the following to the markdown file:

{% docs column_name %}

Column definition

{% enddocs %}

Once the column is defined, you can reference it in schema.yml using '{{ doc("column_name") }}' (example below).

models:
- name: mart_event_valid
  description: '{{ doc("mart_event_valid") }}'
  columns:
    - name: event_pk
      description: '{{ doc("event_pk") }}'

Model configuration

The model configuration can be set at the top of the .sql file. Configuration options include whether the model should be materialized as a table or a view, whether it should be fully refreshed or loaded incrementally, and so on.

Please refer to the dbt Guide for details on model configuration.
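
As an illustration only (the materialization and unique_key below are hypothetical examples, not a recommendation for any particular model), a config block at the top of a .sql file might look like this:

-- Illustrative sketch: use the materialization the dbt Guide recommends for your model
{{ config(
    materialized='incremental',
    unique_key='event_pk'
) }}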

Tags for product data models

For product data models, you need to include the "product" tag on the model to ensure that it runs in the correct DAG (i.e., it builds after the other product dependencies). Failure to include this tag could lead to a lag in data freshness (ex: if the model builds before a dependency is refreshed). Please see existing models in dbt docs for additional examples.

{{ config(
    tags=["product"]
) }}

There is additional documentation on tags on the dbt Guide handbook page.

MR workflow

The Data team has a well-documented MR workflow.

MR templates

Make sure you use the appropriate MR template based on your use case. The templates also have detailed instructions on testing, pipelines, etc. For the vast majority of use cases, you will use one of the following templates:

MR pipelines & CI jobs

Please see the Data Team CI Jobs handbook page for the most up-to-date information and details about the different CI jobs.

Note

You need to wait for a job to finish before running the next one. For example, you need to wait until the dependencies are all cloned before you can build the changed model.

MRs to add new tables or columns to the Postgres/GitLab.com pipeline

In order to facilitate self-service on requests to add a new table or column to the Postgres/GitLab.com pipeline, the Data team created a runbook and videos to outline the required steps:

Getting a review

Before assigning any reviewers, make sure that you go through the checklist in the MR templates. This will ensure that you have completed all required steps before a member of the Data team starts reviewing the code.

In the Auditing section of the MR template, make sure that all queries refer to the MR branch database. Other team members (ex: Analytics Engineers) reviewing the MR will also have access to the MR database and that is how they will review and validate the changes.
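
As a minimal sketch (the database and model names below are placeholders; use the database created by the MR's CI clone jobs and the model you actually changed), an auditing query might look like this:

-- Placeholder names: replace with your MR branch database and your changed model
SELECT COUNT(*) AS row_count
FROM YOUR_MR_BRANCH_PROD.common_mart.mart_event_valid;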

Once the MR is ready for review, tag the code owners and assign them as reviewers (you will see code owners listed in the approval section of the MR). For changes to workspace models, please be sure to have a functional analyst review the MR before tagging a member of the Data team as a reviewer.

Getting the MR merged

Once the MR is approved by the required number of code owners (as specified in the approval section of the MR), you can assign it to a maintainer for final review and merge.