What is dbt - and how to set up your dbt developer environment?
Most of the world nowadays agrees that having some sort of data warehouse is key to a successful data operation. Every company - no matter how big or small - uses quite a number of tools which all produce data. Combining the data from these tools is essential to make use of it. That's - in short - the sales pitch for a data warehouse. Compared to the very expensive and huge "enterprise warehouses" of 20 years ago, there is a new breed of warehouses available - at marginal cost and with incredible scalability. The "modern data warehouse": a warehouse which is feasible for any company, no matter its budget or size.
Now, let's assume that we succeeded in establishing the warehouse technology and integrating all our data sources into this warehouse. What do we have? A bunch of raw data, all organized in a central place. Which is already good - but now comes the work of making use of it.
And here is where dbt comes into play. This post will cover two basic topics: what dbt is, and how to set up a dbt developer environment.
What is dbt?
It's rare these days that one can simply navigate to a tool's website and get a perfect description of what it does. dbt actually is one of these tools. So, I have the pleasure of simply quoting dbt's website:
dbt transforms data in your data warehouse and helps data engineers work like software engineers - to ship trusted data, incrementally and fast.
dbt does this by providing tools to:
- use supercharged SQL (more on that later) to transform your raw data into usable, well-defined models
- provide easy-to-maintain documentation tools
- add a thin data testing layer to make sure your data is what you expect
This marketing image summarizes this quite well.
dbt Marketing Slide
Summarized, dbt provides us with an easy-to-use, easy-to-maintain way of modelling our raw data into multiple ready-to-use datasets which can be utilized by downstream operations like machine learning, BI and data analytics. dbt does this by providing a powerful SQL interface plus tools for documentation and testing.
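To make "supercharged SQL" concrete: a dbt model is a single select statement in a `.sql` file, with Jinja functions like `ref()` filling in table references. A minimal, hypothetical example (the model and source names here are made up for illustration):

```sql
-- models/customer_orders.sql (hypothetical example)
select
    c.customer_id,
    c.first_name,
    count(o.order_id) as order_count
from {{ ref('raw_customers') }} as c
left join {{ ref('raw_orders') }} as o
    on c.customer_id = o.customer_id
group by 1, 2
```

You only write the select; dbt resolves each `ref()` to the right table in your warehouse and wraps the query in the DDL needed to materialize it as a table or view.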
If you want to see dbt in action, skip to section How to model data with dbt?.
How to set up dbt?
In this section, we will install dbt and create a nice and tidy vscode development setup.
dbt core - the open source and, well... core offering of dbt - is actually just a Python package one needs to install.
Besides the dbt main application, you can install one or several database adapters.
Run these commands to install dbt.
```shell
# Optional but highly recommended: create and activate a virtual environment
# Instead of venv, you might also use anaconda for environment management
python3 -m venv dbt-venv
source dbt-venv/bin/activate

# Install the core application
pip install dbt-core

# Install one or several of the following adapters
pip install dbt-bigquery       # If you want to model data in your bigquery data warehouse
pip install dbt-athena-adapter # If you want to model data in your athena data warehouse
pip install dbt-postgres       # If you want to model data in your postgres data warehouse
pip install dbt-synapse        # If you want to model data in your synapse data warehouse
pip install dbt-databricks     # If you want to model data in your databricks platform
pip install dbt-redshift       # If you want to model data in your redshift data warehouse
pip install dbt-snowflake      # If you want to model data in your snowflake data warehouse
pip install dbt-spark          # If you want to model data in your spark platform
```
These are all the dependencies one needs to start developing with dbt!
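Since dbt core and its adapters are ordinary Python packages, you can quickly sanity-check what landed in your environment. A small sketch using only the standard library (the package names match the pip install commands above; run it inside the activated dbt-venv):

```python
from importlib import metadata

# Sanity check: see which dbt packages ended up in the active environment.
# "not installed" simply means that adapter was not picked in the
# pip install step above.
def installed_versions(packages):
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = "not installed"
    return versions

for pkg, version in installed_versions(["dbt-core", "dbt-bigquery"]).items():
    print(f"{pkg}: {version}")
```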
For developing our data models we are going to use VSCode - as it provides awesome extensions which make working with dbt fun and exciting.
- If not already done, install vscode
- Set up the folder structure as follows
- Add a `.gitignore` file with the following content:

```
target/
dbt_modules/
dbt_packages/
logs/
venv/
env/
test.env
__pycache__
.vscode
export.json
```

- Run `dbt init` and follow the prompts. The init application might ask different questions, depending on which database adapter you are going to use. Here, I'm using BigQuery as an example (make sure you have installed the appropriate adapter - see the `pip install` steps above):

```shell
dbt init

# This is the result
17:06:52 Running with dbt=1.4.5
Enter a name for your project (letters, digits, underscore): waffle_shop
Which database would you like to use?
[1] bigquery

(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)

Enter a number: 1
[1] oauth
[2] service_account
Desired authentication method option (enter a number): 2
keyfile (/path/to/bigquery/keyfile.json): /tmp/google_key.json
project (GCP project id): devopsandmore
dataset (the name of your dbt dataset): dbt_devops
threads (1 or more): 8
job_execution_timeout_seconds [300]: 300
[1] US
[2] EU
Desired location option (enter a number): 2
17:07:25 Profile waffle_shop written to /home/andreas/.dbt/profiles.yml using targets profile_template.yml and your supplied values. Run 'dbt debug' to validate the connection.
17:07:25
Your new dbt project "waffle_shop" was created!

For more information on how to configure the profiles.yml file,
please consult the dbt documentation here:

https://docs.getdbt.com/docs/configure-your-profile

One more thing:

Need help? Don't hesitate to reach out to us via GitHub issues or on Slack:

https://community.getdbt.com/

Happy modeling!
```

Finishing this installer will create the following folder structure:
```
.
├── logs
│   └── dbt.log
└── waffle_shop
    ├── README.md
    ├── analyses
    ├── dbt_project.yml
    ├── macros
    ├── models
    │   └── example
    │       ├── my_first_dbt_model.sql
    │       ├── my_second_dbt_model.sql
    │       └── schema.yml
    ├── seeds
    ├── snapshots
    └── tests
```

- logs: This folder contains the logs of the dbt init operation. You can delete it.
- waffle_shop: This is the main dbt project folder. All your model SQL files, description files and settings are located here. The name of the folder is the name of the dbt project you entered during `dbt init`.
- README.md: Readme file. Feel free to change it.
- analyses: Folder which might contain dbt analyses. Despite its name, analyses are not the main thing in dbt - "models" are.
- dbt_project.yml: Project settings file
- macros: Folder to store dbt macros
- models: This is the folder you will spend most of your time in. It contains all dbt models - as a reminder, dbt models are "just" SQL files, enhanced with Jinja template code.
- example: Delete this folder
- seeds: Folder which might contain dbt seeds
- snapshots: Folder which might contain dbt snapshots
- tests: Folder containing custom dbt tests
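Note that the folder tree above does not contain the connection details you entered during `dbt init` - those are written to `~/.dbt/profiles.yml` instead. With the BigQuery answers from the transcript above, that file should look roughly like this (a sketch - double-check the exact keys against the dbt BigQuery adapter documentation):

```yaml
waffle_shop:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      keyfile: /tmp/google_key.json
      project: devopsandmore
      dataset: dbt_devops
      threads: 8
      job_execution_timeout_seconds: 300
      location: EU
```

Keeping credentials outside the project folder means you can commit the project to git without leaking secrets.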
- Install vscode-dbt. This adds support for the dbt syntax (dbt-sql, jinja, yaml, markdown)
- Install sqlfluff:

```shell
pip install sqlfluff
```

- Install the sqlfluff vscode extension
- Add a file `.sqlfluff` to the top-level folder, with the following content:

```ini
[sqlfluff]
# Choose the sql dialect to use
dialect = bigquery
# Use jinja templater instead of dbt, as the dbt templater is slow
# and does not work with bigquery dialect
templater = jinja
# Optional: Exclude some rules
# Find all rules here: https://docs.sqlfluff.com/en/stable/rules.html
exclude_rules = L034, L032, L024, L015, L008, L001
ignore_templated_areas = True
large_file_skip_byte_limit = 200000

[sqlfluff:templater:jinja]
# Make dbt default macros available like `ref`
apply_dbt_builtins = true
# Load custom macros from the macro directory
load_macros_from_path = ./macros/
# Allow mocking dbt_utils and other packages through
# python files placed in this dir
library_path = ./sqlfluff_libs/

# Mock variables that are used in dbt
[sqlfluff:templater:jinja:context]
target = prod

[sqlfluff:rules]
tab_space_size = 4
max_line_length = 100
```

- Add a file `.sqlfluffignore` with the following lines:

```
target/
dbt_modules/
dbt_packages/
macros/
tests/
```
- Add a folder `.vscode` at the same folder level as `logs` and `waffle_shop`.
- Add a file `settings.json` inside the `.vscode` folder, with the following content:

```json
{
  // change this to where your dbt virtual environment's python binary is located
  "python.defaultInterpreterPath": "./dbt-venv/bin/python",

  "files.associations": {
    // associate all sql files in the folder with jinja sql
    "*.sql": "jinja-sql",

    // optional: don't format models in the `target/` dir.
    // This disables syntax highlighting in the target folder,
    // which helps to very quickly identify that you are not in a model file
    // but in a compiled target file.
    "**/target/**": ""
  },
  "editor.quickSuggestions": {
    "strings": true
  },

  // run "whereis sqlfluff" to find the executable path of your sqlfluff installation
  "sqlfluff.executablePath": "/path/to/dbt-venv/bin/sqlfluff",
  "sqlfluff.dialect": "bigquery",
  "sqlfluff.linter.run": "onSave",
  "sqlfluff.experimental.format.executeInTerminal": true,
  "editor.formatOnSave": false,
  "dbt.queryLimit": 1000,
  "editor.rulers": [100]
}
```
- Install the dbt power user vscode extension
If you follow the steps above, you'll get a basic dbt setup with vscode as a sort of dbt IDE. The developer experience is quite nice, and you have a development system at hand to start working with dbt.
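To build a bit of intuition before the next post: when dbt runs, it first compiles each model by rendering its Jinja - most prominently, replacing `ref('some_model')` with the fully-qualified table name in your warehouse. The toy sketch below mimics that substitution step with a plain regex (an illustration of the idea only, not dbt's actual implementation, which renders full Jinja against its manifest):

```python
import re

# Toy stand-in for dbt's compile step: replace {{ ref('...') }} with a
# fully-qualified BigQuery-style table name. Project and dataset values
# mirror the dbt init answers from this post.
def render_refs(model_sql: str, project: str, dataset: str) -> str:
    pattern = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")
    return pattern.sub(lambda m: f"`{project}.{dataset}.{m.group(1)}`", model_sql)

raw = "select * from {{ ref('raw_orders') }}"
print(render_refs(raw, "devopsandmore", "dbt_devops"))
# Prints: select * from `devopsandmore.dbt_devops.raw_orders`
```

This is also why the compiled files in `target/` look like ordinary SQL: all the Jinja has already been resolved there.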
In the next post, you'll see how to create your first dbt data model, based on some raw data in your data warehouse.
------------------
Interested in how to train your very own Large Language Model?
We prepared a well-researched guide on how to use the latest advancements in open source technology to fine-tune your own LLM. This has many advantages, like:
- Cost control
- Data privacy
- Excellent performance - adjusted specifically for your intended use