What is dbt - and how to set up your dbt developer environment?
Most of the world nowadays agrees that having some sort of data warehouse is key to a successful data operation. Every company - no matter how big or small - uses quite a number of tools which all produce data. Combining the data from these tools is essential to make use of it. That's - in short - the sales pitch for a data warehouse. Compared to the very expensive and huge "enterprise warehouses" of 20 years ago, there is a new breed of warehouses available - at marginal cost and with incredible scalability. The "modern data warehouse": a warehouse which is feasible for any company, no matter its budget or size.
Now, let's assume that we succeeded in establishing the warehouse technology and integrating all our data sources into this warehouse. What do we have? A bunch of raw data, all organized in a central place. Which is already good - but now comes the work of making use of it.
And here is where dbt comes into play. This post will cover two basic topics: what dbt is, and how to set up a dbt developer environment.
What is dbt?
It's rare these days that one can simply navigate to a tool's website and get a perfect description of what it does. dbt actually is one of these tools. So, I have the pleasure of simply quoting dbt's website:
dbt transforms data in your data warehouse and helps data engineers work like software engineers - to ship trusted data, incrementally and fast.
dbt does this by providing tools to:
- use supercharged SQL (more on that later) to transform your raw data into usable, well-defined models
- provide easy-to-maintain documentation tools
- add a thin data testing layer to make sure your data is what you expect
This marketing image summarizes this quite well.
dbt Marketing Slide
Summarized, dbt provides us with an easy-to-use, easy-to-maintain way of modelling our raw data into multiple ready-to-use datasets which can be utilized by downstream operations like machine learning, BI and data analytics. dbt does this by providing a powerful SQL interface plus tools for documentation and testing.
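To make "supercharged SQL" concrete: a dbt model is a single select statement in a `.sql` file, with Jinja functions like `ref()` filling in table references. A minimal, hypothetical example (the model and source names here are made up for illustration):

```sql
-- models/customer_orders.sql (hypothetical example)
select
    c.customer_id,
    c.first_name,
    count(o.order_id) as order_count
from {{ ref('raw_customers') }} as c
left join {{ ref('raw_orders') }} as o
    on c.customer_id = o.customer_id
group by 1, 2
```

You only write the select; dbt resolves each `ref()` to the right table in your warehouse and wraps the query in the DDL needed to materialize it as a table or view.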
If you want to see dbt in action, skip to section How to model data with dbt?.
How to set up dbt?
In this section, we will install dbt and create a nice and tidy vscode development setup.
dbt core - the open source and, well... core offering of dbt - is actually just a Python package one needs to install.
Besides the dbt main application, you can install one or several database adapters.
Run these commands to install dbt.
```shell
# Optional but highly recommended: create and activate a virtual environment
# Instead of venv, you might also use anaconda for environment management
python3 -m venv dbt-venv
source dbt-venv/bin/activate

# Install the core application
pip install dbt-core

# Install one or several of the following adapters
pip install dbt-bigquery       # If you want to model data in your bigquery data warehouse
pip install dbt-athena-adapter # If you want to model data in your athena data warehouse
pip install dbt-postgres       # If you want to model data in your postgres data warehouse
pip install dbt-synapse        # If you want to model data in your synapse data warehouse
pip install dbt-databricks     # If you want to model data in your databricks platform
pip install dbt-redshift       # If you want to model data in your redshift data warehouse
pip install dbt-snowflake      # If you want to model data in your snowflake data warehouse
pip install dbt-spark          # If you want to model data in your spark platform
```
These are all the dependencies one needs to start developing with dbt!
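Since dbt core and its adapters are ordinary Python packages, you can quickly sanity-check what landed in your environment. A small sketch using only the standard library (the package names match the pip install commands above; run it inside the activated dbt-venv):

```python
from importlib import metadata

# Sanity check: see which dbt packages ended up in the active environment.
# "not installed" simply means that adapter was not picked in the
# pip install step above.
def installed_versions(packages):
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = "not installed"
    return versions

for pkg, version in installed_versions(["dbt-core", "dbt-bigquery"]).items():
    print(f"{pkg}: {version}")
```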
For developing our data models we are going to use VSCode - as it provides awesome extensions which make working with dbt fun and exciting.
- If not already done, install vscode
- Set up the folder structure as follows
- Add a `.gitignore` file with the following content:

```
target/
dbt_modules/
dbt_packages/
logs/
venv/
env/
test.env
__pycache__
.vscode
export.json
```

- Run `dbt init` and follow the prompts. The init application might ask different questions, depending on which database adapter you are going to use. Here, I'm using BigQuery as an example (make sure you have installed the appropriate adapter - see the `pip install` steps above):

```shell
dbt init

# This is the result
17:06:52 Running with dbt=1.4.5
Enter a name for your project (letters, digits, underscore): waffle_shop
Which database would you like to use?
[1] bigquery

(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)

Enter a number: 1
[1] oauth
[2] service_account
Desired authentication method option (enter a number): 2
keyfile (/path/to/bigquery/keyfile.json): /tmp/google_key.json
project (GCP project id): devopsandmore
dataset (the name of your dbt dataset): dbt_devops
threads (1 or more): 8
job_execution_timeout_seconds [300]: 300
[1] US
[2] EU
Desired location option (enter a number): 2
17:07:25 Profile waffle_shop written to /home/andreas/.dbt/profiles.yml using targets profile_template.yml and your supplied values. Run 'dbt debug' to validate the connection.
17:07:25
Your new dbt project "waffle_shop" was created!

For more information on how to configure the profiles.yml file,
please consult the dbt documentation here:

https://docs.getdbt.com/docs/configure-your-profile

One more thing:

Need help? Don't hesitate to reach out to us via GitHub issues or on Slack:

https://community.getdbt.com/

Happy modeling!
```

Finishing this installer will create the following folder structure:
```
.
├── logs
│   └── dbt.log
└── waffle_shop
    ├── README.md
    ├── analyses
    ├── dbt_project.yml
    ├── macros
    ├── models
    │   └── example
    │       ├── my_first_dbt_model.sql
    │       ├── my_second_dbt_model.sql
    │       └── schema.yml
    ├── seeds
    ├── snapshots
    └── tests
```

- logs: This folder contains the logs of the dbt init operation. You can delete it.
- waffle_shop: This is the main dbt project folder. All your model SQL files, description files and settings are located here. The name of the folder is the name of the dbt project you entered during `dbt init`.
- README.md: Readme file. Feel free to change it.
- analyses: Folder which might contain dbt analyses. Despite its name, analyses are not the main thing in dbt - "models" are.
- dbt_project.yml: Project settings file
- macros: Folder to store dbt macros
- models: This is the folder you will spend most of your time in. It contains all dbt models - as a reminder, dbt models are "just" SQL files, enhanced with Jinja template code.
- example: Delete this folder
- seeds: Folder which might contain dbt seeds
- snapshots: Folder which might contain dbt snapshots
- tests: Folder containing custom dbt tests
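Note that the folder tree above does not contain the connection details you entered during `dbt init` - those are written to `~/.dbt/profiles.yml` instead. With the BigQuery answers from the transcript above, that file should look roughly like this (a sketch - double-check the exact keys against the dbt BigQuery adapter documentation):

```yaml
waffle_shop:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      keyfile: /tmp/google_key.json
      project: devopsandmore
      dataset: dbt_devops
      threads: 8
      job_execution_timeout_seconds: 300
      location: EU
```

Keeping credentials outside the project folder means you can commit the project to git without leaking secrets.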
- Install vscode-dbt. This adds support for the dbt syntax (dbt-sql, jinja, yaml, markdown)
- Install sqlfluff:

```shell
pip install sqlfluff
```

- Install the sqlfluff vscode extension
- Add a file `.sqlfluff` to the top-level folder, with the following content:

```ini
[sqlfluff]
# Choose the sql dialect to use
dialect = bigquery
# Use jinja templater instead of dbt, as the dbt templater is slow
# and does not work with bigquery dialect
templater = jinja
# Optional: Exclude some rules
# Find all rules here: https://docs.sqlfluff.com/en/stable/rules.html
exclude_rules = L034, L032, L024, L015, L008, L001
ignore_templated_areas = True
large_file_skip_byte_limit = 200000

[sqlfluff:templater:jinja]
# Make dbt default macros available like `ref`
apply_dbt_builtins = true
# Load custom macros from the macro directory
load_macros_from_path = ./macros/
# Allow mocking dbt_utils and other packages through
# python files placed in this dir
library_path = ./sqlfluff_libs/

# Mock variables that are used in dbt
[sqlfluff:templater:jinja:context]
target = prod

[sqlfluff:rules]
tab_space_size = 4
max_line_length = 100
```

- Add a file `.sqlfluffignore` with the following lines:

```
target/
dbt_modules/
dbt_packages/
macros/
tests/
```
- Add a folder `.vscode` at the same folder level as `logs` and `waffle_shop`.
- Add a file `settings.json` inside the `.vscode` folder, with the following content:

```json
{
  // change this to where your dbt virtual environment's python binary is located
  "python.defaultInterpreterPath": "./dbt-venv/bin/python",

  "files.associations": {
    // associate all sql files in the folder with jinja sql
    "*.sql": "jinja-sql",

    // optional: don't format models in the `target/` dir.
    // This disables syntax highlighting in the target folder,
    // which helps to very quickly identify that you are not in a model file
    // but in a compiled target file.
    "**/target/**": ""
  },
  "editor.quickSuggestions": {
    "strings": true
  },

  // run "whereis sqlfluff" to find the executable path of your sqlfluff installation
  "sqlfluff.executablePath": "/path/to/dbt-venv/bin/sqlfluff",
  "sqlfluff.dialect": "bigquery",
  "sqlfluff.linter.run": "onSave",
  "sqlfluff.experimental.format.executeInTerminal": true,
  "editor.formatOnSave": false,
  "dbt.queryLimit": 1000,
  "editor.rulers": [100]
}
```
- Install the dbt power user vscode extension
If you follow the steps above, you'll get a basic dbt setup with vscode as a sort of dbt IDE. The developer experience is quite nice, and you have a development system at hand to start working with dbt.
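To build a bit of intuition before the next post: when dbt runs, it first compiles each model by rendering its Jinja - most prominently, replacing `ref('some_model')` with the fully-qualified table name in your warehouse. The toy sketch below mimics that substitution step with a plain regex (an illustration of the idea only, not dbt's actual implementation, which renders full Jinja against its manifest):

```python
import re

# Toy stand-in for dbt's compile step: replace {{ ref('...') }} with a
# fully-qualified BigQuery-style table name. Project and dataset values
# mirror the dbt init answers from this post.
def render_refs(model_sql: str, project: str, dataset: str) -> str:
    pattern = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")
    return pattern.sub(lambda m: f"`{project}.{dataset}.{m.group(1)}`", model_sql)

raw = "select * from {{ ref('raw_orders') }}"
print(render_refs(raw, "devopsandmore", "dbt_devops"))
# Prints: select * from `devopsandmore.dbt_devops.raw_orders`
```

This is also why the compiled files in `target/` look like ordinary SQL: all the Jinja has already been resolved there.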
In the next post, you'll see how to create your first dbt data model, based on some raw data in your data warehouse.
------------------
Interested in how to train your very own Large Language Model?
We prepared a well-researched guide on how to use the latest advancements in open source technology to fine-tune your own LLM. This has many advantages, like:
- Cost control
- Data privacy
- Excellent performance - adjusted specifically for your intended use