
Developer Blog | dbt Developer Hub

Find tutorials, product updates, and developer insights in the dbt Developer Blog.

Start here

· 9 min read
Amy Chen

Overview

Over the course of my three years running the Partner Engineering team at dbt Labs, the most common question I've been asked is, "How do we integrate with dbt?" Because those conversations often start out at the same place, I decided to create this guide so I'm no longer the blocker to fundamental information. This also lets us skip the intro and get to the fun conversations much faster, like what a joint solution for our customers would look like.

This guide doesn't include how to integrate with dbt Core. If you’re interested in creating a dbt adapter, please check out the adapter development guide instead.

Instead, we're going to focus on integrating with dbt Cloud. Integrating with dbt Cloud is a key requirement to become a dbt Labs technology partner, opening the door to a variety of collaborative commercial opportunities.

Here I'll cover how to get started, potential use cases you may want to solve for, and the integration points that support them.

New to dbt Cloud?

If you're new to dbt and dbt Cloud, we recommend you and your software developers try our Getting Started Quickstarts after reading What is dbt. The documentation will help you familiarize yourself with how our users interact with dbt. By going through this, you will also create a sample dbt project to test your integration.

If you require a partner dbt Cloud account to test on, we can upgrade an existing account or a trial account. This account may only be used for development, training, and demonstration purposes. Please contact your partner manager if you're interested and provide your account ID (found in the URL). Our partner account includes all of the enterprise-level functionality and can be provided once a partnership agreement is signed.

Integration points

  • Discovery API (formerly referred to as Metadata API)
    • Overview: This GraphQL API lets you query the metadata that dbt Cloud generates every time you run a dbt project. Two schemas are available (environment level and job level). By default, we recommend integrating with the environment-level schema because it contains the latest state and the historical run results of all the jobs run on the dbt Cloud project. The job-level schema only provides the metadata of a single job, giving you a small snapshot of part of the project. (See the environment-level query sketch after this list.)
  • Administrative (Admin) API
    • Overview: This REST API allows you to orchestrate dbt Cloud job runs and helps you administer a dbt Cloud account. For metadata retrieval, we recommend integrating with the Discovery API instead.
  • Webhooks
    • Overview: Outbound webhooks can send notifications about your dbt Cloud jobs to other systems. These webhooks allow you to get the latest information about your dbt jobs in real time.
  • Semantic Layers/Metrics
    • Overview: Our Semantic Layer is made up of two parts: metrics definitions and the ability to interactively query the dbt metrics. For more details, here is a basic overview and our best practices.
    • Metrics definitions can be pulled from the Discovery API (linked above) or the Semantic Layer Driver/GraphQL API. The key difference is that the Discovery API can't pull the semantic graph, which provides the list of available dimensions that can be queried per metric; that is only available with the SL Driver/APIs. The trade-off is that the SL Driver/APIs don't have access to the lineage of the entire dbt project (that is, how the dbt metrics depend on dbt models).
    • Three integration points are available for the Semantic Layer API.
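
To make the Discovery API concrete, here is a minimal sketch of an environment-level query in Python. The endpoint URL, the environment ID, and the exact field names are assumptions to verify against the Discovery API documentation and the customer's dbt Cloud region; authentication uses an account service token (covered below).

```python
# Minimal sketch: query the Discovery API (environment-level schema) for the
# latest state of each model. Endpoint, environment ID, and field names are
# illustrative -- confirm them against the Discovery API docs for the
# customer's dbt Cloud region and plan.
import os
import requests

DISCOVERY_URL = "https://metadata.cloud.getdbt.com/graphql"  # assumed multi-tenant endpoint
SERVICE_TOKEN = os.environ["DBT_CLOUD_SERVICE_TOKEN"]        # account service token
ENVIRONMENT_ID = 1234                                        # hypothetical environment ID

QUERY = """
query LatestModelState($environmentId: BigInt!, $first: Int!) {
  environment(id: $environmentId) {
    applied {
      models(first: $first) {
        edges {
          node {
            name
            uniqueId
            executionInfo {
              lastRunStatus
              executeCompletedAt
            }
          }
        }
      }
    }
  }
}
"""

response = requests.post(
    DISCOVERY_URL,
    headers={"Authorization": f"Bearer {SERVICE_TOKEN}"},
    json={"query": QUERY, "variables": {"environmentId": ENVIRONMENT_ID, "first": 100}},
    timeout=30,
)
response.raise_for_status()

# Print each model's latest run status and completion time.
for edge in response.json()["data"]["environment"]["applied"]["models"]["edges"]:
    node = edge["node"]
    info = node.get("executionInfo") or {}
    print(node["uniqueId"], info.get("lastRunStatus"), info.get("executeCompletedAt"))
```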

dbt Cloud hosting and authentication

To use the dbt Cloud APIs, you'll need access to the customer's access URL. Depending on their dbt Cloud setup, customers will have different access URLs; refer to Regions & IP addresses to understand all the possible configurations. My recommendation is to have the customer provide their own access URL to simplify support.

If the customer is on an Azure single tenant instance, they don't currently have access to the Discovery API or the Semantic Layer APIs.

For authentication, we highly recommend that your integration use account service tokens. You can read more about how to create a service token and what permission sets to give it. Note that depending on their plan type, customers will have access to different permission sets. We do not recommend that users supply their user bearer tokens for authentication: a user token grants access to all the dbt Cloud accounts associated with that user, rather than just the account (and related projects) they want to integrate with, and it can cause issues if the user leaves the organization.
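
As a minimal illustration of service-token authentication, the sketch below calls the Admin API using a customer-provided access URL and account ID. The "Token" header scheme and the /api/v2/ path are assumptions to confirm against the Admin API reference for the customer's region.

```python
# Minimal sketch: authenticate to the Admin API with an account service token.
# The access URL and account ID come from the customer; the "Token" auth scheme
# and the /api/v2/ path are assumptions to verify against the Admin API docs.
import os
import requests

ACCESS_URL = os.environ.get("DBT_CLOUD_ACCESS_URL", "https://cloud.getdbt.com")  # customer-provided
ACCOUNT_ID = os.environ["DBT_CLOUD_ACCOUNT_ID"]        # from the customer's dbt Cloud URL
SERVICE_TOKEN = os.environ["DBT_CLOUD_SERVICE_TOKEN"]  # account service token, not a user token

session = requests.Session()
session.headers.update({
    "Authorization": f"Token {SERVICE_TOKEN}",
    "Accept": "application/json",
})

# List the projects in the account as a simple connectivity/permissions check.
resp = session.get(f"{ACCESS_URL}/api/v2/accounts/{ACCOUNT_ID}/projects/", timeout=30)
resp.raise_for_status()
for project in resp.json()["data"]:
    print(project["id"], project["name"])
```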

Potential use cases

  • Event-based orchestration
    • Desired action: You want to be notified when a scheduled dbt Cloud job has completed, or you want to kick off a dbt Cloud job. You can align your product's schedule with the dbt Cloud run schedule.
    • Examples: Kicking off a dbt Cloud job after the ETL job that extracts and loads the data has completed, or receiving a webhook after a job has completed to kick off your reverse ETL job. (A sketch of triggering a job run follows this list.)
    • Integration points: Webhooks and/or Admin API
  • dbt lineage
    • Desired action: You want to incorporate dbt lineage metadata into your tool.
    • Example: In your tool, you want to pull the dbt DAG into your lineage diagram. For details on what you can pull and how to do this, refer to Use cases and examples for the Discovery API.
    • Integration points: Discovery API
  • dbt environment/job metadata
    • Desired action: You want to incorporate dbt Cloud job information into your tool, including job status, the status of the tables built in a run, which tests passed, etc.
    • Example: In your business intelligence tool, stakeholders select from tables that a dbt model created. You show when the model last ran and passed its tests, so stakeholders know the tables are current and can be trusted. For details on what you can pull and how to do this, refer to What's the latest state of each model.
    • Integration points: Discovery API
  • dbt model documentation
    • Desired action: You want to incorporate dbt project information into your tool, including model descriptions, column descriptions, etc.
    • Example: You want to extract the dbt model descriptions so you can display them and help stakeholders understand what they are selecting from. This way, model creators can pass on that context without updating another system. For details on what you can pull and how to do this, refer to What does this dataset and its columns mean.
    • Integration points: Discovery API
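
For the event-based orchestration use case above, here is a minimal sketch that triggers a dbt Cloud job run and polls its status through the Admin API. The endpoint paths and the numeric status codes are assumptions to verify against the Admin API documentation; it reuses the authenticated `session`, `ACCESS_URL`, and `ACCOUNT_ID` from the earlier authentication sketch.

```python
# Minimal sketch: kick off a dbt Cloud job after an upstream EL load completes,
# then poll until the run finishes. Endpoint paths and status codes are
# assumptions to confirm against the Admin API docs. Reuses the authenticated
# `session`, ACCESS_URL, and ACCOUNT_ID from the previous sketch.
import time

JOB_ID = 5678  # hypothetical dbt Cloud job ID

# Trigger the run with a human-readable cause for the audit trail.
trigger = session.post(
    f"{ACCESS_URL}/api/v2/accounts/{ACCOUNT_ID}/jobs/{JOB_ID}/run/",
    json={"cause": "Triggered by partner tool after EL load completed"},
    timeout=30,
)
trigger.raise_for_status()
run_id = trigger.json()["data"]["id"]

# Poll until the run reaches a terminal state (assumed status codes).
TERMINAL = {10: "Success", 20: "Error", 30: "Cancelled"}
while True:
    run = session.get(
        f"{ACCESS_URL}/api/v2/accounts/{ACCOUNT_ID}/runs/{run_id}/", timeout=30
    )
    run.raise_for_status()
    status = run.json()["data"]["status"]
    if status in TERMINAL:
        print(f"Run {run_id} finished: {TERMINAL[status]}")
        break
    time.sleep(30)
```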

dbt Core-only users have no access to the above integration points. For dbt metadata, our partners often build a dbt Core integration using the dbt artifact files generated by each run and provided by the user. The Discovery API provides a dynamic way to get the latest information, already parsed for you.

dbt Cloud plans & permissions

The dbt Cloud plan type determines what the user has access to. There are four plan types:

  • Developer: Free and available to a single user, with a limited number of successful models built. This plan can't access the APIs, webhooks, or Semantic Layer and is limited to just one project.
  • Team: Provides access to the APIs, webhooks, and Semantic Layer. You can have up to eight users on the account and one dbt Cloud project, and the plan is limited to 15,000 successful models built.
  • Enterprise (multi-tenant/multi-cell): Provides access to the APIs, webhooks, and Semantic Layer. You can have more than one dbt Cloud project, based on how many dbt projects/domains the customer has using dbt. The majority of our enterprise customers are on multi-tenant dbt Cloud instances.
  • Enterprise (single tenant): This plan might have access to the APIs, webhooks, and Semantic Layer. If you're working with a specific customer, let us know and we can confirm whether their instance has access.

FAQs

  • What is a dbt Cloud project?
    • A dbt Cloud project is made up of two connections: one to the Git repository and one to the data warehouse/platform. Most customers will have only one dbt Cloud project in their account, but some enterprise clients have more depending on their use cases. The project also encapsulates two types of environments at a minimum: a development environment and a deployment environment.
    • Folks commonly refer to the dbt project as the code hosted in their Git repository.
  • What is a dbt Cloud environment?
    • For an overview, check out About environments. At a minimum, a project will have one deployment-type environment that jobs are executed in. The development environment powers the dbt Cloud IDE and Cloud CLI.
  • Can we write back to the dbt project?
    • At this moment, we don't have a Write API. A dbt project is hosted in a Git repository, so if you have a Git provider integration, you can manually open a pull request (PR) on the project to maintain the version control process.
  • Can you provide column-level information in the lineage?
    • Column-level lineage is currently in beta release with more information to come.
  • How do I get a Partner Account?
    • Contact your Partner Manager with your account ID (in your URL).
  • Why shouldn't I use the Admin API to pull out the dbt artifacts for metadata?
    • We recommend not integrating with the Admin API to extract the dbt artifacts for metadata. The Discovery API provides more extensive information, a user-friendly structure, and a more reliable integration point.
  • How do I get access to the dbt brand assets?
    • Check out our Brand guidelines page. Please make sure you’re not using our old logo (hint: there should only be one hole in the logo). Please also note that the name dbt and the dbt logo are trademarked by dbt Labs, and that use is governed by our brand guidelines, which are fairly specific for commercial uses. If you have any questions about proper use of our marks, please ask your partner manager.
  • How do I engage with the partnerships team?

· 9 min read
Jordan Stein

There’s nothing quite like the feeling of launching a new product. On launch day emotions can range from excitement, to fear, to accomplishment all in the same hour. Once the dust settles and the product is in the wild, the next thing the team needs to do is track how the product is doing. How many users do we have? How is performance looking? What features are customers using? How often? Answering these questions is vital to understanding the success of any product launch.

At dbt we recently made the Semantic Layer Generally Available. The Semantic Layer lets teams define business metrics centrally, in dbt, and access them in multiple analytics tools through our semantic layer APIs. I’m a Product Manager on the Semantic Layer team, and the launch of the Semantic Layer put our team in an interesting, somewhat “meta,” position: we need to understand how a product launch is doing, and the product we just launched is designed to make defining and consuming metrics much more efficient. It’s the perfect opportunity to put the semantic layer through its paces for product analytics. This blog post walks through the end-to-end process we used to set up product analytics for the dbt Semantic Layer using the dbt Semantic Layer.

· 5 min read
Joel Labes
The Bottom Line:

You should split your Jobs across Environments in dbt Cloud based on their purposes (e.g. Production and Staging/CI) and set one environment as Production. This will improve your CI experience and enable you to use dbt Explorer.

Environmental segmentation has always been an important part of the analytics engineering workflow:

  • When developing new models, you can process a smaller subset of your data by using target.name or an environment variable.
  • By building your production-grade models into a different schema and database, you can experiment in peace without being worried that your changes will accidentally impact downstream users.
  • Using dedicated credentials for production runs, instead of an analytics engineer's individual dev credentials, ensures that things don't break when that long-tenured employee finally hangs up their IDE.

Historically, dbt Cloud required a separate environment for Development, but was otherwise unopinionated in how you configured your account. This mostly just worked – as long as you didn't have anything more complex than a CI job mixed in with a couple of production jobs – because important constructs like deferral in CI and documentation were only ever tied to a single job.

But as companies' dbt deployments have grown more complex, it doesn't make sense to assume that a single job is enough anymore. We need to exchange a job-oriented strategy for a more mature and scalable environment-centric view of the world. To support this, a recent change in dbt Cloud enables project administrators to mark one of their environments as the Production environment, just as has long been possible for the Development environment.

Explicitly separating your Production workloads lets dbt Cloud be smarter with the metadata it creates, and is particularly important for two new features: dbt Explorer and the revised CI workflows.

· 6 min read
Kshitij Aranke
Doug Beatty

Hi all, I’m Kshitij, a senior software engineer on the Core team at dbt Labs. One of the coolest moments of my career here thus far has been shipping the new dbt clone command as part of the dbt-core v1.6 release.

However, one of the questions I’ve received most frequently is guidance around “when” to clone that goes beyond the documentation on “how” to clone. In this blog post, I’ll attempt to provide this guidance by answering these FAQs:

  1. What is dbt clone?
  2. How is it different from deferral?
  3. Should I defer or should I clone?

· 11 min read
Amy Chen
note

This blog post was updated on December 18, 2023 to cover the support of MVs on dbt-bigquery and updates on how to test MVs.

Introduction

The year was 2020. I was a kitten-only household, and dbt Labs was still Fishtown Analytics. An enterprise customer I was working with, JetBlue, asked me for help running their dbt models every 2 minutes to meet a 5-minute SLA.

After getting over the initial terror, we talked through the use case and soon realized there was a better option. Together with my team, I created lambda views to meet the need.

Flash forward to 2023. I'm writing this as my giant dog snores next to me (don't worry, the cats have multiplied as well). JetBlue has outgrown lambda views due to performance constraints (a view can only be so performant), and we are at another milestone in dbt's journey to support streaming. What. a. time.

Today we are announcing that we now support Materialized Views in dbt. So, what does that mean?

· 8 min read
Pedro Brito de Sa

Whether you are building your pipelines in dbt for the first time or just adding a new model once in a while, good documentation and testing should always be a priority for you and your team. Why do we avoid it like the plague then? Because it's a hassle having to write down each individual field and its description in layman's terms, and to figure out what tests should be performed to ensure the data is fine and dandy. How can we make this process faster and less painful?

By now, everyone knows the wonders of the GPT models for code generation and pair programming so this shouldn’t come as a surprise. But ChatGPT really shines at inferring the context of verbosely named fields from database table schemas. So in this post I am going to help you 10x your documentation and testing speed by using ChatGPT to do most of the leg work for you.

· 15 min read
Rastislav Zdechovan
Sean McIntyre

Data Vault 2.0 is a data modeling technique designed to help scale large data warehousing projects. It is a rigid, prescriptive system detailed rigorously in a book that has become the bible for this technique.

So why Data Vault? Have you experienced a data warehousing project with 50+ data sources, with 25+ data developers working on the same data platform, or data spanning 5+ years with two or more generations of source systems? If not, it might be hard to initially understand the benefits of Data Vault, and maybe Kimball modelling is better for you. But if you are in any of the situations listed, then this is the article for you!

· 14 min read
Santiago Jauregui

Introduction

Most data modeling approaches for customer segmentation are based on a wide table with user attributes. This table only stores the current attributes for each user, and is then loaded into the various SaaS platforms via Reverse ETL tools.

Take for example a Customer Experience (CX) team that uses Salesforce as a CRM. Users create tickets to ask for assistance, and the CX team attends to them in the order they are created. This is a good first approach, but not a data-driven one.

An improvement to this would be to prioritize the tickets based on the customer segment, answering our most valuable customers first. An Analytics Engineer can build a segmentation to identify the power users (for example with an RFM approach) and store it in the data warehouse. The Data Engineering team can then export that user attribute to the CRM, allowing the customer experience team to build rules on top of it.

· 9 min read
Mikael Thorup

At Lunar, most of our dbt models are sourcing from event-driven architecture. As an example, we have the following models for our activity_based_interest folder in our ingestion layer:

  • activity_based_interest_activated.sql
  • activity_based_interest_deactivated.sql
  • activity_based_interest_updated.sql
  • downgrade_interest_level_for_user.sql
  • set_inactive_interest_rate_after_july_1st_in_bec_for_user.sql
  • set_inactive_interest_rate_from_july_1st_in_bec_for_user.sql
  • set_interest_levels_from_june_1st_in_bec_for_user.sql

This results in a lot of the same columns (e.g. account_id) existing in different models, across different layers. This means I end up:

  1. Writing/copy-pasting the same documentation over and over again
  2. Halfway through, realizing I could improve the wording to make it easier to understand, and going back to update the .yml files I already wrote
  3. Realizing I made a syntax error in my .yml file, and going back to fix it
  4. Realizing the columns are defined differently, with different wording, in other folders in our dbt project
  5. Reconsidering my choice of career and praying that a large language model will steal my job
  6. Wondering whether there's a better way to generate documentation used across different models

· 18 min read
Sterling Paramore

This article covers an approach to handling time-varying ragged hierarchies in a dimensional model. These kinds of data structures are commonly found in manufacturing, where components of a product have both parents and children of arbitrary depth and those components may be replaced over the product's lifetime. The strategy described here simplifies many common types of analytical and reporting queries.

To help visualize this data, we're going to pretend we are a company that manufactures and rents out eBikes in a ride share application. When we build a bike, we keep track of the serial numbers of the components that make up the bike. Any time something breaks and needs to be replaced, we track the old parts that were removed and the new parts that were installed. We also precisely track the mileage accumulated on each of our bikes. Our primary analytical goal is to be able to report on the expected lifetime of each component, so we can prioritize improving that component and reduce costly maintenance.

· 12 min read
Sung Won Chung
Kira Furuichi

I, Sung, entered the data industry by chance in Fall 2014. I was using this thing called audit command language (ACL) to automate debits equal credits for accounting analytics (yes, it’s as tedious as it sounds). I remember working my butt off in a hotel room in Des Moines, Iowa where the most interesting thing there was a Panda Express. It was late in the AM. I’m thinking about 2 am. And I took a step back and thought to myself, “Why am I working so hard for something that I just don’t care about with tools that hurt more than help?”

· 5 min read
Callum McCann

Hello, my dear data people.

If you haven’t read Nick & Roxi’s blog post about what’s coming in the future of the dbt Semantic Layer, I highly recommend you read through that, as it gives helpful context around what the future holds.

With that said, it has come time for us to bid adieu to our beloved dbt_metrics package. Upon the release of dbt-core v1.6 in late July, we will be deprecating support for the dbt_metrics package.

· 12 min read
Arthur Marcon
Lucas Bergo Dias
Christian van Bellen

Alteryx is a visual data transformation platform with a user-friendly interface and drag-and-drop tools. Nonetheless, Alteryx may have difficulty coping with increasing complexity in an organization's data pipeline, and it can become a suboptimal tool when companies start dealing with large and complex data transformations. In such cases, moving to dbt can be a natural step, since dbt is designed to manage complex data transformation pipelines in a scalable, efficient, and more explicit manner. In our case, this transition also involved migrating from on-premises SQL Server to Snowflake. In this article, we describe the differences between Alteryx and dbt, and how we reduced a client's 6-hour runtime in Alteryx to 9 minutes with dbt and Snowflake at Indicium Tech.

· 20 min read
Jonathan Neo
Dimensional modeling is one of many data modeling techniques that are used by data practitioners to organize and present data for analytics. Other data modeling techniques include Data Vault (DV), Third Normal Form (3NF), and One Big Table (OBT), to name a few.

[Figure: Data modeling techniques on a normalization vs. denormalization scale]

While the relevance of dimensional modeling has been debated by data practitioners, it is still one of the most widely adopted data modeling techniques for analytics.

Despite its popularity, resources on how to create dimensional models using dbt remain scarce and lack detail. This tutorial aims to solve this by providing the definitive guide to dimensional modeling with dbt.

By the end of this tutorial, you will:

  • Understand dimensional modeling concepts
  • Set up a mock dbt project and database
  • Identify the business process to model
  • Identify the fact and dimension tables
  • Create the dimension tables
  • Create the fact table
  • Document the dimensional model relationships
  • Consume the dimensional model

· 12 min read
João Antunes
Yannick Misteli
Sean McIntyre

Teams thrive when each team member is provided with the tools that best complement and enhance their skills. You wouldn’t hand Cristiano Ronaldo a tennis racket and expect a perfect serve! At Roche, getting the right tools in the hands of our teammates was critical to our ability to grow our data team from 10 core engineers to over 100 contributors in just two years. We embraced both dbt Core and dbt Cloud at Roche (a dbt-squared solution, if you will!) to quickly scale our data platform.

· 7 min read
Benoit Perigaud

Editor's note: this post assumes intermediate knowledge of Jinja and macro development in dbt. For an introduction to Jinja in dbt, check out the documentation and the free self-serve course on Jinja, Macros, Packages.

Jinja brings a lot of power to dbt, allowing us to use ref(), source(), conditional code, and macros. But while Jinja brings flexibility, it also brings complexity, and as often happens with code, things can run in unexpected ways.

The debug() macro in dbt is a great tool to have in the toolkit for someone writing a lot of Jinja code, but it might be difficult to understand how to use it and what benefits it brings.

Let’s dive into the last time I used debug() and how it helped me solve bugs in my code.

· 15 min read
Arthur Marcon
Lucas Bergo Dias
Christian van Bellen

Auditing tables is a major part of analytics engineers’ daily tasks, especially when refactoring tables that were built using SQL Stored Procedures or Alteryx Workflows. In this article, we present how the audit_helper package can (as the name suggests) help the table auditing process to make sure a refactored model provides (pretty much) the same output as the original one, based on our experience using this package to support our clients at Indicium Tech®.

· 9 min read
Callie White
Jade Milaney

The new dbt Certification Program has been created by dbt Labs to codify the data development best practices that enable safe, confident, and impactful use of dbt. Taking the Certification allows dbt users to get recognized for the skills they’ve honed, and stand out to organizations seeking dbt expertise.

Over the last few months, Montreal Analytics, a full-stack data consultancy servicing organizations across North America, has had over 25 dbt Analytics Engineers become certified, earning them the 2022 dbt Platinum Certification award.

In this article, two Montreal Analytics consultants, Jade and Callie, discuss their experience in taking, and passing, the dbt Certification exam to help guide others looking to study for, and pass the exam.

· 8 min read
Noah Kennedy

Testing the quality of data in your warehouse is an important aspect of any mature data pipeline. One of the biggest blockers for developing a successful data quality pipeline is aggregating test failures and successes in an informational and actionable way. However, ensuring actionability can be challenging. If ignored, test failures can clog up a pipeline and create unactionable noise, rendering your testing infrastructure ineffective.