Skip to main content

About MetricFlow

This guide introduces MetricFlow's fundamental ideas for people new to this feature. MetricFlow, which powers the dbt Semantic Layer, helps you define and manage the logic for your company's metrics. It's an opinionated set of abstractions and helps data consumers retrieve metric datasets from a data platform quickly and efficiently.

MetricFlow handles SQL query construction and defines the specification for dbt semantic models and metrics. It allows you to define metrics in your dbt project and query them with MetricFlow commands whether in dbt Cloud or dbt Core.

Before you start, consider the following guidelines:

MetricFlow

MetricFlow is a SQL query generation tool designed to streamline metric creation across different data dimensions for diverse business needs.

  • It operates through YAML files, where a semantic graph links language to data. This graph comprises semantic models (data entry points) and metrics (functions for creating quantitative indicators).
  • MetricFlow is a BSL package with code source available, and compatible with dbt version 1.6 and higher. Data practitioners and enthusiasts are highly encouraged to contribute.
  • As a part of the dbt Semantic Layer, MetricFlow empowers organizations to define metrics using YAML abstractions.
  • To query metric dimensions, dimension values, and validate configurations, use MetricFlow commands.

Note MetricFlow doesn't support dbt builtin functions or packages at this time, however, support is planned for the future.

MetricFlow abides by these principles:

  • Flexibility with completeness: Define metric logic using flexible abstractions on any data model.
  • DRY (Don't Repeat Yourself): Minimize redundancy by enabling metric definitions whenever possible.
  • Simplicity with gradual complexity: Approach MetricFlow using familiar data modeling concepts.
  • Performance and efficiency: Optimize performance while supporting centralized data engineering and distributed logic ownership.

Semantic graph

We're introducing a new concept: a "semantic graph". It's the relationship between semantic models and YAML configurations that creates a data landscape for building metrics. You can think of it like a map, where tables are like locations, and the connections between them (edges) are like roads. Although it's under the hood, the semantic graph is a subset of the DAG, and you can see the semantic models as nodes on the DAG.

The semantic graph helps us decide which information is available to use for consumption and which is not. The connections between tables in the semantic graph are more about relationships between the information. This is different from the DAG, where the connections show dependencies between tasks.

When MetricFlow generates a metric, it uses its SQL engine to figure out the best path between tables using the framework defined in YAML files for semantic models and metrics. When these models and metrics are correctly defined, they can be used downstream with dbt Semantic Layer's integrations.

Semantic models

Semantic models are the starting points of data and correspond to models in your dbt project. You can create multiple semantic models from each model. Semantic models have metadata, like a data table, that define important information such as the table name and primary keys for the graph to be navigated correctly.

For a semantic model, there are three main pieces of metadata:

  • Entities The join keys of your semantic model (think of these as the traversal paths, or edges between semantic models).
  • Dimensions These are the ways you want to group or slice/dice your metrics.
  • Measures The aggregation functions that give you a numeric result and can be used to create your metrics.
A semantic model is made up of different components: Entities, Measures, and Dimensions.A semantic model is made up of different components: Entities, Measures, and Dimensions.

Metrics

Metrics, which is a key concept, are functions that combine measures, constraints, or other mathematical functions to define new quantitative indicators. MetricFlow uses measures and various aggregation types, such as average, sum, and count distinct, to create metrics. Dimensions add context to metrics and without them, a metric is simply a number for all time. You can define metrics in the same YAML files as your semantic models, or create a new file.

MetricFlow supports different metric types:

  • Cumulative Aggregates a measure over a given window.
  • Derived An expression of other metrics, which allows you to do calculations on top of metrics.
  • Ratio Create a ratio out of two measures, like revenue per customer.
  • Simple Metrics that refer directly to one measure.

Use case

In the upcoming sections, we'll show how data practitioners currently calculate metrics and compare it to how MetricFlow makes defining metrics easier and more flexible.

The following example data is based on the Jaffle Shop repo. You can view the complete dbt project. The tables we're using in our example model are:

  • orders is a production data platform export that has been cleaned up and organized for analytical consumption
  • customers is a partially denormalized table in this case with a column derived from the orders table through some upstream process

To make this more concrete, consider the metric order_total, which is defined using the SQL expression:

select sum(order_total) as order_total from orders This expression calculates the total revenue for all orders by summing the order_total column in the orders table. In a business setting, the metric order_total is often calculated according to different categories, such as"

  • Time, for example date_trunc(ordered_at, 'day')
  • Order Type, using is_food_order dimension from the orders table.

Calculate metrics

Next, we'll compare how data practitioners currently calculate metrics with multiple queries versus how MetricFlow simplifies and streamlines the process.

The following example displays how data practitioners typically would calculate the order_total metric aggregated. It's also likely that analysts are asked for more details on a metric, like how much revenue came from new customers.

Using the following query creates a situation where multiple analysts working on the same data, each using their own query method this can lead to confusion, inconsistencies, and a headache for data management.

select
date_trunc('day',orders.ordered_at) as day,
case when customers.first_ordered_at is not null then true else false end as is_new_customer,
sum(orders.order_total) as order_total
from
orders
left join
customers
on
orders.customer_id = customers.customer_id
group by 1, 2

FAQs

Do my datasets need to be normalized?
Not at all! While a cleaned and well-modeled data set can be extraordinarily powerful and is the ideal input, you can use any dataset from raw to fully denormalized datasets.

It's recommended that you apply quality data consistency, such as filtering bad data, normalizing common objects, and data modeling of keys and tables, in upstream applications. The Semantic Layer is more efficient at doing data denormalization instead of normalization.

If you have not invested in data consistency, that is okay. The Semantic Layer can take SQL queries or expressions to define consistent datasets.
Why is normalized data the ideal input?
MetricFlow is built to do denormalization efficiently. There are better tools to take raw datasets and accomplish the various tasks required to build data consistency and organized data models. On the other end, by putting in denormalized data you are potentially creating redundancy which is technically challenging to manage, and you are reducing the potential granularity that MetricFlow can use to aggregate metrics.
Why not just make metrics the same as measures?
One principle of MetricFlow is to reduce the duplication of logic sometimes referred to as Don't Repeat Yourself(DRY).

Many metrics are constructed from reused measures and in some cases constructed from measures from different semantic models. This allows for metrics to be built breadth-first (metrics that can stand alone) instead of depth-first (where you have multiple metrics acting as functions of each other).

Additionally, not all metrics are constructed off of measures. As an example, a conversion metric is likely defined as the presence or absence of an event record after some other event record.
How does the Semantic Layer handle joins?
MetricFlow builds joins based on the types of keys and parameters that are passed to entities. To better understand how joins are constructed see our documentation on join types.

Rather than capturing arbitrary join logic, MetricFlow captures the types of each identifier and then helps the user to navigate to appropriate joins. This allows us to avoid the construction of fan out and chasm joins as well as generate legible SQL.
Are entities and join keys the same thing?
If it helps you to think of entities as join keys, that is very reasonable. Entities in MetricFlow have applications beyond joining two tables, such as acting as a dimension.
Can a table without a primary or unique entities have dimensions?
Yes, but because a dimension is considered an attribute of the primary or unique ent of the table, they are only usable by the metrics that are defined in that table. They cannot be joined to metrics from other tables. This is common in event logs.
0