Building a data warehouse with dbt: a retrospective
Contents
- Introduction
- Overview
- A tour of dbt’s features
- What I actually used dbt for
- Tips on getting the most out of dbt
- Closing thoughts
Introduction
dbt (Data Build Tool) is an open-source framework that automates data transformation and modeling, and it’s widely adopted by data analytics and data engineering teams. The framework leans on SQL plus Jinja templating to define and execute every transformation in the warehouse.
I leaned on dbt heavily while building out the data warehouse for IOTRUST’s Wepin Workspace, and I plan to keep applying it as we ship new features and run internal analyses going forward.
Overview

Wepin Workspace — the “wallet user statistics” feature
The “user statistics” feature was built using ETL tooling, with most of the work concentrated on the transformation and orchestration layers. End-to-end, the pipeline stretches from the operational database all the way to the UI:

End-to-end data pipeline architecture
(1) Operational database replication. Data is streamed in near real-time from the operational database into a physically separated database via a distributed data-processing service.
(2) Sequential transformation layers.
- Core Layer: data cleansing, column standardization, cardinality conversion, and null handling; modeled with Star or Snowflake schemas.
- Mart Layer: business-logic-driven metrics, separated by metric type; strict Star Schema for query performance.
- Access Layer: a final pass of refinement for backend consumption; guarantees a fully sequential dataset with no gaps.
(3) Backend query execution. A GraphQL layer queries the Access Layer tables. Compared to traditional REST endpoints, this gives the consumer much more flexibility in shaping each request.
(4) Frontend presentation. The UI fires GraphQL requests as the user interacts with the page, and the response data is rendered into charts.
A tour of dbt’s features
dbt is laser-focused on automating the transformation layer. The framework breaks down into five primary capabilities:

The five core tasks dbt automates
(1) Snapshot. Automates periodic and accumulating snapshot fact tables, capturing point-in-time state (revenue, costs, inventory, balances) on a regular cadence.
(2) Transform. Runs all the transformation tasks in the project while resolving their dependencies automatically, producing a DAG-style execution graph. This makes the pipeline stable and lets you maintain individual marts without breaking anything downstream.
(3) Test. Provides automated data quality testing through built-in and custom tests — in my opinion, dbt’s single biggest selling point.
(4) Deploy. Compiles SQL and Jinja templates into the database’s native SQL dialect and then executes the corresponding DDL and DML:
- DDL: Create, Alter, Drop, Truncate
- DML: Select, Insert, Update, Delete
(5) Document. Auto-generates documentation for every table, column, description, dependency, and test — which dramatically smooths handoffs and ongoing maintenance.
What I actually used dbt for
1. Organized table management through directory structure
I classified tables by layer, by fact-vs-dimension type, and by mart assignment so that maintenance and ad-hoc edits would stay sane.

Example directory structure
2. Automated data quality testing
I leaned on three flavors of tests.
(1) Built-in generic tests:
unique— validates unique identifiersnot_null— enforces non-null columnsaccepted_values— restricts to an allow-list of valuesrelationships— enforces foreign-key constraints
(2) Custom singular tests. User-defined tests aimed at a specific table/column where any returned row is treated as a failure:
SELECT * FROM {{ ref('fct_balances') }}
WHERE balance < 0 LIMIT 1
(3) Custom generic tests. Parameterized tests that you can apply across multiple tables and columns:
{% test is_even(model, column_name) %}
WITH validation AS (
SELECT {{ column_name }} as even_field
FROM {{ model }}
),
validation_errors AS (
SELECT even_field FROM validation
WHERE (even_field % 2) = 1
)
SELECT * FROM validation_errors
{% endtest %}
Example YAML configuration:
version: 2
models:
- name: dim_users
description: Users Info
columns:
- name: user_id
description: Primary key
tests:
- unique
- not_null
- name: country_id
tests:
- not_null
- relationships:
to: ref('dim_countries')
field: country_id
3. Metadata-driven documentation
Enriching the YAML files unlocks a surprisingly complete set of auto-generated docs:
node_color— visual color coding to group related tablesdescription— table- and column-level definitionsmeta— owner attribution for accountabilitytests— the quality requirements attached to each column

A sample of the documentation interface that dbt generates
4. Tag-based model organization
config:
tags: ["core", "fact", "incremental"]
materialized: incremental
incremental_strategy: append
Tags make it easy to filter the documentation site and to scope which models you transform or test during a given maintenance window.
5. Materialization strategy configuration

Comparison of dbt’s materialization strategies
(1) View — stores only the query logic; no physical table is created.
config:
materialized: view
(2) Table — recreates the entire table on every run. Typically what you want for dimension tables.
config:
materialized: table
(3) Incremental — appends new rows instead of doing a full rebuild. Typically what you want for fact tables.
config:
materialized: incremental
incremental_strategy: append
Or with a merge strategy:
config:
materialized: incremental
incremental_strategy: merge
unique_key: [user_id]
merge_update_columns: [country]
(4) Ephemeral — used only within the dbt project at compile time, with nothing persisted to the database. A solid fit for staging-layer transformations.
Personally, I reserve merge for SCD Type 1 tables and lean on append for everything else, since append is meaningfully better for query performance.
6. Schema organization for client visibility
models:
wepin_workspace_dbt:
01_core:
dim:
+schema: 01_core.dim
fct:
+schema: 01_core.fct
02_mart:
01_users_cnt:
00_users_events_daily:
+schema: 02_mart.01_users_cnt.00_users_events_daily
Defining schemas explicitly keeps things clear in the database client even when the internal dbt organization is deeply nested.
Tips on getting the most out of dbt
1. The recommended DAG execution sequence
The orchestration order I settled on:

DAG workflow: git pull → dbt debug → dbt deps → dbt run → dbt docs generate → dbt test → Slack alert
Verifying each stage independently is what makes the pipeline trustworthy, and a final Slack notification (via the Python Slack API) makes sure the result actually gets read.
2. Manage profiles.yaml at the project level
Keeping a separate profiles.yaml per project improves security and avoids centralizing connection credentials in a single shared file:
$ export DBT_PROFILES_DIR=path/to/directory
3. Incremental strategies are deceptively tricky
Writing an incremental model requires careful Jinja templating to avoid generating duplicate rows:
WHERE
...
{% if is_incremental() %}
AND (SELECT MAX(datetime) FROM {{ this }}) < "createdTime"
{% endif %}
Things worth keeping in mind:
- Design the model so it never produces duplicates no matter how often it runs.
- Core-layer tables benefit from carrying both a
dateand adatetimecolumn.
4. Decouple source naming with sources.yaml
Use sources.yaml to detach analytics-friendly names from the operational database’s naming conventions:
sources:
- name: users
description: "User-related tables"
tables:
- name: fct_events
identifier: ...
description: Event table
tags: ["source", "users", "fact"]
Reference it downstream as: {{ source('users', 'fct_events') }}
5. Sketch dependencies early — by hand
Early in the data-warehouse planning process, it pays to sketch the dependency graph manually using something like draw.io, well before you let dbt auto-generate its documentation.

An early dependency sketch drawn in draw.io
6. Finish the Core Layer before touching the Mart Layer
Locking in Core Layer definitions and materialization strategies before starting Mart Layer development prevents downstream confusion. Otherwise you end up making “bottom-up” corrections that erode the consistency of the whole layered model.
7. Always test for duplicate rows
The single most important incremental-model test catches accidental duplicates:
WITH CTE AS (
SELECT a, b, c,
ROW_NUMBER() OVER (PARTITION BY a, b, c) AS row_num
FROM table
)
SELECT * FROM CTE WHERE row_num > 1
Closing thoughts
I learned dbt through a handful of Udemy courses and then applied it end-to-end on Wepin Workspace. Tool choice is always contextual, but as the sole data engineer / analytics engineer on the project, dbt was indispensable — particularly for documentation and for catching data quality issues quickly.
The build also left me curious about how the enterprise data warehouses behind products like Google Analytics, Amplitude, and Mixpanel are architected, and deepened my appreciation for the broader data engineering tooling ecosystem.
I’m planning to keep using dbt on company projects going forward — and to keep contributing to my teammates’ growth as we go.