How We Optimized BigQuery Partition Filtering
Table of contents
- Background
- Scenario design
- Initial approach
- What went wrong
- New approach
- Closing thoughts
- Postscript
1. Background
At IOTRUST we operate the D’CENT Wallet app, and like most product teams we collect app usage data through GA4 and stream it into BigQuery. On top of that warehouse we run:
- BI dashboards — Redash
- Core-metric alerts — Slack via the Slack API
- Ad-hoc analysis — Python and DBeaver clients

D’CENT app data pipeline
Early on, every team was reading the GA4 Export source tables directly to build dashboards and run analyses. That approach quickly started to creak:
- The source tables kept growing, so per-query scan cost rose accordingly.
- The schema is full of nested-repeated columns (
event_params,user_properties,items), which raised the bar for non-engineers to query it. - We wanted to extend the warehouse with text-to-SQL features later on, which only works well when the underlying data mart has clean, well-defined tables.
With those problems in mind, the first step was to use dbt to build out the Core Layer of the warehouse.

Author’s illustration
2. Scenario design
While designing the central Fact table for all event logs — fct_events — we ran into an unexpected wall: the update cadence of the GA4 Export source tables is irregular.
There are two types of tables to deal with:
events_YYYYMMDD— daily-partitioned event logsevents_intraday_YYYYMMDD— real-time-streamed intraday event logs

Author’s illustration
Both tables effectively take about 3 days to settle into their final shape:
events_YYYYMMDDis typically created 1–2 days after the actual event date, and additional rows trickle in for another day after that before the partition is “done”.events_intraday_YYYYMMDDis created in real time, but it is missing some fields (notably UTM parameters), so the rows ultimately have to be re-migrated into the correspondingevents_YYYYMMDDpartition for full fidelity.
You can see this in the __TABLES__ metadata:
SELECT
table_id,
FORMAT_TIMESTAMP('%F %T', TIMESTAMP_MILLIS(creation_time)) AS creation_time,
FORMAT_TIMESTAMP('%F %T', TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_time
FROM
`project_id.dataset_id.__TABLES__`
WHERE
STARTS_WITH(table_id, 'events_') = TRUE
OR STARTS_WITH(table_id, 'events_intraday_') = TRUE
ORDER BY
table_id DESC

Each D’CENT
datepartition table keeps receiving updates for 2–3 days after creation.
Because the value of data drops sharply when it lags, we needed a warehouse with no 3-day delay — every date partition should reflect both the initial creation and any later modifications as soon as they happen.
So we designed the following scenario:
- New
datepartitions should be picked up by orchestration in real time. - Existing
datepartitions should also be re-processed whenever they receive a late update.

Author’s illustration
In other words: dynamically detect creation and updates of each date partition in real time, and reflect those changes across every downstream warehouse table.
3. Initial approach
Our first instinct was to solve this entirely inside BigQuery, using dbt.
(1) stg_ga4_tables.sql — an Ephemeral model that queries __TABLES__ to get the latest metadata for every partition:
SELECT
table_id,
RIGHT(table_id, 8) AS table_suffix,
CASE
WHEN CONTAINS_SUBSTR(table_id, 'intraday') THEN 'intraday'
ELSE 'daily'
END AS table_type,
creation_time AS creation_timestamp_millis,
last_modified_time AS last_modified_timestamp_millis
FROM
`project_id.dataset_id.__TABLES__`
WHERE
STARTS_WITH(table_id, 'events_') = TRUE
OR STARTS_WITH(table_id, 'events_intraday_') = TRUE
(2) fct_events.sql — an Incremental model that dynamically filters source-table partitions using the metadata above:
-- depends_on: {{ ref('stg_ga4_tables') }}
SELECT
...
FROM
`project_id.dataset_id.events_*`
WHERE
...
{% if var("last_run_timestamp_millis", None) is not none %}
AND _table_suffix IN (
SELECT
table_suffix
FROM
{{ ref('stg_ga4_tables') }}
WHERE
{{ var("last_run_timestamp_millis") }} < last_modified_timestamp_millis
AND table_type = 'daily'
)
{% endif %}
The idea: compare the dbt variable last_run_timestamp_millis (set at orchestration time) against last_modified_timestamp_millis from the metadata table, and pull only the partitions that have been touched since the last run — so we never have to do a full scan.
4. What went wrong
It didn’t work — and not in a small way. The symptom was severe: “Partition filtering was not being applied as intended, so every orchestration run scanned the entire source table and racked up a huge query bill.”
If you read the compiled BigQuery SQL, it looks logically correct. But the partition filter is silently a no-op:
SELECT
...
FROM
`project_id.dataset_id.events_*`
WHERE
_table_suffix IN (
SELECT
table_suffix
FROM (
SELECT
table_id,
RIGHT(table_id, 8) AS table_suffix,
CASE
WHEN CONTAINS_SUBSTR(table_id, 'intraday') THEN 'intraday'
ELSE 'daily'
END AS table_type,
creation_time AS creation_timestamp_millis,
last_modified_time AS last_modified_timestamp_millis
FROM
`project_id.dataset_id.__TABLES__`
WHERE
STARTS_WITH(table_id, 'events_') = TRUE
OR STARTS_WITH(table_id, 'events_intraday_') = TRUE
)
WHERE
1725189555000 < last_modified_timestamp_millis
AND table_type = 'daily'
)

Testing on the BigQuery Public Datasets GA4 Export sample shows the partition filter is not applied.
Why? Because the approach violated BigQuery’s partition pruning rules.
Partition pruning is the Google Cloud mechanism that lets BigQuery skip unnecessary partitions and avoid charging for them. But per the official documentation:
“Do not use partition filters as dynamic expressions.”
“The following query does not prune partitions because the
WHERE t1.ts = (SELECT timestamp from table where key = 2)filter is not a constant expression.”
In other words, BigQuery decides whether it can prune partitions at Dry Run time — and at Dry Run time, the partition filter needs to be a constant expression. Our original design pushed partition management into a subquery so it could be “dynamic”, and that’s exactly the shape BigQuery refuses to prune.
5. New approach
Instead of trying to manage the partition list entirely inside BigQuery, we moved that responsibility out to the orchestration layer (Airflow) and fed BigQuery a static list of partition suffixes through a dbt variable.
(1) dag.py — the Airflow DAG’s main() queries the metadata table and serializes the result into the environment variable table_suffixes.
For example: table_suffixes = "20240901,20240902,20240903"
...
query = f"""
WITH
CTE_tables AS (
SELECT
table_id,
RIGHT(table_id, 8) AS table_suffix,
CASE
WHEN CONTAINS_SUBSTR(table_id, 'intraday') THEN 'intraday'
ELSE 'daily'
END AS table_type,
creation_time AS creation_timestamp_millis,
last_modified_time AS last_modified_timestamp_millis
FROM
`{bigquery_table_id}`
WHERE
STARTS_WITH(table_id, 'events_') = TRUE
OR STARTS_WITH(table_id, 'events_intraday_') = TRUE
)
SELECT
table_suffix
FROM
CTE_tables
WHERE
{last_run_timestamp_millis} < last_modified_timestamp_millis
"""
df = bigquery_client.query(query).to_dataframe()
table_suffixes = df.table_suffix.tolist()
if table_suffixes:
table_suffixes = ','.join(table_suffixes)
...
(2) fct_events.sql — _table_suffix is now compared directly against the table_suffixes variable, which arrives as a literal string at compile time:
-- depends_on: {{ ref('stg_ga4_tables') }}
SELECT
...
FROM
`project_id.dataset_id.events_*`
WHERE
...
{% if var("table_suffixes", None) is not none %}
AND _table_suffix IN ('{{ var("table_suffixes") }}')
{% endif %}
Now the partition filter compiles into a constant expression, BigQuery can prune partitions at Dry Run time as expected, and the full scans stop — giving us the cost optimization we were after.
6. Closing thoughts
Before this exercise, I had been using BigQuery’s partitioning features passively — turning them on, trusting that things would be cheap, and never really learning how partition pruning actually decides what to skip. Building out the warehouse forced me to confront two things I’d taken for granted:
- BigQuery makes the partition-pruning decision at Dry Run time (i.e. before the query actually runs).
- For pruning to fire, the partition filter has to satisfy specific shape requirements — chief among them, it must be a constant expression.
There’s a particular kind of learning that only comes from running into a wall and digging your way out of it. That hands-on debugging cycle made me far more comfortable inside BigQuery and, as a side effect, made our whole data platform meaningfully more efficient.
Thanks for reading!

Data Analytics Survival community KakaoTalk open-chat room
7. Postscript
A big thank-you to the members of the “데이터 분석으로 생존하기” (Surviving via Data Analytics) community KakaoTalk room, who tossed ideas back and forth with me while I was working through this problem.

Data Analytics Survival community KakaoTalk open-chat room