Identifying and Modeling Sessions from an Event Table

“I want to share how I built the dim__sessions model — a solution to the problem of having no session ID in the Mixpanel event table, designed so that session-based analyses can be performed efficiently.”


Table of Contents

  1. Problem Background
  2. A Session Is a Dimension
  3. Defining a Session
  4. Starting the Modeling
  5. Applying an Incremental Strategy & Guaranteeing Idempotency
  6. Example Uses of the Model
  7. Ideas for Improvement

1. Problem Background

At my company we analyze user behavior data collected through Mixpanel by exporting it into BigQuery. But as with most product analytics tools, the Mixpanel table has no Session ID field, which made it hard to serve session-based metrics efficiently. Handling requirements like the ones below was costing a meaningful amount of time and money.

  • “Please tell me the average session duration per user per day.”
  • “Please tell me the average duration of users’ first-visit sessions.”

To answer these requests we were leaning directly on the full event table, and because each model had subtly different semantics, it was unclear which definition produced the ‘correct’ metric. My team lead was aware of the same pain, so I landed on the following conclusion:

  • “First, let’s build an intermediate table that extracts session information from the full event table.”

2. A Session Is a Dimension

From a dimensional modeling perspective, there are several ways to look at a session.

  • Transactional Fact: treat the session itself as an event.
  • Periodic Snapshot Fact: treat the session as an aggregate summary of events.
  • Dimension: treat the session as a set of attributes about events.

I chose to manage sessions as a Dimension table, for the following reasons.

  • A session is not an individual event like a pageview, click, or purchase — it is a container of attributes.
  • Each session is not “re-occurring,” but it does carry a set of properties.

3. Defining a Session

Before modeling the session dimension table, I started by defining what a session actually is. My team lead — who has deep experience with session definitions — helped me lock in the following premise.

Session: a group of events that satisfies all three of the following conditions (1) The events were generated by the same user ID. (2) The events occurred on the same date. (3) The gap between any two consecutive events does not exceed 30 minutes.

In addition, for each session’s dimension attributes — country of access, device, OS, etc. — I decided to store the value first observed during that session. In other words, this is SCD (Slowly Changing Dimension) Type 0: keep the attribute value captured when the session began and never overwrite it.

SCD Type 0 is a strategy that never overwrites the original value even if the attribute changes — the first recorded value is preserved as-is. This keeps attributes consistent for within-session analysis, but it does mean any mid-session attribute change is not reflected. (For the latest dimension state, the separate SCD Type 1 table **dim__users** is available.)

4. Starting the Modeling

Based on the premise above, I sketched the schema for dim__sessions as follows.

Then I started writing the modeling query.

(1) Load the full event table.

  • During this step I also labeled each row with the previous event time and the session threshold (30 minutes).
{% set interval = 'day' %}
{% set session_threshold_secs = 1800 %}

with

fact__events as (
  select
    date,
    time,
    user_id,
    country,
    device,
    os,
    lag(date, 1) over (partition by user_id order by time) as prev_date,
    lag(time, 1) over (partition by user_id order by time) as prev_time,
    {{ session_threshold_secs }} as session_threshold_secs
  from
    {{ ref('fact__all_events') }}
),

(2) Assign a per-user session ID that increments by 1.

  • prev_time is null: treat as a new session.
  • date != prev_date: treat as a new session.
  • prev_time is not null AND gap exceeds the session threshold: treat as a new session.
  • Otherwise: treat as a continuation of the existing session.
fact__events_with_session_id as (
  select
    *,
    sum(
      case
        when prev_time is null then 1
        when date != prev_date then 1
        when timestamp_diff(time, prev_time, second) > session_threshold_secs then 1
        else 0
      end
    ) over (
      partition by user_id
      order by time
      rows between unbounded preceding and current row
    ) as session_id
  from
    fact__events
),

(3) Summarize information per session ID.

  • Following the SCD Type 0 approach, extract the first dimension value seen in each session (via first_value).
dim__sessions as (
  select
    -- Identifier
    user_id,
    session_id,
    -- Datetime
    first_value(date) over w as start_date,
    last_value(date) over w as end_date,
    first_value(time) over w as start_time,
    last_value(time) over w as end_time,
    timestamp_diff(
      last_value(time) over w,
      first_value(time) over w,
      second
    ) as session_duration_secs,
    -- SCD Type 0
    first_value(country ignore nulls) as country,
    first_value(device ignore nulls) as device,
    first_value(os ignore nulls) as os
  from
    fact__events_with_session_id
  qualify
    row_number() over (
      partition by user_id, session_id
      order by time
    ) = 1
  window w as (
    partition by user_id, session_id
    order by time
    rows between unbounded preceding and unbounded following
  )
)

select * from dim__sessions

5. Applying an Incremental Strategy & Guaranteeing Idempotency

The query above works fine if you full-scan the entire event table on every run. At scale, though, that becomes brutal in both time and cost. So I wanted to harden the model along two axes:

  • Make it work well under an incremental strategy as well (model economy).
  • Guarantee identical results even if the batch runs hundreds of times a day (model idempotency).

5.1. Make it work well under an incremental strategy (economy)

(1) Model configuration

  • I chose start_date (session start date) as the partition field. It’s the most natural target for incremental processing, and it’s also the column users filter on most often when querying.
  • I chose user_id as the clustering key. This should speed up future joins against dim__users.
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {
      "field": "start_date",
      "data_type": "date",
      "granularity": "day"
    },
    require_partition_filter = true,
    on_schema_change = 'append_new_columns',
    cluster_by = ["user_id"],
  )
}}

(2) Applying the incremental strategy

  • I wrote a separate dbt macro, incremental_partition_filter, that scans only the most recent 7 days of partitions.
  • The Mixpanel event table behaves like an Accumulated Snapshot Fact table — rows keep getting updated over time — so I let each batch re-process the last 7 days.
{% set interval = 'day' %}
{% set session_threshold_secs = 1800 %}

with

fact__events as (
  select
    ...
  from
    {{ ref('fact__all_events') }}
  where true
    {{ incremental_partition_filter('date', 7, 1, interval) }}
),

fact__events_with_session_id as (
  ...
),

dim__sessions as (
  ...
)

select * from dim__sessions

5.2. Guarantee identical results across hundreds of daily batch runs (idempotency)

Under the incremental strategy above, each user’s session_id gets freshly assigned as 1, 2, 3, … on every batch run. In other words, the model is ignoring the session IDs that already exist in the this table.

1, 2, 3, 1, 2, 3, 4, 1, 2, ...

To fix this, I taught the model to take the existing session IDs into account.

(1) Look up each user’s existing max session ID in the this table.

{% if is_incremental() %}
existing_max_session_id as (
  select
    user_id,
    max(session_id) as max_session_id
  from
    {{ this }}
  group by
    all
),
{% endif %}

(2) Recompute the final session ID.

  • Add the existing per-user max session ID to the session ID computed for the incremental batch.
  • That way, new sessions get a continuous sequence of IDs that picks up right after the existing ones.
dim__sessions_with_final_session_id as (
  select
    {% if is_incremental() %}
    new_data.* replace (
      coalesce(old_data.max_session_id, 0) + new_data.session_id as session_id
    )
    {% else %}
    new_data.*
    {% endif %}
  from
    dim__sessions as new_data
    {% if is_incremental() %}
    left join existing_max_session_id as old_data
      using (user_id)
    {% endif %}
)

select * from dim__sessions_with_final_session_id

(3) For idempotency, apply a complementary partition filter when reading from this.

  • I wrote a separate dbt macro, not_incremental_partition_filter, for this.
  • It scans only the partitions that are the complement of the incremental window.
  • This guarantees that the session_id values do not drift even if the batch runs hundreds of times a day.

{% set interval = 'day' %}
{% set session_threshold_secs = 1800 %}

with

fact__events as (
  select
    ...
  from
    {{ ref('fact__all_events') }}
  where true
    {{ incremental_partition_filter('date', 7, 1, interval) }}
),

fact__events_with_session_id as (
  ...
),

dim__sessions as (
  ...
),

{% if is_incremental() %}
existing_max_session_id as (
  select
    ...
  from
    {{ this }}
  where true
    {{ not_incremental_partition_filter('date', 7, 1, interval) }}
  group by
    all
),
{% endif %}

dim__sessions_with_final_session_id as (
  ...
)

select * from dim__sessions_with_final_session_id

6. Example Uses of the Model

With the finished dim__sessions table, a variety of requirements can be answered with simple queries.

(1) “Please tell me the average session duration per user per day.”

select
  start_date,
  safe_divide(
    sum(session_duration_secs),
    count(distinct user_id)
  ) as avg_session_time
from
  dim__sessions
where true
  and kst_start_date between '2025-01-01' and '2025-12-31'
group by
  all
order by
  1

(2) “Please tell me the average duration of users’ first-visit sessions.”

select
  start_date,
  safe_divide(
    sum(session_duration_secs),
    count(session_id)
  ) as avg_session_time
from
  dim__sessions
where true
  and kst_start_date between '2025-01-01' and '2025-12-31'
  and session_id = 1
group by
  all
order by
  1

7. Ideas for Improvement

After shipping the model, I was able to deliver session and duration metrics across the org far more reliably than before. Still, there are a couple of things I personally feel could be improved.

(1) The this table full-scan problem

  • To guarantee the incrementality of session_id, the current model almost fully scans the this table on every run. I found that unsatisfying.
  • Maintaining each user’s max session_id in a separate bridge table would avoid the full scan and meaningfully improve performance.

(2) Adding a session_id column to the event table

  • In the future, data analysts may want to dig into user journeys directly off the event table.
  • For that case, adding a session_id column to the event table itself would make analysis much more flexible and convenient. (This would be necessary when building, for example, a fact__users_journey table.)