Query Optimization by Using JOIN Instead of NOT IN

2024-08-13

“In this project, I optimized the incremental update strategy for the core_fct_events table to address query performance issues in our Enterprise Data Warehouse (EDW). By replacing an inefficient NOT IN clause with a LEFT JOIN, I streamlined the duplicate-data check and reduced overall orchestration time from 50 minutes to 2 minutes, an approximately 96% improvement. This significantly enhanced data processing efficiency and system resource utilization, strengthening service stability and scalability.”


Performance Summary
- Orchestration Time: 50 mins → 2 mins (96% ↓)

Table of Contents

  1. STAR Summary
  2. Situation
  3. Tasks
  4. Actions
  5. Results

1. STAR Summary

Situation

  • In our Enterprise Data Warehouse (EDW), the orchestration process of the ELT pipeline was taking significantly longer than expected. Specifically, there were performance issues during the update process of the core_fct_events table.

Tasks

  • The goal was to optimize the incremental strategy of the core_fct_events table to reduce the overall orchestration time. This would enable us to handle increasing data traffic more efficiently and enhance service reliability.

Actions

  • I replaced the NOT IN clause with a LEFT JOIN to effectively filter duplicate data while optimizing performance.

Results

  • Through query optimization, the total orchestration time was reduced from 50 minutes to 2 minutes, achieving approximately 96% performance improvement and significantly enhancing data processing efficiency.

2. Situation

  • In our Enterprise Data Warehouse (EDW), the orchestration process of the ELT pipeline was taking significantly longer than expected. Specifically, there were performance issues during the update process of the core_fct_events table.

Specific Situation

  • Our company operates an Enterprise Data Warehouse (EDW) environment to provide B2B BI services. Every midnight, we run a complex data transformation process over user event data, and this process was taking far longer than expected. In particular, the incremental update of core_fct_events, our key event table, consumed most of the orchestration time, and the resulting delay in data refresh risked degrading service quality.

3. Tasks

  • The goal was to optimize the incremental strategy of the core_fct_events table to reduce the overall orchestration time. This would enable us to handle increasing data traffic more efficiently and enhance service reliability.

Root Causes of the Problem

  • I identified three major issues in the update process of the core_fct_events table. The existing incremental model (a dbt model) is shown below, followed by the three issues.
    WITH
    CTE_src_events AS (
        SELECT
            DISTINCT
            datetime,
            app_id,
            user_id,
            event_name
        FROM
            src_events
        -- Incremental Strategy: Read rows with a datetime greater than the maximum datetime currently stored in the table.
        {% if is_incremental() %}
        WHERE
            (SELECT MAX(datetime) FROM {{ this }}) < datetime
        {% endif %}
    )
    SELECT
        *
    FROM
        CTE_src_events
    -- Incremental Strategy: Exclude data that already exists in the table. Do not insert those rows.
    {% if is_incremental() %}
    WHERE
        (datetime, app_id, user_id, event_name) NOT IN (SELECT datetime, app_id, user_id, event_name FROM {{ this }})
    {% endif %}
1. Large Data Volume
  • The core_fct_events table contained all user event log data, making the table size very large.
2. Presence of Duplicate Rows
  • Due to the existence of duplicate data in the source table itself, the DISTINCT keyword had to be used to remove duplicates.
3. Inefficient Duplicate Check Method
  • The existing query used a NOT IN clause to compare new data against the existing table, and this was the main cause of the performance bottleneck. Because NOT IN must check each candidate row against the entire subquery result, and its three-valued NULL semantics prevent many optimizers from rewriting it into an efficient anti-join, execution time degrades sharply as the table grows; the short example after this list shows the NULL pitfall concretely.
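The NULL behavior is worth seeing concretely, since it is both a correctness trap and the reason optimizers struggle with NOT IN. A minimal, standalone illustration (PostgreSQL-style syntax assumed; not part of the original pipeline):

    -- x <> NULL evaluates to UNKNOWN, so NOT IN rejects the row even
    -- though 1 never appears in the subquery:
    SELECT x
    FROM (VALUES (1)) AS t(x)
    WHERE x NOT IN (SELECT v FROM (VALUES (2), (NULL)) AS s(v));
    -- Returns 0 rows. A LEFT JOIN ... IS NULL anti-join would keep the row.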

4. Actions

  • I replaced the NOT IN clause with a LEFT JOIN to effectively filter duplicate data while optimizing performance.

Specific Actions Taken

1. Problem Analysis and Alternative Exploration
  • First, I confirmed that the NOT IN clause was the primary cause of the performance bottleneck. The engine must evaluate the subquery result against every candidate row, and because NOT IN returns UNKNOWN whenever the subquery contains a NULL, optimizers often cannot convert it into a hash anti-join; on large tables it effectively degenerates into nested-loop comparisons.

2. Query Refactoring
  • I replaced the existing NOT IN clause with a LEFT JOIN, which lets the engine compare the existing table and the new data in a single join pass. After the LEFT JOIN, rows with no match in the existing table carry NULLs in the joined columns, so filtering on IS NULL keeps exactly the new rows to insert.
    SELECT
        MAIN.*
    FROM 
        CTE_src_events MAIN
    -- Incremental Strategy: Exclude data that already exists in the table. Do not insert those rows.
    {% if is_incremental() %}
    LEFT JOIN
        {{ this }} THIS
        ON MAIN.datetime = THIS.datetime
        AND MAIN.app_id = THIS.app_id
        AND MAIN.user_id = THIS.user_id
        AND MAIN.event_name = THIS.event_name
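    -- Anti-join filter: unmatched rows carry NULLs in the THIS columns, so keep only those.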
    WHERE
        THIS.datetime IS NULL    
    {% endif %}
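For reference, a correlated NOT EXISTS subquery is the other standard anti-join formulation; most engines plan it the same way as the LEFT JOIN ... IS NULL pattern while handling NULLs safely. This is a sketch of how the same filter could be expressed, not the variant shipped in this project:

    SELECT
        MAIN.*
    FROM
        CTE_src_events MAIN
    {% if is_incremental() %}
    WHERE NOT EXISTS (
        SELECT 1
        FROM {{ this }} THIS
        WHERE THIS.datetime = MAIN.datetime
            AND THIS.app_id = MAIN.app_id
            AND THIS.user_id = MAIN.user_id
            AND THIS.event_name = MAIN.event_name
    )
    {% endif %}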
3. Performance Testing and Validation
  • After modifying the query, I ran performance tests against several data sets and confirmed a significant reduction in execution time: the optimized query finished in approximately 2 minutes instead of 50. One way to verify the improved plan is sketched below.
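The post does not name the warehouse engine, so as a hedged sketch: on a PostgreSQL-compatible engine, one could run dbt compile and profile the rendered SQL from target/compiled/ with EXPLAIN ANALYZE. Table names below are illustrative, not from the original project:

    -- Illustrative plan check (PostgreSQL-style; names are hypothetical).
    EXPLAIN ANALYZE
    SELECT m.*
    FROM stg_src_events m
    LEFT JOIN core_fct_events t
        ON  m.datetime   = t.datetime
        AND m.app_id     = t.app_id
        AND m.user_id    = t.user_id
        AND m.event_name = t.event_name
    WHERE t.datetime IS NULL;
    -- Expect a hash join (left or anti) rather than a row-by-row subplan.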

5. Results

  • Through query optimization, the total orchestration time was reduced from 50 minutes to 2 minutes, achieving approximately 96% performance improvement and significantly enhancing data processing efficiency.

1. Performance Improvement

  • The total orchestration time was reduced from 50 minutes to 2 minutes, an approximately 96% performance improvement that drastically increased data processing speed.

2. Improved Resource Efficiency

  • Efficient use of database resources reduced system load, allowing other queries and tasks to execute more smoothly.

3. Enhanced Service Reliability

  • Faster and more reliable data updates provided a more dependable service to users.

4. Future Scalability Secured

  • This optimization, undertaken in anticipation of growing traffic and data volume, lays a foundation for meeting future data processing requirements with ease.

Conclusion

  • This query optimization project significantly enhanced the performance of our data warehouse while also strengthening our analytics engineering capabilities. It was a valuable contribution to maximizing data processing efficiency and improving the quality of our BI services.

Published by Joshua Kim