IP Address-Country Mapping Query Optimization

2024-05-19

“Optimized the process of mapping user country information using IP addresses in a data warehouse. The previous method was inefficient due to long processing times, but the new approach reduced query execution time by 90%.”

Performance Summary
- Query Execution Time: `100 mins` → `10 mins`

STAR Summary
Situation
Tasks
Actions
Results

1. STAR Summary

Situation

In an environment operating a global service, it was necessary to map user country information based on connection IP addresses. This task was performed using PostgreSQL, but the previous approach resulted in performance degradation. The existing query took too long to execute, placing a significant burden on data warehouse operations.

Tasks

The goal was to optimize the existing IP address mapping query to drastically reduce processing time. This optimization aimed to improve the efficiency of the transformation process, achieving better data warehouse performance. Specifically, the core task was to optimize the computational process for IP address mapping and enhance the efficiency of the JOIN conditions.

Actions

Analysis of the Existing Approach
- The existing query used the <<= operator to map IP addresses to CIDR networks, identified as the main cause of performance degradation.
Creation of a New Table
- Created a new table, dim_ips_countries, by processing the existing table.
  - This table includes start_ip and end_ip columns.
  - Converted IP addresses to BIGINT type to improve the efficiency of comparison operations.
Index Creation
- Created indexes on the start_ip and end_ip columns to maximize search performance.
Query Optimization
- Replaced the existing <<= operator with the BETWEEN operator to simplify IP address comparison and restructure the query for lightweight operations.

Results

The optimized query reduced execution time by 90%, significantly improving data warehouse performance.

2. Situation

In an environment operating a global service, it was necessary to map user country information based on connection IP addresses. This task was performed using PostgreSQL, but the previous approach resulted in performance degradation. The existing query took too long to execute, placing a significant burden on data warehouse operations.

Problem Summary

In an environment operating a global service, it was necessary to map IP addresses collected from user connections to geographical information such as the country. For this, a table mapping IP addresses to country names was created using PostgreSQL’s CIDR operator, but the existing method was highly inefficient.

Specific Problem Context

The core task was to map the session_ip column in the src_sessions table to the cidr column in the src_cidrs_countries table to create the fct_sessions table.
The existing query used the <<= operator to determine whether an IP address was included in a specific CIDR network. However, this method caused performance issues during large-scale data processing, with query execution times being excessively long. This degraded data warehouse performance and caused severe operational disruptions.

3. Tasks

The goal was to optimize the existing IP address mapping query to drastically reduce processing time. This optimization aimed to improve the efficiency of the transformation process, achieving better data warehouse performance. Specifically, the core task was to optimize the computational process for IP address mapping and enhance the efficiency of the JOIN conditions.

Assigned Tasks

The main objective was to significantly reduce the inefficient query execution time.

1. Improve Data Processing Speed

Reduce execution time to enhance data processing efficiency and prevent disruption to service operations.

2. Query Optimization

Design and implement an optimized query structure that can perform IP address and country information mapping more efficiently.

3. System Performance Improvement

Improve overall data warehouse performance to establish a foundation that can handle future data expansion and growth.

4. Actions

Analysis of the Existing Approach

The existing query used the <<= operator to map IP addresses to CIDR networks, identified as the main cause of performance degradation.

Creation of a New Table

Created a new table, dim_ips_countries, by processing the existing table.

This table includes start_ip and end_ip columns.

Converted IP addresses to BIGINT type to improve the efficiency of comparison operations.

Index Creation

Created indexes on the start_ip and end_ip columns to maximize search performance.

Query Optimization

Replaced the existing <<= operator with the BETWEEN operator to simplify IP address comparison and restructure the query for lightweight operations.

1. Analysis of the Existing Approach

(STEP 1) Create an index on the src_cidrs_countries table.

Since the cidr column needs to be frequently scanned during the JOIN operation with the src_sessions table, an index was created on this column.
View code
CREATE INDEX idx_cidr ON src_cidrs_countries (cidr);

(STEP 2) Created the fct_sessions table by joining the src_cidrs_countries table with the src_sessions table.

The <<= operator was used in the JOIN process.

However, this operator, which compares CIDR types, had a high computational cost, causing performance degradation during large-scale data processing.

View code

  CREATE TABLE fct_sessions AS
    SELECT
      S.session_id,
      S.user_id,
      C.country
    FROM
      src_sessions S
    LEFT JOIN
      src_cidrs_countries C
      ON S.session_ip::INET <<= C.cidr;

2. Creation of a New Table

A new table, dim_ips_countries, was created by processing the existing src_cidrs_countries table. This table was newly designed to minimize CIDR operations, adding start_ip and end_ip columns representing the IP address range for each CIDR value.
By converting IP addresses to BIGINT type for storage, computation speed was greatly improved. start_ip and end_ip represent the lowest and highest IPs within the CIDR range, simplifying the process of verifying IP addresses within a CIDR network.

View code

  CREATE TABLE dim_ips_countries AS
    SELECT
      cidr,
      ('x' || 
      LPAD(TO_HEX((SPLIT_PART(HOST(cidr), '.', 1)::INTEGER)), 2, '0') ||
      LPAD(TO_HEX((SPLIT_PART(HOST(cidr), '.', 2)::INTEGER)), 2, '0') ||
      LPAD(TO_HEX((SPLIT_PART(HOST(cidr), '.', 3)::INTEGER)), 2, '0') ||
      LPAD(TO_HEX((SPLIT_PART(HOST(cidr), '.', 4)::INTEGER)), 2, '0')
      )::BIT(32)::BIGINT AS start_ip,
      ('x' ||
      LPAD(TO_HEX((SPLIT_PART(HOST(BROADCAST(cidr)), '.', 1)::INTEGER)), 2, '0') ||
      LPAD(TO_HEX((SPLIT_PART(HOST(BROADCAST(cidr)), '.', 2)::INTEGER)), 2, '0') ||
      LPAD(TO_HEX((SPLIT_PART(HOST(BROADCAST(cidr)), '.', 3)::INTEGER)), 2, '0') ||
      LPAD(TO_HEX((SPLIT_PART(HOST(BROADCAST(cidr)), '.', 4)::INTEGER)), 2, '0')
      )::BIT(32)::BIGINT AS end_ip,				
      country
    FROM
      src_cidrs_countries;

3. Index Creation

Since the start_ip and end_ip columns need to be frequently scanned during the JOIN operation with the src_sessions table, an index was created on these two columns.
This index was designed to efficiently find values within an IP address range, enabling quick searches during JOIN operations.
View code
CREATE INDEX idx_ip_range ON dim_ips_countries (start_ip, end_ip);

4. Query Optimization

The existing <<= operator was replaced with the BETWEEN operator to simplify IP address comparison and restructure the query for lightweight operations.

The session_ip column in the src_sessions table was also converted from CIDR to IP address, then to BIGINT type, to be compared with the start_ip and end_ip columns in the dim_ips_countries table.

View code

  CREATE TABLE fct_sessions AS
    SELECT
    S.session_id,
    S.user_id,
    C.country
  FROM (
    SELECT
      session_id,
      user_id,
      ('x' ||
      LPAD(TO_HEX((SPLIT_PART(HOST(session_ip::INET), '.', 1)::INTEGER)), 2, '0') ||
      LPAD(TO_HEX((SPLIT_PART(HOST(session_ip::INET), '.', 2)::INTEGER)), 2, '0') ||
      LPAD(TO_HEX((SPLIT_PART(HOST(session_ip::INET), '.', 3)::INTEGER)), 2, '0') ||
      LPAD(TO_HEX((SPLIT_PART(HOST(session_ip::INET), '.', 4)::INTEGER)), 2, '0')                
      )::BIT(32)::BIGINT AS session_ip
    FROM
      src_sessions
  ) S
  LEFT JOIN
    src_cidrs_countries C
    ON S.session_ip BETWEEN C.start_ip AND C.end_ip;

5. Results

The optimized query reduced execution time by 90%, significantly improving data warehouse performance.

1. Positive Results

The optimized approach reduced query execution time from approximately 100x to 10x, effectively cutting execution time by around 90%. This significantly improved data processing speed and system performance. The new approach also maintained stable performance during large-scale data processing and established a foundation that could flexibly respond to future data growth.
This optimization effort maximized the efficiency of the data warehouse, reducing operational costs and enabling better data analysis and service delivery. As a result, the overall system performance was greatly enhanced, making a significant contribution to the company’s data operation strategy.

2. Lessons Learned

As shown in the figure below, SQL JOINs involve a nested loop search process, which needs to be carefully considered.

(TIP 1) Columns used in JOIN conditions should have lightweight data types.

In the existing approach, the cidr column was of the CIDR type, but in the new approach, it was parsed into BIGINT to make the data type lighter.

(TIP 2) Operators used in JOIN conditions should have lightweight processes.

In the existing approach, the heavier <<= operator was used, but in the new approach, the BETWEEN operator was used to reduce the burden.

Published by Joshua Kim

 데이터 분석가의 SQL 최적화 일기: 코호트 리텐션 Batch Query 만들기 IP 주소-국가 매핑 쿼리 최적화 

IP Address-Country Mapping Query Optimization

Table of Contents

1. STAR Summary

Situation

Tasks

Actions

Results

2. Situation

Problem Summary

Specific Problem Context

3. Tasks

Assigned Tasks

1. Improve Data Processing Speed

2. Query Optimization

3. System Performance Improvement

4. Actions

1. Analysis of the Existing Approach

2. Creation of a New Table

3. Index Creation

4. Query Optimization

5. Results

1. Positive Results

2. Lessons Learned

(TIP 1) Columns used in JOIN conditions should have lightweight data types.

(TIP 2) Operators used in JOIN conditions should have lightweight processes.

Published by Joshua Kim