IP Address-Country Mapping Query Optimization
2024-05-19
“Optimized the process of mapping user country information using IP addresses in a data warehouse. The previous method was inefficient due to long processing times, but the new approach reduced query execution time by 90%.”
Performance Summary |
---|
- Query Execution Time: 100 mins → 10 mins |
Table of Contents
- STAR Summary
- Situation
- Tasks
- Actions
- Results
1. STAR Summary
Situation
- In an environment operating a global service, it was necessary to map user country information based on connection IP addresses. This task was performed using PostgreSQL, but the previous approach resulted in performance degradation. The existing query took too long to execute, placing a significant burden on data warehouse operations.
Tasks
- The goal was to optimize the existing IP address mapping query to drastically reduce processing time. This optimization aimed to improve the efficiency of the transformation process, achieving better data warehouse performance. Specifically, the core task was to optimize the computational process for IP address mapping and enhance the efficiency of the JOIN conditions.
Actions
- Analysis of the Existing Approach
- The existing query used the
<<=
operator to map IP addresses to CIDR networks, identified as the main cause of performance degradation.
- The existing query used the
- Creation of a New Table
- Created a new table,
dim_ips_countries
, by processing the existing table.- This table includes
start_ip
andend_ip
columns. - Converted IP addresses to BIGINT type to improve the efficiency of comparison operations.
- This table includes
- Created a new table,
- Index Creation
- Created indexes on the
start_ip
andend_ip
columns to maximize search performance.
- Created indexes on the
- Query Optimization
- Replaced the existing
<<=
operator with theBETWEEN
operator to simplify IP address comparison and restructure the query for lightweight operations.
- Replaced the existing
Results
- The optimized query reduced execution time by 90%, significantly improving data warehouse performance.
2. Situation
- In an environment operating a global service, it was necessary to map user country information based on connection IP addresses. This task was performed using PostgreSQL, but the previous approach resulted in performance degradation. The existing query took too long to execute, placing a significant burden on data warehouse operations.
Problem Summary
- In an environment operating a global service, it was necessary to map IP addresses collected from user connections to geographical information such as the country. For this, a table mapping IP addresses to country names was created using PostgreSQL’s CIDR operator, but the existing method was highly inefficient.
Specific Problem Context
- The core task was to map the
session_ip
column in thesrc_sessions
table to thecidr
column in thesrc_cidrs_countries
table to create thefct_sessions
table. - The existing query used the
<<=
operator to determine whether an IP address was included in a specific CIDR network. However, this method caused performance issues during large-scale data processing, with query execution times being excessively long. This degraded data warehouse performance and caused severe operational disruptions.
3. Tasks
- The goal was to optimize the existing IP address mapping query to drastically reduce processing time. This optimization aimed to improve the efficiency of the transformation process, achieving better data warehouse performance. Specifically, the core task was to optimize the computational process for IP address mapping and enhance the efficiency of the JOIN conditions.
Assigned Tasks
- The main objective was to significantly reduce the inefficient query execution time.
1. Improve Data Processing Speed
- Reduce execution time to enhance data processing efficiency and prevent disruption to service operations.
2. Query Optimization
- Design and implement an optimized query structure that can perform IP address and country information mapping more efficiently.
3. System Performance Improvement
- Improve overall data warehouse performance to establish a foundation that can handle future data expansion and growth.
4. Actions
- Analysis of the Existing Approach
- The existing query used the
<<=
operator to map IP addresses to CIDR networks, identified as the main cause of performance degradation.- Creation of a New Table
- Created a new table,
dim_ips_countries
, by processing the existing table.
- This table includes
start_ip
andend_ip
columns.- Converted IP addresses to BIGINT type to improve the efficiency of comparison operations.
- Index Creation
- Created indexes on the
start_ip
andend_ip
columns to maximize search performance.- Query Optimization
- Replaced the existing
<<=
operator with theBETWEEN
operator to simplify IP address comparison and restructure the query for lightweight operations.
1. Analysis of the Existing Approach
(STEP 1) Create an index on the src_cidrs_countries
table.
- Since the
cidr
column needs to be frequently scanned during the JOIN operation with thesrc_sessions
table, an index was created on this column.View code
CREATE INDEX idx_cidr ON src_cidrs_countries (cidr);
(STEP 2) Created the fct_sessions
table by joining the src_cidrs_countries
table with the src_sessions
table.
- The
<<=
operator was used in the JOIN process. - However, this operator, which compares CIDR types, had a high computational cost, causing performance degradation during large-scale data processing.
View code
CREATE TABLE fct_sessions AS SELECT S.session_id, S.user_id, C.country FROM src_sessions S LEFT JOIN src_cidrs_countries C ON S.session_ip::INET <<= C.cidr;
2. Creation of a New Table
- A new table,
dim_ips_countries
, was created by processing the existingsrc_cidrs_countries
table. This table was newly designed to minimize CIDR operations, addingstart_ip
andend_ip
columns representing the IP address range for each CIDR value. - By converting IP addresses to BIGINT type for storage, computation speed was greatly improved.
start_ip
andend_ip
represent the lowest and highest IPs within the CIDR range, simplifying the process of verifying IP addresses within a CIDR network.
View code
CREATE TABLE dim_ips_countries AS
SELECT
cidr,
('x' ||
LPAD(TO_HEX((SPLIT_PART(HOST(cidr), '.', 1)::INTEGER)), 2, '0') ||
LPAD(TO_HEX((SPLIT_PART(HOST(cidr), '.', 2)::INTEGER)), 2, '0') ||
LPAD(TO_HEX((SPLIT_PART(HOST(cidr), '.', 3)::INTEGER)), 2, '0') ||
LPAD(TO_HEX((SPLIT_PART(HOST(cidr), '.', 4)::INTEGER)), 2, '0')
)::BIT(32)::BIGINT AS start_ip,
('x' ||
LPAD(TO_HEX((SPLIT_PART(HOST(BROADCAST(cidr)), '.', 1)::INTEGER)), 2, '0') ||
LPAD(TO_HEX((SPLIT_PART(HOST(BROADCAST(cidr)), '.', 2)::INTEGER)), 2, '0') ||
LPAD(TO_HEX((SPLIT_PART(HOST(BROADCAST(cidr)), '.', 3)::INTEGER)), 2, '0') ||
LPAD(TO_HEX((SPLIT_PART(HOST(BROADCAST(cidr)), '.', 4)::INTEGER)), 2, '0')
)::BIT(32)::BIGINT AS end_ip,
country
FROM
src_cidrs_countries;
3. Index Creation
- Since the
start_ip
andend_ip
columns need to be frequently scanned during the JOIN operation with thesrc_sessions
table, an index was created on these two columns. - This index was designed to efficiently find values within an IP address range, enabling quick searches during JOIN operations.
View code
CREATE INDEX idx_ip_range ON dim_ips_countries (start_ip, end_ip);
4. Query Optimization
- The existing
<<=
operator was replaced with theBETWEEN
operator to simplify IP address comparison and restructure the query for lightweight operations. - The
session_ip
column in thesrc_sessions
table was also converted from CIDR to IP address, then to BIGINT type, to be compared with thestart_ip
andend_ip
columns in thedim_ips_countries
table.View code
CREATE TABLE fct_sessions AS SELECT S.session_id, S.user_id, C.country FROM ( SELECT session_id, user_id, ('x' || LPAD(TO_HEX((SPLIT_PART(HOST(session_ip::INET), '.', 1)::INTEGER)), 2, '0') || LPAD(TO_HEX((SPLIT_PART(HOST(session_ip::INET), '.', 2)::INTEGER)), 2, '0') || LPAD(TO_HEX((SPLIT_PART(HOST(session_ip::INET), '.', 3)::INTEGER)), 2, '0') || LPAD(TO_HEX((SPLIT_PART(HOST(session_ip::INET), '.', 4)::INTEGER)), 2, '0') )::BIT(32)::BIGINT AS session_ip FROM src_sessions ) S LEFT JOIN src_cidrs_countries C ON S.session_ip BETWEEN C.start_ip AND C.end_ip;
5. Results
- The optimized query reduced execution time by 90%, significantly improving data warehouse performance.
1. Positive Results
- The optimized approach reduced query execution time from approximately 100x to 10x, effectively cutting execution time by around 90%. This significantly improved data processing speed and system performance. The new approach also maintained stable performance during large-scale data processing and established a foundation that could flexibly respond to future data growth.
- This optimization effort maximized the efficiency of the data warehouse, reducing operational costs and enabling better data analysis and service delivery. As a result, the overall system performance was greatly enhanced, making a significant contribution to the company’s data operation strategy.
2. Lessons Learned
- As shown in the figure below, SQL JOINs involve a nested loop search process, which needs to be carefully considered.
(TIP 1) Columns used in JOIN conditions should have lightweight data types.
- In the existing approach, the
cidr
column was of the CIDR type, but in the new approach, it was parsed into BIGINT to make the data type lighter.
(TIP 2) Operators used in JOIN conditions should have lightweight processes.
- In the existing approach, the heavier
<<=
operator was used, but in the new approach, theBETWEEN
operator was used to reduce the burden.