GBase 8c SQL Optimization: Subplan Rewrite Optimization Case Studies

Cong Li - Jul 5 - - Dev Community

The GBase 8c distributed database architecture fully leverages the computational resources of each node, and its overall performance scales linearly with the number of nodes. To maximize performance and resource utilization in a distributed architecture, GBase 8c offers three distributed execution plans: FQS (Fast Query Shipping), Stream, and Remote-Query. Among these, FQS and Stream plans can be pushed down, meaning all Data Nodes (DNs) in the cluster participate in SQL execution.

Both execution plans fully utilize node resources. The difference is that the FQS plan sends the original statement directly from the Coordinator Node (CN) to all or part of the DNs, which execute the statement independently without data exchange. In contrast, the Stream plan generates an execution plan on the CN, which is then sent to the DNs. The DNs use Stream operators to exchange data during execution.

The Remote-Query plan is a compromise. When the first two plans cannot be generated, the CN creates a plan, sends part of the original statement to the DNs, which execute it independently, and then sends the results back to the CN to execute the remaining plan. This plan generally has poorer performance and is used less frequently.

Distributed databases often mean diverse application scenarios, massive data scales (PB-level), complex job logic, and long execution times, especially in the financial industry. The large data volumes and complex operations determine the extensive use of the Stream plan in production environments.

GBase 8c's Stream execution plan includes three Stream operators: gather, redistribute, and broadcast. The proper use of Stream operators makes large-scale data processing possible in a distributed architecture. However, these solutions often introduce new problems, making the optimization of Stream operators a crucial part of SQL optimization in GBase 8c. To optimize SQL performance in actual development, we often use the EXPLAIN command to analyze execution plans, striving to use the streaming execution plan in a distributed environment to enhance the utilization of distributed computational resources.

Below, we analyze two specific cases of subplan optimization in distributed execution plans.

Case 1

Original SQL

Image description

Execution Plan Analysis, Using EXPLAIN ANALYZE

Image description

Image description

The current SQL execution time is approximately 28 seconds. As highlighted in the diagram, the three subqueries in the SQL each use the pgxc execution plan, failing to effectively utilize the distributed streaming execution plan of GBase 8c, resulting in low efficiency. The three subplans correspond to the above execution plan, leading to inefficient execution. The key issue is eliminating the pgxc execution plan of the subplans. By modifying the SQL to use temporary tables and addressing the issue of the wm_contact system function not being pushable, we rewrite the SQL to use temporary tables as follows:

Image description

New Execution Plan Analysis

After applying EXPLAIN ANALYZE again, we see a changed execution plan:

Image description

Image description

The new execution plan eliminates the pgxc execution plans of the three subqueries and adopts the streaming execution plan. The execution time drops to approximately 600ms, significantly improving efficiency and demonstrating the advantages of distributed systems.

Case 2

Business development feedback indicated severe performance degradation compared to Oracle when executing the following SQL to update a large table. The execution time was on the order of hours:

  1. UPDATE zh_dhcplat.sp_org SET Subtype = NULL
  2. UPDATE zh_dhcplat.sp_org SET Subtype = (SELECT qylb FROM zh_cyzt.tpp_cyqyxx b WHERE a.obj_id = b.id) WHERE EXISTS (SELECT 1 FROM zh_cyzt.tpp_cyqyxx b WHERE a.obj_id = b.id)

During SQL2 execution, approximately 68,512 records were updated, but it took 3,686 seconds, roughly an hour, indicating very slow update speed. The environment is a three-shard GBase 8c distributed database cluster.

Image description

Execution Plan Analysis

Using EXPLAIN ANALYZE, we found the issue mainly caused by the subquery. After rewriting the SQL to use an UPDATE FROM approach, the subplan was eliminated, and the new execution plan showed that the update completed in less than 3 seconds:

Image description

Conclusion

The above cases illustrate the significant impact of execution plan changes on SQL execution performance in a distributed environment. These cases highlight inefficiencies caused by subqueries in GBase 8c's distributed environment and how rewriting SQL to eliminate subplans or pgxc execution plans can leverage the distributed streaming plan, enhancing SQL execution efficiency through multi-node concurrency. In GBase 8c development, it's crucial to use the EXPLAIN tool to analyze SQL execution plans to develop high-efficiency SQL.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .