Category: SPL High-performance algorithm
The impasse of SQL performance optimization
Many big data calculations are implemented in SQL. When they run slowly, we have to optimize the SQL, but we often encounter situations where there is nothing we can do about it.
For example, a stored procedure may contain three statements, roughly like the following, that execute very slowly:
Performance Optimization Skill: First-half Ordered Sorting
I. Problem introduction & solution
When sorting a data set, the following situation sometimes occurs: the data set T is already ordered by field “a” but not by field “b”, and we want to sort T by fields “a”, “b”. We call this first-half ordered sorting (“a”, the first half of the sort key, is already ordered). For this case we have an optimization technique to speed up the sorting: retrieve from T one group of records with the same “a” value and sort the group by field “b”; then retrieve the next group with the same “a” value and sort it by “b”; and continue until all records in T have been retrieved and sorted. With this method we don’t need to sort all the records of T at once; only one group is retrieved at a time, so the memory only needs to hold a small group of records.
Unfortunately, SQL doesn’t support this algorithm and always requires a full sort over all the records, while SPL does support it. So we’ll test the approach and compare it with sorting in Oracle.
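The idea can be sketched in a few lines of Python (an illustration of the algorithm, not the SPL implementation; records_ordered_by_a is a hypothetical iterator standing in for a cursor that returns records already ordered by “a”):

from itertools import groupby
from operator import itemgetter

def first_half_ordered_sort(records_ordered_by_a):
    """records_ordered_by_a: an iterable of (a, b, ...) tuples already ordered by a.
    Yields the records ordered by (a, b) without sorting the whole set at once."""
    # groupby relies on the input already being ordered by a,
    # so each group it produces holds all records sharing one a value.
    for _, group in groupby(records_ordered_by_a, key=itemgetter(0)):
        # Only this small group is held in memory and sorted by b.
        yield from sorted(group, key=itemgetter(1))

# Usage sketch: data is ordered by a but not by b.
data = [(1, 3), (1, 1), (1, 2), (2, 9), (2, 5)]
print(list(first_half_ordered_sort(iter(data))))
# -> [(1, 1), (1, 2), (1, 3), (2, 5), (2, 9)]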
Performance Optimization Skill: Second-half Ordered Grouping
I. Problem introduction & solution
First of all, what is second-half ordered grouping? Suppose data set T is already ordered by fields “a” and “b”, and we want to sort or group T by field “b”. Within each segment that shares the same “a” value, field “b” is already ordered. In other words, the sorting/grouping field is ordered within each segment defined by the first-half field, and we call this scenario second-half ordered grouping.
As we know, the quick-sort algorithm partitions the data and then sorts the partitions recursively. For a data set whose second-half fields are ordered in this way, quick-sort already runs very fast. So we can first sort the data set T by field “b” with quick-sort, and then optimize the grouping with the technique introduced in Performance Optimization Skill: Ordered Grouping.
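One way to exploit the second-half order, sketched in Python below (an illustration only, not necessarily the exact method used in SPL): each “a” segment is treated as a run that is already ordered by “b”; merging the runs yields the records in “b” order, after which grouping by “b” reduces to the ordered grouping described in the article referenced above.

import heapq
from itertools import groupby
from operator import itemgetter

def group_by_b(records_ordered_by_a_b):
    """records_ordered_by_a_b: iterable of (a, b) tuples ordered by (a, b).
    Groups by b by merging the per-a runs, each already ordered by b."""
    # Split the data into runs: one run per a value, each run ordered by b.
    runs = [list(run) for _, run in groupby(records_ordered_by_a_b, key=itemgetter(0))]
    # Merge the ordered runs so records come out ordered by b,
    # then group adjacent records with equal b (ordered grouping).
    merged = heapq.merge(*runs, key=itemgetter(1))
    return [(b, list(grp)) for b, grp in groupby(merged, key=itemgetter(1))]

# Usage sketch: data is ordered by (a, b); we want groups by b.
data = [(1, 2), (1, 5), (2, 2), (2, 3), (3, 5)]
print(group_by_b(data))
# -> [(2, [(1, 2), (2, 2)]), (3, [(2, 3)]), (5, [(1, 5), (3, 5)])]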
Performance optimization skill: Ordered Grouping
I. Problem introduction & solution
Generally, the hash method is adopted for grouping operations: first calculate the hash value of the grouping field, save the records with the same hash value into a small set, then traverse the records in each small set to find those with the same grouping field value and aggregate them into groups. The complexity (the number of comparisons) of the grouping depends on the duplicate value rate of the hash function. When the hash space is small, the duplicate value rate and the number of comparisons are high, which degrades performance greatly; to avoid this, we need to allocate more memory to store the hash table. Moreover, hash calculation is slow for certain data types (such as long strings), which also hurts performance badly.
If the grouping field is ordered, each record only needs to be compared with the previous record during grouping. If the grouping field value of the current record differs from that of the previous one, a new group is created; if they are the same, the record is aggregated into the current group. The complexity of such a grouping operation is n (the length of the set to be grouped), and there is no hash calculation or duplicate value problem, so it executes faster than hash grouping and doesn’t need much memory for a hash table.
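A minimal sketch of ordered grouping in Python (illustrative, not the SPL implementation): because the input is ordered by the grouping field, each record is compared only with the previous one, and a single linear pass builds the groups with no hash table at all.

def ordered_group(records, key):
    """records: an iterable already ordered by the grouping key.
    Returns a list of (key_value, group) pairs built in a single pass."""
    groups = []
    current_key, current_group = object(), None  # sentinel: no group yet
    for rec in records:
        k = key(rec)
        if k != current_key:            # compare only with the previous record's key
            current_key, current_group = k, []
            groups.append((k, current_group))
        current_group.append(rec)
    return groups

# Usage sketch: n comparisons in total, no hash calculation.
rows = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
print(ordered_group(rows, key=lambda r: r[0]))
# -> [('a', [('a', 1), ('a', 2)]), ('b', [('b', 3)]), ('c', [('c', 4), ('c', 5)])]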
Performance optimization skill: Associating Big Fact Table with Big Dimension Table
I. Problem introduction & Solution
In Performance Optimization Skill: Associating Small Fact Table with Big Dimension Table, the SPL algorithm exploits the fact that a small fact table can be loaded entirely in memory: it collects and sorts all the association key values from the fact table and then searches the dimension table for the target records, which avoids traversing the big dimension table. But how can we improve performance if both the fact table and the dimension table exceed the available memory?
The solution provided by SQL is to HASH partition both the fact table and the dimension table into parts small enough to fit in memory, write the parts to external storage, and then load each pair of corresponding parts into memory to do the in-memory association. Unfortunately, if a certain part is still too large for memory, a second round of HASH partitioning is needed. Moreover, the HASH partitioning has to be done on both tables, that is, all the data of both tables must be buffered.
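A simplified sketch of the hash-partition-and-join idea just described, in Python; in-memory lists stand in for the external-storage buffers, and a real implementation would spill each partition to disk and re-partition any part that is still too large:

def partitioned_hash_join(fact, dim, fact_key, dim_key, n_parts=4):
    """Partition both tables by a hash of the join key, then join part by part.
    Lists stand in for external-storage buffers here."""
    fact_parts = [[] for _ in range(n_parts)]
    dim_parts = [[] for _ in range(n_parts)]
    for r in fact:
        fact_parts[hash(fact_key(r)) % n_parts].append(r)   # buffer the whole fact table
    for r in dim:
        dim_parts[hash(dim_key(r)) % n_parts].append(r)      # buffer the whole dimension table
    result = []
    for fpart, dpart in zip(fact_parts, dim_parts):
        lookup = {dim_key(r): r for r in dpart}               # in-memory join within one part
        result.extend((f, lookup[fact_key(f)]) for f in fpart if fact_key(f) in lookup)
    return result

# Usage sketch
fact = [(1, "x"), (2, "y"), (3, "z")]
dim = [(1, "A"), (3, "C")]
print(partitioned_hash_join(fact, dim, lambda f: f[0], lambda d: d[0]))
# -> [((1, 'x'), (1, 'A')), ((3, 'z'), (3, 'C'))]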
Performance optimization skill: Associating Small Fact Table with Big Dimension Table
I. Problem introduction & Solution
When handling a primary-sub table association query, we sometimes find that the filtered fact data is small enough to be loaded wholly in memory, or only slightly exceeds it, whereas the dimension table to be associated contains a large volume of data, far larger than the available memory. In this case, if the dimension table is ordered by its key, we can use binary search to locate just the small number of dimension records corresponding to the fact records, rather than traversing all records of the dimension table as the HASH algorithm does, which improves performance effectively.
This is exactly the algorithm that SPL provides to us, and in the following part we’ll test the approach and compare it with the HASH JOIN algorithm in Oracle.
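A rough Python sketch of the idea (illustrative only; a sorted in-memory list stands in for the key-ordered dimension table on disk): sort the fact keys, then binary-search the dimension table for each of them, so only the matching dimension records are touched:

from bisect import bisect_left

def lookup_dimension(fact_keys, dim_sorted):
    """dim_sorted: list of (key, ...) records ordered by key (stand-in for the on-disk table).
    Returns the dimension records matching the fact keys, located by binary search."""
    dim_keys = [r[0] for r in dim_sorted]
    found = {}
    for k in sorted(set(fact_keys)):          # sort and deduplicate fact keys first
        i = bisect_left(dim_keys, k)          # binary search instead of traversing the table
        if i < len(dim_keys) and dim_keys[i] == k:
            found[k] = dim_sorted[i]
    return found

# Usage sketch
dim = [(1, "A"), (2, "B"), (5, "E"), (9, "I")]
print(lookup_dimension([5, 1, 5], dim))
# -> {1: (1, 'A'), 5: (5, 'E')}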
Performance optimization skill: Attached Table
I. Problem introduction & Solution
Though we already have the ordered MERGE scheme to boost the performance of the join between a primary table and its sub table (see Performance Optimization Skill: Ordered MERGE), we never intend to stop looking for a faster one. This time we will use the esProc attached table to speed up primary-sub table association. In esProc, a composite table can save multiple tables, such as a primary table and its sub table, in one file. We create a composite table file from the primary table and then attach the sub table to it. The sub table thus becomes an attached table that associates with the primary table through a field which is also the dimension field of the primary table.
When the composite table stores the association field of the attached table, it stores the association key only once, in the primary table, rather than physically in the sub table. This storage structure reduces disk reads when retrieving the attached table. The sub table is attached to the primary table through the association field, which amounts to a pre-association; compared with the ordered MERGE algorithm it spares the key comparisons needed for association, making the operation faster.
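As a toy analogy (plain Python data structures, not the actual composite table format), pre-association can be pictured as storing the sub records nested under their primary record, so the association key is kept once and retrieving the associated records needs no key comparison; the field names below are made up for illustration:

# Toy analogy of the attached-table idea: the association key is kept once,
# on the primary record, and the sub records are stored nested under it.
orders = [
    {"order_id": 1, "customer": "ACME",
     "details": [{"product": "P1", "qty": 2}, {"product": "P2", "qty": 5}]},
    {"order_id": 2, "customer": "Globex",
     "details": [{"product": "P1", "qty": 1}]},
]

# "Joining" primary and sub records is just a nested traversal, no key matching needed.
for order in orders:
    for d in order["details"]:
        print(order["order_id"], order["customer"], d["product"], d["qty"])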
Performance optimization skill: Ordered Locate Association to speed up the filtering on joined primary-sub tables
I. Problem introduction & Solution
The ordered MERGE algorithm has proved a big success in improving the association performance in Performance Optimization Skill: Ordered MERGE. Yet we always want faster performance, so this time we try to load as few records as possible.
Usually, the association of a primary table and its sub table is followed by further calculations such as filtering. The ordered MERGE algorithm first retrieves the primary table and the sub table separately and then associates them. When many records in one of the two tables are filtered out, the other table is left with many records that can no longer be associated; since this is not known before the MERGE is done, those records are still loaded, which wastes a lot of time.
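The intuition can be sketched in Python as follows (sorted in-memory lists stand in for the key-ordered tables on disk, and the filter condition is hypothetical): filter the primary table first, keep its surviving keys, and fetch from the sub table only the records whose keys survive, instead of merging everything and filtering afterwards:

def filtered_locate_join(primary_sorted, sub_sorted, keep):
    """primary_sorted / sub_sorted: (key, ...) records ordered by key.
    keep: a predicate applied to primary records before any sub records are read."""
    kept = [p for p in primary_sorted if keep(p)]        # filter the primary table first
    kept_keys = {p[0] for p in kept}
    # Only sub records whose key survived the filter are "loaded"; in the real
    # algorithm the key order lets whole blocks be skipped, here we just test membership.
    matched_sub = [s for s in sub_sorted if s[0] in kept_keys]
    by_key = {}
    for s in matched_sub:
        by_key.setdefault(s[0], []).append(s)
    return [(p, by_key.get(p[0], [])) for p in kept]

# Usage sketch
primary = [(1, "N"), (2, "Y"), (3, "Y")]
sub = [(1, 10), (2, 20), (2, 21), (3, 30)]
print(filtered_locate_join(primary, sub, keep=lambda p: p[1] == "Y"))
# -> [((2, 'Y'), [(2, 20), (2, 21)]), ((3, 'Y'), [(3, 30)])]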
Performance optimization skill: Ordered MERGE
I. Problem introduction & solution
Relational databases use a segmented HASH approach to associate tables. Suppose the sizes (numbers of records) of the two tables to be joined are N and M respectively; then the computational complexity (i.e., the number of join-field comparisons) of the HASH segmentation method is about SUM(Ni*Mi), where Ni and Mi are the numbers of records of the two tables whose HASH value is i, satisfying N=SUM(Ni) and M=SUM(Mi). The value of SUM(Ni*Mi) is probably much smaller than the complexity N*M of a full traversal (sometimes even K times smaller, where K is the range of HASH values).
If both tables are ordered by their join keys, we can use the MERGE algorithm to implement the association, whose complexity is N+M. If both N and M are relatively large (generally much larger than K), N+M will be far smaller than SUM(Ni*Mi). That is, the MERGE algorithm is a lot faster than the segmented HASH approach.
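A minimal merge-join sketch in Python (illustrative, not the SPL implementation): since both inputs are ordered by the join key, one synchronized forward pass of roughly N+M comparisons produces the association:

def merge_join(left, right, key=lambda r: r[0]):
    """left, right: lists of records ordered by the join key (keys assumed unique here).
    One forward pass over both lists, roughly N + M comparisons."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl == kr:
            result.append((left[i], right[j]))
            i += 1
            j += 1
        elif kl < kr:
            i += 1
        else:
            j += 1
    return result

# Usage sketch
a = [(1, "a1"), (3, "a3"), (4, "a4")]
b = [(1, "b1"), (2, "b2"), (3, "b3")]
print(merge_join(a, b))
# -> [((1, 'a1'), (1, 'b1')), ((3, 'a3'), (3, 'b3'))]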
Performance optimization skill: Association For Dimension Table Filtering & Computation
A join query between the fact table and a dimension table often involves filtering and computation over the dimension data, which can be optimized in two ways:
1. First join the fact table and the dimension table (pre-associate them if they can fit into memory) and then filter the associated fact table, just like the calculations in Performance Optimization Skill: Pre-Association and Performance Optimization Skill: Numberizing Foreign Key.