# I. Problem introduction & solution

When performing sorting operation on the data set, the following situation sometimes may occur: when the data set T is already ordered by field “a” but not by field “b”, we want to sort T by fields “a”, “b”, and this is what we call the first-half ordered sorting (“a” is ordered). For this case, we come up with an optimization technique to speed up the sorting: first retrieve a group of records with the same “a” value from T and sort them by field “b”; then get next group of records with the same “a” value in turn and do the sorting by “b”; and continue the above actions until all the records in T are retrieved and sorted. With this method, we don’t need to sort the records in T all at once, instead, only a group of them is retrieved each time, which only requires the memory to be able to hold a small group of records.

However, it’s a pity that SQL doesn’t support this algorithm, and it always requires the full sorting on all records. While SPL is in support of this algorithm, so we’ll test it and compare it with the sorting in Oracle.

# I. Problem introduction & solution

First of all, what is second-half ordered grouping? When data set T is already ordered by field “a” and field “b” and we want to sort or group the table T by field “b”, in this case, in the segment with the same “a” value, field “b” is always ordered. So this is the scenario where the sorting field/grouping field is ordered in each segment grouped by the first-half field, and we name it as second-half ordered grouping.

As we all know, the quick-sort algorithm is to segment data then sort and merge recursively. For data set that is already ordered by the second-half fields, the execution speed of quick-sort algorithm can be very fast already. So if we sort the data set T by field “b” using the quick-sort algorithm, then we can optimize the grouping with the technique introduced in Performance Optimization Skill: Ordered Grouping.

# I. Problem introduction & solution

Generally, the hash method is adopted in grouping operation, i.e., calculate the hash value of the grouping field first, save the records with the same hash value in a small set, then traverse the records in the small set to find those with the same grouping field value and aggregate them into groups. The complexity (number of comparisons) of the grouping depends on the duplicate value rate of the hash function. More specifically, when the hash space is small, the duplicate value rate and the number of comparisons will be high, thus degrading the performance greatly. So to avoid this situation, we need to allocate more memory to store the hash table. Also, the hash calculation will become slow in terms of certain data types (like long strings), which affects the performance badly.

If the grouping fields are ordered, each record only needs to be compared with the previous record when grouping. If the grouping field of the current record is different from that of the previous one, a new group will be created; if the grouping fields are the same, the record will be aggregated into the current group. The complexity of such grouping operation is n (the length of the set to be grouped) without the problem of hash calculation and duplicate value rate, which executes faster than the hash grouping and does not need much memory to store the hash table.

# I Problem introduction & Solution

In Performance Optimization Skill: Associating Small Fact Table with Big Dimension Table, the SPL algorithm utilizes the feature of a small fact table which can be loaded in memory, thus it sorts and collects all the association key values from the fact table, and searches in the dimension table to find the target records, which avoids traversing the big dimension table. But how can we improve the performance if both the fact table and the dimension table exceed the memory?

The solution provided in SQL is to respectively HASH partition the fact table and the dimension table into small parts that can be loaded in memory, write them on the external storage, and then load each of them in memory to do the in-memory association. Unfortunately, if a certain part is still too large for the memory, a second HASH partitioning is needed. Meanwhile, we need to do HASH partitioning on both tables, that is, buffering all the data of both tables.

# I Problem introduction & Solution

When we handle the primary-sub table association query, we sometimes find that the filtered fact data is small enough to be wholly loaded in memory or slightly exceeds the memory whereas the dimension table to be associated contains a large volume of data, far larger than the available memory space. In this case, if the dimension table is ordered by its key, we can use binary search to locate the small amount of dimension records corresponding to the fact records all at once rather than traverse all records of the dimension table like what the HASH algorithm does, which improves the operation performance efficiently.

This is exactly the algorithm that SPL provides to us, and in the following part we’ll test the approach and compare it with the HASH JOIN algorithm in Oracle.

# I. Problem introduction & Solution

The ordered MERGE algorithm has proved a big success in improving the association performance in Performance Optimization Skill: Ordered MERGE. Yet we always want faster performance, so this time we try to load as few records as possible.

Usually, the association of a primary table and its sub table is followed by other more calculations such as filtering. The ordered MERGE algorithm first retrieves the primary table and the sub table respectively and then performs association on them. When many records in one of the two associated tables are filtered out, the other associated table are left with many records that cannot be associated, which is not known before doing the MERGE, so they will still be loaded, thus wasting a lot of time.

# I Problem introduction & solving

Relational databases use segmented HASH approach to associate tables. Suppose the sizes (number of records) of the two tables to be joined are N and M respectively, then the computational complexity (i.e. the comparison number of join field) of HASH segmentation method is about SUM(Ni*Mi), in which Ni and Mi are respectively the number of records of the two tables with HASH values as i, satisfying expressions N=SUM(Ni) and M=SUM(Mi). The value of SUM(Ni*Mi) is probably much smaller than the complexity N*M of full traversal (sometimes even K, the range of HASH values, times smaller).

If both tables are ordered by join keys, we can use MERGE algorithm to implement the association, whose complexity is N+M in this case. If both N and M are relatively large (generally much larger than K), N+M will be far smaller than SUM(Ni*Mi). That is, the MERGE algorithm is a lot faster than segmented HASH approach.

## Performance optimization skill: Association For Dimension Table Filtering & Computation

Often a join query between the fact table and a dimension table involves filtering and computation over the dimension data, which can be optimized in two approaches:

1. First join the fact table and the dimension table (pre-associate them if they can fit into the memory) and then filter the associated fact table just like the calculations in Performance Optimization Skill: Pre-Association and Performance Optimization Skill: Numberizing Foreign Key

# I. Problem introduction & solving

In Performance Optimization Skill: Partial Pre-Association, we introduced the technique of loading dimension tables in memory and pre-associating them which effectively improves performance when the memory isn’t large enough to hold all the data. However, when associating the fact table with dimension tables, we still need to do hash computation and comparison, which can also be improved by another optimization skill, i.e., numberizing foreign key.

The idea is this: if the primary key values of a dimension table are natural numbers starting from 1 (which are equivalent to row numbers), we can locate a record in the dimension table through the key values, i.e., according to the row numbers, rather than by computing and comparing hash values. This method speeds up the association of a dimension table with the fact table and eventually makes the whole computation faster. The numberization-based location makes it unnecessary to create indexes on dimensions and thus reduces memory consumption.

# I. Problem introduction & solving

In Performance optimization skill: Pre-Association, we tested the technique of pre-loading data tables in memory and querying after pre-association. But if the size of dimension table(s) and fact table exceeds the memory, we can load the dimensional table(s) only in memory and index it (them) to achieve at least half of the hash computation, which is called partial pre-association.

We’ll use the lineitem table, which has the largest data volume and can not be loaded in memory, to test this solution by pre-loading the seven relatively small dimension files and leaving lineitem, the fact file, to be loaded in real time during the query.