How to Make TopN Operations Fast?

The algorithm of achieving SQL TopN operations is to sort records and get top N values or records. Some databases provide good optimization strategies, avoiding sorting the entire set to get the top N and acquiring higher performance. Yet facing the complicated scenario of getting the top N from each group after the grouping operation, SQL is hard to code the algorithm and database optimization is even harder to implement. Generally, sorting the whole set is practically unavoidable. The sorting, however, is costly. Performance will be seriously affected, particularly when data cannot fit into the memory and external data exchanges are needed. Besides, it is difficult to handle sorting on a whole set with parallel processing, and databases cannot make good use of the parallel processing to achieve TopN operations.

Performance of TopN operations will be greatly increased if we can manage to get rid of sorting on an entire set. One algorithm is to keep a small set with the length of N, and, during traversal, store the obtained desired values or records in the set. The specific steps are: Insert the newly-retrieved record into the small set and remove the record containing the current Nth value if the new target field value is larger than the latter, otherwise remain unchanged. This way data will be traversed only once without sorting. Usually, the size of desired top n is relatively small and can be wholly stored in the memory, making external data exchanges unnecessary. The approach also makes segment-based parallel processing easy to implement. It gets top n from each segment, concatenates all results of top n, and obtains the final top n values or records, making full use of the parallel processing to acquire better performance.

In essence, the new algorithm treats TopN operations that return a smaller set as a type of aggregate calculation. This reduces the post-grouping, intra-group TopN operations to the regular complexity of a grouping & aggregation operation, such as sum and count, avoiding sorting on the entire set and ensuring good performance.

Yet there is a problem. SQL does not have explicit set data type and cannot express this algorithm.

Still, we can use the open-source esProc SPL to achieve it in order to strongly boost performance of TopN operations.

An example. Below is the SPL sample code for getting top5 records from the whole set:
=file("T.ctx").open().cursor@m(x).total(top(-5,x), top(5,x))
@m option enables multi-threaded processing. Expression top(-5,x) gets records containing the five largest x values, and expression top(5,x) gets records containing the five smallest x values.

Below is the SPL sample code for getting top3 records from each group:
=file("T.ctx").open().cursor@m(x,y).groups(y;top(-3,x), top(3,x))
The code groups records by y field, and get records containing three largest x values and those containing three smallest values from each group.

Tests show that, under the same hardware environment, performance of SPL TopN operations on an entire set is as good as that of Oracle SQL ones on the same set, yet for achieving intra-group TopN operations, SPL is over 9 times faster than Oracle. The results means that Oracle offers optimization for entire set TopN operations but that the optimizer fails to work for getting TopN from each group and sorting is again used.

Find detailed SPL optimization techniques and more comparison tests in Performance Optimization Techniques:For TopN

Leave a Reply