# Is OLAP Pre-Aggregation for Speeding up Multidimensional Analysis Reliable?

Multidimensional analysis usually involves large data sets, and summarizing the original detail data ad hoc for each business query is inefficient. So, pre-aggregation is used to speed up queries by preparing the desired result sets in advance; they can then be retrieved directly to guarantee real-time responses for interactive analysis.

## About pre-aggregation strategies

Pre-aggregation achieves very short response times through highly disk-consuming data pre-summarization. In theory, we could pre-summarize all combinations of dimensions so that any multidimensional analysis scenario can be served from a stored combination. In practice, this is hardly feasible. A 50-dimension full pre-aggregation takes up to one million terabytes of storage (even if we take each intermediate CUBE as only 1KB, though they are actually much bigger), which is a total of one million 1TB HDDs! Even if we reduce the 50 dimensions to 20 for the pre-aggregation, the expected storage space is still 470000TB, equivalent to hundreds of thousands of 1TB HDDs.
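The full pre-aggregation arithmetic can be checked with a quick back-of-envelope sketch (a hypothetical estimate in Python; the 1KB-per-cuboid figure is the text's own simplifying assumption, and real cuboids are far larger):

```python
CUBOID_SIZE_KB = 1  # the text's deliberately optimistic assumption

def full_preagg_size_tb(n_dims):
    # Every subset of the n dimensions gets its own cuboid: 2^n in total.
    n_cuboids = 2 ** n_dims
    return n_cuboids * CUBOID_SIZE_KB / 1024 ** 3  # KB -> TB

# 2^50 cuboids at 1KB each is about one million terabytes.
print(round(full_preagg_size_tb(50)))  # 1048576
```

Even under this unrealistically small cuboid size, full pre-aggregation over 50 dimensions is out of reach for any practical cluster.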

For an estimate of how much space the pre-aggregated result sets for multidimensional analysis will occupy, refer to Storage volume of pre-aggregation in multi-dimensional analysis.

Since full pre-aggregation is unfeasible, we might try partial pre-aggregation. First, we find the most commonly used dimension combinations on which multidimensional analysis will be performed for the current business. Then we build models according to business requirements and employ engineering optimization methods to further enhance efficiency. Partial pre-aggregation is a compromise between storage space utilization and query performance. Though inconvenient, it at least reduces storage costs to a bearable level.

## Problems of partial pre-aggregation

But in reality, even if we choose to ignore the disk usage problem, the pre-aggregation strategy only applies to a small number of fixed, static query needs. It cannot handle the complex and dynamic scenarios that are actually more common in real-world business situations. Below are four of them:

**1.** **Uncommon aggregate operations**

The pre-aggregation strategy summarizes measure fields to get and store the desired aggregate values. If a target aggregate operation is not performed on measures during the pre-aggregation phase, you cannot query the corresponding aggregate values directly from a pre-aggregated data cube. Common aggregate operations, such as count, sum and average, will generally be included, but less commonly used ones, like median and variance, are likely to be missed out. There are many such aggregate operations that have business significance, and it is hard to pre-handle them all. For each operation whose result set is not computed and stored in advance, an ad hoc traversal is needed to get the aggregate values.
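The deeper issue is that aggregates like median are not mergeable: unlike sum or count, a median cannot be reconstructed from per-partition pre-aggregates, as this small hypothetical Python illustration shows:

```python
import statistics

# Two pre-aggregated partitions of the same measure.
part1, part2 = [1, 2, 100], [3, 4, 5]

# Additive aggregates merge correctly from per-partition values.
assert sum(part1) + sum(part2) == sum(part1 + part2)

# But the "median of medians" is NOT the true median of the combined data,
# so a median query forces a full traversal of the detail rows.
merged_median = statistics.median(
    [statistics.median(part1), statistics.median(part2)])
true_median = statistics.median(part1 + part2)
print(merged_median, true_median)  # 3.0 3.5 -> they differ
```

This is why pre-aggregation can only ever cover the operations that were anticipated and stored up front.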

**2.** **Composite aggregate operations**

Sometimes the aggregation is not a pure, single aggregate operation like sum or average, but a composite one consisting of two or more aggregate operations on different dimensions and levels, such as getting the highest sales in each month, or calculating the average of the median incomes of different regions. It is impossible to pre-compute all such composite aggregates. Even if we were able to work them all out, disk usage would grow quadratically, exacerbating the storage space utilization problem. So, traversals are still needed.
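A composite aggregate like "the average of each month's highest sales" requires one aggregation per month first, then a second aggregation over those results, as in this hypothetical Python sketch (the order data is invented for illustration):

```python
# (month, sales amount) detail rows -- illustrative data only
orders = [("Jan", 10), ("Jan", 30), ("Feb", 20), ("Feb", 50)]

# First-level aggregation: highest sales in each month.
monthly_max = {}
for month, amount in orders:
    monthly_max[month] = max(monthly_max.get(month, 0), amount)

# Second-level aggregation: average of the monthly maxima.
avg_of_monthly_max = sum(monthly_max.values()) / len(monthly_max)
print(avg_of_monthly_max)  # (30 + 50) / 2 = 40.0
```

Because the outer aggregation depends on which inner aggregation and grouping level the user chooses at query time, the combinations cannot realistically all be stored in advance.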

**3.** **Conditional aggregation on measures**

At times a condition is applied to the measure being aggregated. For instance, we want to know the total amount of orders whose transaction amounts are above 100 yuan. The number 100 is, in most cases, a parameter input ad hoc during the interactive query phase, so pre-summing the order amounts becomes impractical.

Conditions on measures can also reference other measures, such as summing the sales amounts of orders where more than 10 items are sold. There are also temporary dimensions generated from measures, such as calculating the average income for each age group, where the rule for dividing people into age groups is passed in through a parameter. Neither scenario can be pre-summarized; the direct ability to achieve fast traversals is necessary.
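The core of the problem is that the filter threshold only arrives at query time, so the filtered sum has to be computed by a traversal with the condition applied on the fly, as in this hypothetical Python sketch (field names and data are invented):

```python
# Detail rows: item count and transaction amount per order -- invented data.
orders = [{"items": 12, "amount": 250},
          {"items": 3,  "amount": 80},
          {"items": 15, "amount": 400}]

def sum_amount(orders, min_items):
    # min_items is a query-time parameter, so this sum cannot be pre-stored;
    # the condition is evaluated during the traversal itself.
    return sum(o["amount"] for o in orders if o["items"] > min_items)

print(sum_amount(orders, 10))  # 250 + 400 = 650
```

Any pre-aggregated sum would have baked in one particular threshold, which is exactly what an interactive analysis cannot assume.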

**4.** **Aggregation by time intervals**

Time is a particularly important dimension in multidimensional analysis. Generally, we perform data slicing according to enumerated dimension values (the aggregate value is then computed for each slice). The time dimension is special: time values can be both enumerated and divided into continuous intervals for data slicing, so we can query data between any two times/dates/datetimes. Dimension-based pre-aggregation, however, usually computes aggregate values at a fixed dimension level, such as day or month. A time interval from May 8 to June 12 cannot be described by any regular dimension value or level.

It is also possible that the summarization involves two or more interval-style dimensions at different levels. We might want to find the total amount of products sold between May 8 and June 12 and produced between January 9 and February 17, for instance. When there is only one interval-style dimension, we can pre-compute values to some extent and perform the aggregation by traversing an intermediate data cube. But computation becomes extremely complicated for a composite query based on two or more interval-style dimensions, and the pre-aggregation strategy becomes powerless.
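Slicing on two independent interval dimensions looks like the following hypothetical Python sketch (dates and amounts are invented): neither interval corresponds to a fixed level such as day or month, so no single pre-aggregated cuboid can answer it and the traversal must test both ranges per record.

```python
from datetime import date

# Detail rows with two interval-style dimensions -- invented data.
rows = [
    {"sold": date(2024, 5, 10), "produced": date(2024, 1, 20), "amount": 100},
    {"sold": date(2024, 6, 20), "produced": date(2024, 2, 1),  "amount": 999},
    {"sold": date(2024, 6, 1),  "produced": date(2024, 2, 10), "amount": 50},
]

# Total amount sold May 8 - June 12 AND produced January 9 - February 17.
total = sum(r["amount"] for r in rows
            if date(2024, 5, 8) <= r["sold"] <= date(2024, 6, 12)
            and date(2024, 1, 9) <= r["produced"] <= date(2024, 2, 17))
print(total)  # 150: the second row's sale date falls outside the interval
```

With one interval dimension, a sorted pre-aggregate can narrow the scan; with two, the slices cut across each other and only a traversal remains practical.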

So, the multidimensional analysis pre-aggregation strategy has not only the more noticeable storage utilization issue originating from the huge amounts of detailed data, but also another key weakness: the inability to cover the flexible scenarios described above, which are individually uncommon but collectively prevalent. The space-for-time approach can increase performance to a degree by handling a fraction of the simplest query needs, at the cost of a serious disk usage problem. It is highly unreliable as a way to improve the performance of multidimensional analysis across all scenarios. The most basic requirement for efficient multidimensional analysis is the direct capability of achieving really fast traversals. Even when pre-aggregated data is available, fast traversal is necessary to give full play to its advantages.

## Solutions

### Efficient traversal

The ability to achieve cost-effective traversals is indispensable for flexible multidimensional analysis. The analysis itself is not complex; the traversal mainly performs filtering on one or more dimensions. Conventionally, databases implement such filtering with a plain WHERE clause and handle dimension-based filtering in the same generic way, leading to unsatisfactory performance.

esProc SPL is capable of achieving high-performance dimension-based filtering operations during traversals. The **open-source** computing engine provides different dimension-based filtering mechanisms to deal with various multidimensional analysis scenarios. Let’s look at some.

The **boolean type sequence dimension** is used to optimize multidimensional analysis based on enumerated-type dimensions. Most dimensions, except for the time dimension, have purely enumerated values, such as product, region and type. The regular IN operator needs multiple rounds of comparisons to locate the eligible data (slice), which is slow; the more values specified in the IN operator, the slower the filtering. One way out is to convert the search-based query into a value-retrieval query. This esProc optimization mechanism converts enumerated dimension values to integers and the slicing condition to a sequence of boolean values, to which records are aligned during the traversal-based query. For each record, the value (true/false, or null/non-null) at the corresponding position in the sequence is obtained directly to decide whether the record is eligible, speeding up the slicing operation.
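The idea behind the mechanism can be sketched in a few lines of Python (a conceptual illustration only; the names and data are invented, and esProc's actual implementation works on its own storage format):

```python
# The enumerated dimension's distinct values, mapped to integers once.
regions = ["East", "West", "North", "South"]
region_id = {v: i for i, v in enumerate(regions)}  # value -> integer

# The slicing condition becomes a boolean sequence aligned to the values,
# so the per-record test is a single positional lookup, not an IN scan.
wanted = {"West", "South"}
mask = [v in wanted for v in regions]  # [False, True, False, True]

# Traversal: one index lookup per record decides eligibility.
records = [("East", 10), ("South", 20), ("West", 30)]
sliced = [amt for reg, amt in records if mask[region_id[reg]]]
print(sum(sliced))  # 50
```

The cost per record is constant regardless of how many values the condition lists, which is exactly what the IN operator fails to provide.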

Refer to Performance Optimization - 8.4 [Multi-dimensional analysis] Dimension of boolean sequence for a detailed explanation of the boolean type sequence dimension.

More related examples can be found in Multidimensional Analysis Backend Practice 7: Boolean Dimension and Binary Dimension.

The **bit-style tag dimension** is an optimization mechanism targeting binary dimensions (dimensions whose values are yes/no or true/false, such as whether a person is married, is college-educated, or has a credit card). Since a tag dimension has only two possible values, it can be stored in one bit. A 16-bit small integer can store 16 tags that would otherwise occupy 16 fields. This storage method significantly reduces the amount of data to be stored, and thus the number of disk reads, while the small integers do not slow down retrieval.

Refer to Performance Optimization - 8.5 [Multi-dimensional analysis] Flag bit dimension for a detailed explanation of the bit-style tag dimension.

More related examples can be found in Multidimensional Analysis Backend Practice 7: Boolean Dimension and Binary Dimension.

**Redundancy sorting** is an optimization mechanism that exploits ordered storage to increase retrieval (traversal) speed. It sorts all dimension fields in the order D1…Dn and stores the result set, then sorts them again in the inverse order Dn…D1 and stores that result set as the redundant copy. In this way, for any dimension D there is always a data set in which D falls in the first half of the sort key, so the records matching a filter on D occupy a larger continuous storage area and can be retrieved faster.
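The selection rule can be sketched as follows (a hypothetical Python illustration of the idea; esProc applies it to its own columnar storage):

```python
# Two redundant copies of the data, sorted by D1..Dn and by Dn..D1.
dims = ["D1", "D2", "D3", "D4"]
copies = {"forward": dims, "backward": dims[::-1]}

def pick_copy(filter_dim):
    # Answer a filter on some dimension from the copy in whose sort order
    # that dimension appears earliest: its matching rows are most contiguous.
    return min(copies, key=lambda name: copies[name].index(filter_dim))

print(pick_copy("D1"), pick_copy("D4"))  # forward backward
```

Doubling the stored data is a far milder space-for-time trade than full pre-aggregation, yet it keeps every dimension reasonably close to the front of some sort order.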

Refer to Performance Optimization - 8.3 [Multi-dimensional analysis] Redundant sorting for a detailed explanation of redundancy sorting.

More related examples can be found in Multidimensional Analysis Backend Practice 4: Pre-aggregate and Redundant Sorting.

esProc also offers many other high-performance computing mechanisms, including high-performance storage, order-based calculations, and convenient, efficient parallel processing, which suit both multidimensional analysis scenarios and other data processing scenarios. By employing these capabilities, users can achieve truly efficient multidimensional analysis.

### Supplemental pre-aggregation strategies

Yet pre-aggregation is still useful in some relatively fixed query scenarios. esProc also provides its own pre-aggregation capabilities.

**Partial pre-aggregation**

The esProc partial pre-aggregation strategy can both choose the best combinations of dimensions for summarization according to business experience and store aggregate result sets dynamically according to front-end query targets. esProc uses several techniques to optimize the pre-aggregation process, including aggregating from a lower-level intermediate CUBE (cuboid) instead of the original data, and automatically selecting the smallest cuboid among those containing the desired dimensions.
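The cuboid-reuse rule amounts to: among pre-aggregated cuboids whose dimensions cover the query's dimensions, pick the smallest and re-aggregate from it rather than from the fact table. A hypothetical Python sketch of that selection (cuboid names and sizes are invented):

```python
# Pre-aggregated cuboids: name -> (dimension set, stored row count) -- invented.
cuboids = {
    "cube_all": ({"area", "product", "month"}, 1_000_000),
    "cube_am":  ({"area", "month"}, 50_000),
}

def pick_cuboid(query_dims):
    # A cuboid is usable iff its dimensions cover the query's dimensions;
    # among the usable ones, the smallest minimizes re-aggregation work.
    candidates = [(rows, name) for name, (dims, rows) in cuboids.items()
                  if query_dims <= dims]
    return min(candidates)[1] if candidates else None

print(pick_cuboid({"area"}))             # cube_am
print(pick_cuboid({"area", "product"}))  # cube_all
```

Re-aggregating 50,000 cuboid rows instead of scanning a million-row fact table is where the saving comes from.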

esProc provides the T.cuboid(Cube1,D1,…;sum(M1),…) function for creating a CUBE, which is simple to use. The T.cgroups(D1,…;sum(M1),…) function then automatically selects the suitable pre-aggregated data for further computation according to the above-mentioned logic.

**Time-interval-based pre-aggregation**

esProc specially provides the **time-interval-based pre-aggregation** strategy to handle time-interval-based analysis. As mentioned above, you cannot pre-compute values for queries based on dynamic time intervals, but engineering means can still facilitate them. Data for the whole months within the specified interval can be retrieved from pre-aggregated monthly values, and the partial months at the beginning and end can be covered by pre-aggregated daily values. For queries involving long time spans, this trick can reduce the computation amount by several to dozens of times.

This is easy to achieve in SPL. We just need to add a conditional parameter to the cgroups function:

| | A |
|---|---|
| 1 | =file("orders.ctx").open() |
| 2 | =A1.cuboid(CubeDay,dt,area;sum(amount)) |
| 3 | =A1.cuboid(CubeMonth,month@y(dt),area;sum(amount)) |
| 4 | =A1.cgroups(area;sum(amount);dt>=date(2020,1,22)&&dt<=date(2020,9,8)) |
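The month/day split that the engine performs behind the scenes can be sketched in Python (a hypothetical illustration of the decomposition only, not esProc's internal code):

```python
from datetime import date, timedelta

def split_interval(start, end):
    # Decompose [start, end] into whole months (served by monthly
    # pre-aggregates) plus ragged head/tail day ranges (served by daily ones).
    first = start if start.day == 1 else (
        start.replace(day=28) + timedelta(days=4)).replace(day=1)
    nxt = end + timedelta(days=1)
    last = nxt - timedelta(days=nxt.day)  # last day of last whole month
    if first > last:
        return [], (start, end), None     # interval contains no whole month
    months, (y, m) = [(first.year, first.month)], (first.year, first.month)
    while (y, m) != (last.year, last.month):
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
        months.append((y, m))
    head = (start, first - timedelta(days=1)) if start < first else None
    tail = (last + timedelta(days=1), end) if end > last else None
    return months, head, tail

months, head, tail = split_interval(date(2020, 1, 22), date(2020, 9, 8))
print(len(months), head, tail)  # 7 whole months, plus Jan 22-31 and Sep 1-8
```

For the interval in the SPL example above, seven monthly aggregates plus about two weeks of daily aggregates replace a scan over more than seven months of detail rows.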

As we already said, pre-aggregation is a solution only for simple, fixed multidimensional analysis scenarios. Most common scenarios need a computing engine, such as esProc, that performs efficient direct traversals. esProc's partial pre-aggregation and time-interval-based schemes greatly help enhance performance while reducing storage costs.

In summary, the best tool or mechanism for performing multidimensional analysis should be widely applicable, high-performance and cost-effective.