Category: SPL application scenarios

How to Enhance Performance of Multidimensional Analysis Involving a Lot of Dimension Tables?

In a multidimensional application, a fact table usually comes with many dimension tables; an orders table, for example, is associated with a whole series of dimension tables.

The association between each dimension table and the fact table is a JOIN as defined in SQL. Technically, databases implement these joins with the HASH JOIN algorithm, one join at a time. When there are N joins, N HASH JOIN passes are needed, and each intermediate result set is retained for use in the next join. The computing process is complex and involves multiple traversals, which markedly drags down performance. continue reading →
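For illustration, here is a minimal Java sketch of a single hash join pass, using hypothetical Order and City record types (Java 16+). It is only a sketch of the build-and-probe idea; a real engine repeats this cycle once per dimension table and materializes an intermediate result after each pass:

```java
import java.util.*;

public class HashJoinSketch {
    record City(int id, String name) {}                  // hypothetical dimension table
    record Order(int id, int cityId, double amount) {}   // hypothetical fact table

    public static void main(String[] args) {
        List<City> cities = List.of(new City(1, "Beijing"), new City(2, "Berlin"));
        List<Order> orders = List.of(new Order(100, 1, 25.0), new Order(101, 2, 40.0));

        // Build phase: hash the dimension table on its key.
        Map<Integer, City> cityByKey = new HashMap<>();
        for (City c : cities) cityByKey.put(c.id(), c);

        // Probe phase: traverse the fact table and look up each foreign key.
        for (Order o : orders) {
            City c = cityByKey.get(o.cityId());
            System.out.println(o.id() + " -> " + (c == null ? "no match" : c.name()));
        }
        // With N dimension tables, the database performs N such build/probe passes,
        // keeping an intermediate result set between passes.
    }
}
```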

How Java Handles Data Returned from WebService/RESTful

The XML/JSON returned from WebService/RESTful is multilevel data, which is easy to parse yet difficult to compute. The popular approach is to parse the XML/JSON with an open-source class library, such as Dom4J or JsonPath, and then compute the data with XPath or JsonPath. This approach is lightweight and integration-friendly, but it handles only conditional queries; all other computations must be hardcoded. If high computing performance is required, we can load the XML/JSON into a small database, such as SQLite or MySQL, parse the data with the corresponding special functions, and perform computations in regular SQL. This involves a heavy framework and a complicated, time-consuming loading process. Moreover, SQL performs far better on flat structured records than on multilevel data.
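As a small illustration of the JsonPath route, the sketch below uses the Jayway JsonPath library on a hypothetical orders payload; the conditional query is one line, while anything beyond it would have to be hand-coded in Java:

```java
import com.jayway.jsonpath.JsonPath;
import java.util.List;
import java.util.Map;

public class JsonPathDemo {
    public static void main(String[] args) {
        // Hypothetical payload returned by a RESTful service.
        String json = "{\"orders\":[{\"id\":1,\"amount\":1200},{\"id\":2,\"amount\":300}]}";

        // Conditional query: JsonPath expresses this directly.
        List<Map<String, Object>> big = JsonPath.read(json, "$.orders[?(@.amount > 1000)]");
        System.out.println(big);

        // Grouping, joins and other set-oriented computations, however,
        // are outside JsonPath's scope and must be hardcoded.
    }
}
```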

An ideal alternative is esProc SPL, an open-source class library. It has a lightweight integration framework for parsing data returned from WebService/RESTful directly, and rich functions that make computing multilevel data straightforward. continue reading →
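A minimal sketch of how Java might invoke SPL through esProc's JDBC driver, assuming a hypothetical SPL script restQuery.splx that fetches and computes the service's JSON; the script name and URL are placeholders, not part of the library:

```java
import java.sql.*;

public class SplFromJava {
    public static void main(String[] args) throws Exception {
        // Load the esProc JDBC driver shipped with the library.
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             // restQuery.splx is a hypothetical SPL script that parses and
             // computes the multilevel JSON returned by a RESTful service.
             PreparedStatement st = con.prepareStatement("call restQuery(?)")) {
            st.setObject(1, "https://example.com/api/orders"); // hypothetical endpoint
            try (ResultSet rs = st.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getObject(1));
                }
            }
        }
    }
}
```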

How Java Performs JOINs on MongoDB

MongoDB's official query language can handle joins. It achieves left joins with the $lookup operator and other types of joins by modifying $lookup pipelines; a cross join, however, can hardly be expressed through such modifications. Besides, the JSON-style syntax is complex, verbose, and difficult to use.
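A sketch of the left join from Java using the MongoDB driver's aggregation helpers; the collection and field names (orders, customers, custId) are hypothetical:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;
import java.util.List;

public class MongoLookupJoin {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // Left join: orders.custId = customers._id, results attached as "customer".
            for (Document d : client.getDatabase("test")
                    .getCollection("orders")
                    .aggregate(List.of(
                        Aggregates.lookup("customers", "custId", "_id", "customer")))) {
                System.out.println(d.toJson());
            }
        }
    }
}
```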

Calcite, which supports general SQL, gives rather good support for various types of joins. But it has big issues. Its overall computing ability is extremely weak – it does not even support fuzzy queries or date functions. Nor does it support joins directly within a multilevel collection; instead, the collection has to be restructured into single-level collections. continue reading →

How Java Queries or Analyzes MongoDB Data

One approach is to use MongoDB's official JSON-style syntax. It is reliable and stable, but the syntax is strange and hard to learn, its computing ability is inadequate, with no support even for many basic operations, and the resulting code is rather complex. Another is to use Hibernate Criteria, which follows a functional programming style. This method is closer to natural language and easy to understand, but it has a heavy framework and weak computing ability. The third is to use Calcite, which supports universal SQL. The learning cost is low, but so is the computing ability; besides, configuration is complicated, and users need to structuralize the collection.
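For reference, a small example of the first approach using the official MongoDB Java driver; the collection and field names are hypothetical:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class MongoOfficialQuery {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // Filtering and sorting are straightforward with the official driver.
            for (Document d : client.getDatabase("test").getCollection("orders")
                    .find(Filters.and(Filters.eq("state", "NY"), Filters.gt("amount", 1000)))
                    .sort(Sorts.descending("amount"))) {
                System.out.println(d.toJson());
            }
            // Anything beyond this (joins, window-style calculations) means dropping
            // into the aggregation pipeline or hand-written Java code.
        }
    }
}
```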

esProc SPL is a better alternative. This open-source class library boasts remarkable computational capabilities, is easy to learn, does not require structuralizing the collection, has a light integration framework, and is simple to configure. continue reading →

How Java Executes SQL on MongoDB

One approach is to start the MongoDB Connector and execute SQL through MySQL JDBC. The official tool is perfectly compatible and needs almost no configuration. Yet its computing ability is weak – the official advice is to use it only with BI tools.
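A minimal sketch of this route, assuming the MongoDB Connector for BI (mongosqld) is running locally on 3307, its usual default port, with a hypothetical orders collection exposed as a table in a database named test:

```java
import java.sql.*;

public class MongoBiConnectorSql {
    public static void main(String[] args) throws Exception {
        // The Connector speaks the MySQL wire protocol, so the plain MySQL
        // JDBC driver is used; database name and credentials are placeholders.
        String url = "jdbc:mysql://localhost:3307/test";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT state, SUM(amount) FROM orders GROUP BY state")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " " + rs.getDouble(2));
            }
        }
    }
}
```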

An alternative is to use the Calcite class library to execute general SQL on MongoDB. This solution migrates easily but offers extremely weak computing ability: it does not support even fuzzy queries and date functions, it requires converting all multilevel collections into single-level ones (otherwise some computations cannot be achieved), and its configuration is complicated. continue reading →

Is OLAP Pre-Aggregation for Speeding up Multidimensional Analysis Reliable?

Multidimensional analysis usually involves large data sets, and it is inefficient to summarize the original detail data ad hoc for each business analysis. So pre-aggregation is used to speed up queries: the desired result sets are prepared in advance and retrieved directly, guaranteeing a real-time response for interactive analysis.

About pre-aggregation strategies

Pre-aggregation is a way of achieving very short response times by trading heavy disk consumption for pre-summarized data. In theory, we could pre-summarize all combinations of dimensions so that multidimensional analysis based on any of them can be handled. In practice, this is hardly feasible. A 50-dimension full pre-aggregation covers about 2^50, or roughly 10^15, dimension combinations; even if each intermediate CUBE took only 1KB (they are actually much bigger), the storage required would reach about 1M TB – a total of one million 1T HDDs! Even if we reduce the pre-aggregation from 50 dimensions to 20, the storage space is still expected to reach 470000T, equivalent to hundreds of thousands of 1T HDDs. continue reading →

How to Make TopN Operations Fast?

SQL achieves a TopN operation by sorting the records and taking the top N values or records. Some databases provide good optimization strategies that avoid sorting the entire set, achieving higher performance. Yet for the more complicated scenario of getting the top N from each group after a grouping operation, the algorithm is hard to express in SQL and even harder for the database optimizer to handle; generally, sorting the whole set is practically unavoidable. Such sorting, however, is costly. Performance suffers seriously, particularly when data cannot fit into memory and external data exchanges are needed. Besides, sorting a whole set is difficult to parallelize, so databases cannot make good use of parallel processing for TopN operations.

Performance of TopN operations will be greatly increased if we can get rid of sorting the entire set. One algorithm is to keep a small set of length N and, during traversal, store the desired values or records in it. The specific step is: if the newly retrieved value is larger than the current Nth value, insert the new record into the small set and remove the record holding the current Nth value; otherwise leave the set unchanged. This way data is traversed only once, without sorting. Usually the desired top N is small enough to be held entirely in memory, so external data exchanges are unnecessary. The approach also makes segment-based parallel processing easy to implement: get the top N from each segment, concatenate the per-segment results, and take the top N of that union, making full use of parallel processing to acquire better performance. continue reading →
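A minimal Java sketch of this bounded-set idea, here keeping the candidates in an N-sized min-heap (values only, for brevity):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class TopNWithoutFullSort {
    // Returns the N largest values seen during a single traversal,
    // keeping only an N-sized min-heap in memory instead of sorting everything.
    static List<Double> topN(double[] values, int n) {
        PriorityQueue<Double> heap = new PriorityQueue<>(n); // smallest candidate on top
        for (double v : values) {
            if (heap.size() < n) {
                heap.offer(v);
            } else if (v > heap.peek()) {
                heap.poll();   // drop the current Nth-largest value
                heap.offer(v); // and replace it with the larger one
            }
        }
        List<Double> result = new ArrayList<>(heap);
        result.sort(null); // natural (ascending) order
        return result;
    }

    public static void main(String[] args) {
        double[] amounts = {5, 42, 17, 99, 8, 63, 21, 77};
        System.out.println(topN(amounts, 3)); // [63.0, 77.0, 99.0]
        // For parallel processing: run topN on each data segment,
        // concatenate the per-segment results, and run topN on that union.
    }
}
```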

Which Columnar Storage Scheme Is Best Suited to Parallel Processing?

When the volume of data to be processed is large, hard disk scanning and reading are time-consuming. Columnar storage, used when a table has a lot of columns but only a few are retrieved for the computation, can reduce hard disk accesses and enhance performance. That's why many data warehouse products use columnar storage.

Yet there is the issue of non-synchronized segmentation when columnar data is processed with multiple threads. Dividing a table into roughly even segments is a prerequisite for parallel processing. Segmenting a row-oriented table is simple: we divide it by data size as evenly as possible and place each boundary at a record-ending marker. It is a different matter with columnar storage, where columns are stored individually and thus must be divided separately. Moreover, the segment boundaries of different columns may not land on the same record, because field values have indefinite lengths and data is compressed, causing mismatched data retrieval. continue reading →

Is Columnar Storage Always Faster?

Columnar storage brings higher performance in many scenarios and is used by quite a few data warehouse products. When large data sets are involved, disk scanning and reading account for a good part of the computation time. Columnar storage reads only the desired columns when a table has a great many columns but the computation involves very few of them, minimizing disk accesses and boosting performance. Yet sometimes the technology is not as fast as expected; on the contrary, it can perform worse than row-oriented storage.

The indexing technique is often used to achieve high-performance searching. An index table stores key values paired with the positions of the corresponding records in the original table. With row-oriented storage, a record's position can be represented by a single number. With columnar storage, each column of a row has its own position; storing all the column positions in the index table results in costly accesses and heavy space usage. In addition, reading one record of the original table touches multiple column data zones, and since a hard disk has a minimum reading unit, the amount of column data actually read greatly exceeds what would be read if the data were stored row by row. This makes search performance under columnar storage lower than under row-oriented storage. continue reading →

How to Parse and Compute XML in Java?

There are three approaches. One is to store XML strings in a database and compute the XML in SQL. Its advantage is making the most of SQL's computing ability; its disadvantages are twofold: SQL, based on two-dimensional structured records, is not good at computing multilevel XML, and the data loading process is so complicated that it results in poor performance. The second is to parse XML with one of the Java libraries, such as XOM, Xerces-J, JDOM or Dom4J, and compute it with XPath syntax. The merit is that this is straightforward and efficient; the demerit is that XPath supports only conditional queries and aggregate operations, and all other computations need to be hardcoded.
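As an illustration of the second approach, a small Dom4J/XPath sketch over a hypothetical orders.xml (Dom4J's XPath support requires the jaxen dependency):

```java
import org.dom4j.Document;
import org.dom4j.Node;
import org.dom4j.io.SAXReader;
import java.io.File;
import java.util.List;

public class Dom4jXPathDemo {
    public static void main(String[] args) throws Exception {
        // orders.xml is a hypothetical file such as
        // <orders><order><id>1</id><amount>1200</amount></order>...</orders>
        Document doc = new SAXReader().read(new File("orders.xml"));

        // Conditional query: XPath handles this directly.
        List<Node> big = doc.selectNodes("//order[amount > 1000]");
        for (Node n : big) {
            System.out.println(n.valueOf("id") + " " + n.valueOf("amount"));
        }
        // Grouping, joins, and ordered computations are not expressible in
        // XPath alone and would have to be hand-coded in Java.
    }
}
```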

The third and best approach is to use esProc SPL, an open-source Java library, to parse and compute XML directly. SPL has a data object intended for processing multilevel data, enabling much simpler handling of multilevel XML, and its functions and syntax possess great computational capabilities, remarkably facilitating the implementation of complex computational logic. continue reading →
