Category: SPL understanding

Why do wide tables prevail?

Wide tables abound in BI. Whenever you build a BI system, the first thing to do is prepare a wide table. Sometimes the wide table may have hundreds of fields, and it often has to be split because being “too wide” exceeds the database's limit on the number of fields per table.

Why are people so keen on creating wide tables? There are two main reasons. continue reading →

Technologies for developing data business logic in Java: Hibernate and SPL

Implementing database-centric business logic in Java requires three key elements: objectification of database tables, the ability to compute structured data, and the ability to control process flow. Hibernate is an old and well-established technology that has long provided all three elements and has implemented a large amount of data business logic in many projects. SPL is a new data computing language that likewise provides these elements and can also be used to implement data business logic. This article compares the two in many aspects to find out which is more efficient for developing data business logic. Other related technologies (such as MyBatis, Querydsl, JOOQ) are not discussed in this article for reasons such as code coupling, computing capability, and maturity.

Basic features

Programming style

Hibernate offers two programming styles that differ greatly. One is object-oriented: it uses Java's EntityBean together with if/switch/while/for statements to process structured data objects and control the flow. The other uses HQL, whose style is close to SQL, to compute structured data. In contrast, SPL greatly simplifies object orientation: it has the concept of objects and can use the dot operator to access attributes and perform multi-step calculations. However, SPL cannot be regarded as a complete object-oriented language, as it lacks features such as inheritance and overloading. continue reading →
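
To make the contrast concrete, here is a minimal sketch of Hibernate's two styles. The Order entity, its ORDERS table, and the amount threshold are assumptions made for illustration (using the javax.persistence API), not code from the article.

```java
import java.util.List;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;
import org.hibernate.Session;

// Hypothetical EntityBean: one row of the ORDERS table as a Java object
@Entity
@Table(name = "ORDERS")
class Order {
    @Id private Long id;
    private String client;
    private double amount;
    public double getAmount() { return amount; }
    public String getClient() { return client; }
}

class HibernateStyles {
    // Style 1: object-oriented -- plain Java flow control over EntityBeans
    static double sumLargeOrders(List<Order> orders) {
        double total = 0;
        for (Order o : orders) {          // for statement drives the flow
            if (o.getAmount() > 1000) {   // if statement filters the records
                total += o.getAmount();
            }
        }
        return total;
    }

    // Style 2: HQL -- a SQL-like query expressed against the mapped entity
    static List<Order> largeOrders(Session session) {
        return session
            .createQuery("from Order o where o.amount > 1000", Order.class)
            .getResultList();
    }
}
```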

What is the key to making multi-tag user profile analysis run faster?

User profile analysis uses many tags to describe user attributes. Usually there are two types of tags. One type may have multiple values: for example, a user's educational background tag can take values such as middle school, university, graduate, and Ph.D., and the age range tag can take values such as child, juvenile, youth, middle age, and old age. We call this type the enumeration tag. The other type has only two values, yes and no: for example, yes or no describes whether the user is registered, is active, is a white-collar worker, or is a target user of a certain promotion program. This type is called the binary tag.

In user profile analysis scenarios, we often need to perform filtering on conditions that combine both types of tags, for example: find the users who are middle-aged, have a university educational background, are registered and active, and were target users of last year's Black Friday Sale. continue reading →
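
As a rough illustration of such a combined filter, here is a minimal sketch in plain Java. The User class, the tag encodings, and the bit positions are all hypothetical; packing binary tags into the bits of a long, so that several of them can be tested with a single bitwise AND, is one common technique for this kind of filtering.

```java
import java.util.List;
import java.util.stream.Collectors;

class User {
    int ageRange;     // enumeration tag: e.g. 3 = middle age (hypothetical code)
    int education;    // enumeration tag: e.g. 2 = university (hypothetical code)
    long binaryTags;  // binary tags packed as bit flags in one long

    User(int ageRange, int education, long binaryTags) {
        this.ageRange = ageRange;
        this.education = education;
        this.binaryTags = binaryTags;
    }
}

class TagFilter {
    // Hypothetical bit positions for three binary tags
    static final long REGISTERED   = 1L;       // bit 0
    static final long ACTIVE       = 1L << 1;  // bit 1
    static final long BLACK_FRIDAY = 1L << 2;  // bit 2: last year's target user

    static List<User> filter(List<User> users) {
        long required = REGISTERED | ACTIVE | BLACK_FRIDAY;
        return users.stream()
            // enumeration tags compared by value; the three binary tags are
            // tested in one bitwise AND instead of three boolean columns
            .filter(u -> u.ageRange == 3
                      && u.education == 2
                      && (u.binaryTags & required) == required)
            .collect(Collectors.toList());
    }
}
```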

Is distributed technology the panacea for big data processing?

Using a distributed cluster to process big data is the mainstream approach at present: splitting a big task into multiple subtasks and distributing them to multiple nodes for processing can usually achieve a significant performance improvement. Therefore, whenever processing capability proves insufficient, adding nodes to expand capacity is the first remedy many supporters think of. As a result, when we are introduced to a new big data processing technology, the first questions we ask are often whether it supports distribution and how large a cluster it can support, which shows that “distributed thinking” is already deeply rooted in our minds.

So, is distributed technology really the panacea for big data processing? continue reading →

Is SPL more difficult or easier than SQL?

SPL, a technology designed specifically for processing structured and semi-structured data, often runs several to hundreds of times faster than SQL in practice and needs much shorter code, with especially obvious advantages in complex calculations. These advantages attract strong attention from users. However, users worry about the difficulty of mastering SPL: its concepts and syntax are quite different from SQL's, and having to re-understand some concepts and learn new syntax makes them hesitant.

So, how difficult is it to get started with SPL? Let's begin with SQL to discuss the problem. continue reading →

Technologies for developing data business logic in Java: JOOQ and SPL

Many open-source technologies can implement database-centric business logic in Java. Among them, JOOQ offers more computing power than Hibernate and better migratability than MyBatis, and is therefore getting increasing attention. Likewise, esProc SPL, a new data computing language, is also outstanding in computing power and migratability. This article compares the two in many aspects to find out which is more efficient for developing data business logic. The JOOQ commercial edition mainly adds support for commercial databases and stored procedures; it is not discussed in this article.

Language features

Programming style

JOOQ supports a complete object-oriented programming style, in which multiple objects (methods) are combined to form a SQL-like syntax logic. JOOQ can use Java's Lambda expressions, functional interfaces, and flow-control syntax, so in theory it also supports function-oriented and procedure-oriented programming. However, since these expressions and syntaxes are not specially designed for JOOQ's structured data object (Result), they are inconvenient to use. continue reading →
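
As a rough illustration of that SQL-like chaining, here is a minimal sketch using JOOQ's dynamic table()/field() API, so no generated classes are required. The ORDERS table, its columns, and the in-memory H2 connection are assumptions made for illustration.

```java
import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.sum;
import static org.jooq.impl.DSL.table;
import static org.jooq.impl.DSL.using;

import java.sql.Connection;
import java.sql.DriverManager;
import org.jooq.DSLContext;
import org.jooq.Result;
import org.jooq.SQLDialect;

public class JooqStyleDemo {
    public static void main(String[] args) throws Exception {
        // Assumes an in-memory H2 database with an ORDERS table already loaded
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo")) {
            DSLContext create = using(conn, SQLDialect.H2);

            // Method calls chain together into a logic that reads like SQL
            Result<?> result = create
                .select(field("CLIENT"), sum(field("AMOUNT", Double.class)))
                .from(table("ORDERS"))
                .where(field("AMOUNT", Double.class).gt(1000.0))
                .groupBy(field("CLIENT"))
                .fetch();

            System.out.println(result);
        }
    }
}
```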

Why are batch jobs so difficult?

The detail data produced in a business system usually needs to be processed, according to certain logic, into the results we want in order to support the business activities of the enterprise. In general, such data processing involves many tasks and must be calculated in batches. In the banking and insurance industries this process is often referred to as a “batch job”, and batch jobs are also needed in other industries such as oil and power.

Most business statistics take a certain day as the cutoff day, and in order not to affect the normal business of the production system, batch jobs are generally executed at night, when the new detail data produced in the production system that day can be exported and transferred to a specialized database or data warehouse for the batch job to run on. The next morning, the batch job results can be provided to business staff. continue reading →

The current Lakehouse is something of a false proposition

From all-in-one machines and hyper-convergence to cloud computing and HTAP, we constantly try to combine multiple application scenarios and solve them with a single technology, aiming for simple and efficient use. Lakehouse, which is very hot nowadays, is exactly such a technology: its goal is to integrate the data lake with the data warehouse so that both can deliver their respective value at the same time.

The data lake and the data warehouse have always been closely related, yet there are significant differences between them. The data lake pays more attention to retaining the original information; its primary goal is to store the raw data “as is”. However, raw data contains a lot of junk data. Does storing the raw data “as is” mean that all the junk data will be stored in the data lake? Yes: the data lake is like a junkyard where all the data is kept, regardless of whether it is useful. Therefore, the first problem the data lake faces is storing massive amounts of (junk) data. continue reading →

Hadoop/Spark is too heavy, esProc SPL is light

With the advent of the big data era, data volumes keep growing. Expanding the capacity of a database running on a traditional minicomputer is difficult and costly, making it hard to support business growth. To cope with this problem, many users have turned to the distributed computing route, that is, using a cluster of inexpensive PC servers to perform big data computing tasks. Hadoop/Spark is one of the important software technologies on this route, popular because it is open source and free. After years of application and development, Hadoop has been widely accepted: not only can it be applied directly to data computing, but many new databases, such as Hive and Impala, have been developed on top of it.

The heaviness of Hadoop/Spark continue reading →

Data processing engines embedded in Java: esProc SPL, a competitor of SQLite

Many free, open-source data processing engines can be embedded in a Java application. Among them, SQLite has been around for a long time and has many users, while esProc SPL, a rising star, is also strong in functionality. This article will compare them in many aspects.
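
For a sense of what “embedded” means here, below is a minimal sketch of calling SQLite inside a Java program through the widely used sqlite-jdbc driver; the in-memory URL and the sample table are illustrative assumptions. esProc SPL is likewise embedded through its own JDBC driver, so the calling code follows the same JDBC pattern.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmbeddedSqliteDemo {
    public static void main(String[] args) throws Exception {
        // ":memory:" keeps the whole database inside the Java process
        try (Connection conn =
                 DriverManager.getConnection("jdbc:sqlite::memory:");
             Statement st = conn.createStatement()) {

            st.execute("CREATE TABLE orders(client TEXT, amount REAL)");
            st.execute("INSERT INTO orders VALUES('A', 500), ('B', 1500)");

            try (ResultSet rs = st.executeQuery(
                     "SELECT client FROM orders WHERE amount > 1000")) {
                while (rs.next()) {
                    System.out.println(rs.getString("client"));
                }
            }
        }
    }
}
```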

Basic features

Language style continue reading →
