Tag: Efficient data analysis engine
SPL Cloud Data Warehouse
The overwhelming majority of cloud data warehouse services on the market (arguably all of them) are based on SQL. After all, a data warehouse's primary responsibility is analytical computing. NoSQL technologies have advantages in handling TP (transaction processing) workloads, but they are not nearly as good as SQL at AP (analytical processing).
The performance problems of data warehouse and solutions
As data volumes continue to grow and business complexity rises, data processing efficiency becomes a serious challenge. The most typical manifestation is that data warehouse performance problems become increasingly prominent in analytical tasks: high computing pressure, low performance, long query times, queries that never return a result, or production accidents caused by batch jobs that fail to finish on time. A data warehouse with performance problems cannot serve the business well.
HTAP database cannot handle HTAP requirements
HTAP (Hybrid Transactional/Analytical Processing) has been a direction many database vendors work toward since the term was explicitly proposed in 2014. In fact, HTAP is not new: when relational databases began to emerge in the early years, the goal was precisely to use one database to handle both transactions and analysis.
Routable computing engine implements front-end database
Many large organizations have a central data warehouse that provides data services to applications. As business grows, the load on the data warehouse keeps increasing, from two directions. First, as the data backend of front-end applications, it faces a growing number of applications and concurrent queries. Second, since it also runs off-line batch jobs on raw data, data volume and computing load grow as the batch jobs multiply. As a result, the data warehouse is often overloaded, causing problems such as batch jobs that run far beyond the time the business can tolerate, and on-line queries that respond so slowly that user satisfaction keeps dropping. At the end of a month or year, when the computing load peaks, these problems get worse.
The significance of open computing ability from the perspective of SPL
A relational database provides SQL, so it has strong computing ability. Unfortunately, this ability is closed: the data to be calculated and processed must be loaded into the database in advance, and there is a hard boundary between data inside and outside the database. Open computing ability, by contrast, means data from multiple sources can be processed directly, without loading it into a database first.
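To make the contrast concrete, here is a minimal Python sketch of the "open" style: order records live in a database while the product list stays in an external CSV that is never loaded into the database, and the two are combined in application code. The table names, fields and figures are all illustrative, and Python stands in for SPL here only because the idea (mixed-source computing) carries over.

```python
import csv
import io
import sqlite3

# Orders live in a database (an in-memory SQLite stands in for it here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (product_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])

# The product list is an external CSV that never gets loaded into the database.
product_csv = io.StringIO("product_id,name\n1,Widget\n2,Gadget\n")
products = {int(r["product_id"]): r["name"]
            for r in csv.DictReader(product_csv)}

# Join the in-database aggregate with the external CSV in application code.
totals = {products[pid]: total
          for pid, total in conn.execute(
              "SELECT product_id, SUM(amount) FROM orders GROUP BY product_id")}
print(totals)  # {'Widget': 15.0, 'Gadget': 7.5}
```

A closed engine would first require importing the CSV as a database table; an open engine treats both sources as equally computable.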
How to speed up funnel analysis of e-commerce system
In an e-commerce system, conversion funnel analysis is a very important data analysis task. A user of an e-commerce platform performs multiple operations (events), such as page viewing, searching, adding to cart, placing an order and paying. These events occur in a certain order, and the later an event occurs, the fewer the users involved, just like a funnel. Usually, conversion funnel analysis first counts the number of users at each event, and then performs further calculations on the counts, such as computing the conversion rate. Since such analysis involves huge data volumes and complex calculations, it often runs into performance problems.
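The counting step described above can be sketched in a few lines of Python. This is only an illustration of the logic, assuming events arrive as (user, event, timestamp) tuples and that a user counts toward a stage only if the earlier stages happened first, in order; the stage names and data are made up, and a real implementation would also bound the funnel by a time window.

```python
# Ordered funnel stages; a user counts toward a stage only if they reached
# all earlier stages first, in chronological order.
STAGES = ["view", "search", "add_to_cart", "order", "pay"]

def funnel_counts(events):
    """events: iterable of (user_id, event_name, timestamp) tuples."""
    # For each user, keep the earliest timestamp of each event.
    first_seen = {}  # user_id -> {event_name: earliest timestamp}
    for user, name, ts in events:
        seen = first_seen.setdefault(user, {})
        if name not in seen or ts < seen[name]:
            seen[name] = ts
    counts = [0] * len(STAGES)
    for seen in first_seen.values():
        prev_ts = None
        for i, stage in enumerate(STAGES):
            ts = seen.get(stage)
            if ts is None or (prev_ts is not None and ts < prev_ts):
                break  # stage missing or out of order: this user's funnel ends
            counts[i] += 1
            prev_ts = ts
    return counts

events = [
    ("u1", "view", 1), ("u1", "search", 2), ("u1", "add_to_cart", 3),
    ("u2", "view", 1), ("u2", "search", 5),
    ("u3", "view", 2),
]
print(funnel_counts(events))  # [3, 2, 1, 0, 0]
```

The stage-to-stage conversion rate then falls out as `counts[i+1] / counts[i]`. The performance problem the article refers to is that at real scale this grouping by user over billions of events cannot be held in memory or expressed as a simple SQL aggregate.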
How to implement fast multi-index calculation
In statistical analysis applications, the various indexes calculated from detailed data are important for supporting the business. However, implementing fast and flexible multi-index calculation poses several challenges for the backend data source.
Can SPL and SQL be integrated?
SQL and SPL are both general-purpose processing technologies for structured data, each with its own strengths. SQL is highly popularized and widely used: many users can naturally query data with SQL, it is easy to get started once a data engine supports SQL, and migrating historical programs is relatively easy. SPL is concise and efficient: its more agile syntax simplifies complex calculations, and it naturally supports procedural, step-wise coding. SPL's computing system is also more open, allowing mixed computing over multiple data sources at the same time and higher performance through built-in high-performance storage and algorithms. Finally, SPL is more flexible to deploy: it can be used independently or integrated into applications.
Hadoop/Spark is too heavy, esProc SPL is light
With the advent of the big data era, data volumes continue to grow. Expanding the capacity of a database running on a traditional minicomputer is difficult and costly, making it hard to support business development. To cope with this, many users have turned to the distributed computing route: using a cluster of inexpensive PC servers to perform big data computing tasks. Hadoop/Spark is one of the important software technologies on this route, popular because it is open source and free. After years of application and development, Hadoop has been widely accepted; it can be applied to data computing directly, and many new databases, such as Hive and Impala, are built on top of it.
The heaviness of Hadoop/Spark
Hadoop was designed for clusters of hundreds of nodes, and to this end its developers implemented many complex and heavy functional modules. However, outside of a few Internet giants, national telecom operators and large banks, most scenarios do not involve data at that scale, and Hadoop clusters of only a few or a dozen nodes are common. Because of this mismatch between goal and reality, Hadoop becomes a heavy product for many users in technology, usage and cost. Below we explain why Hadoop is heavy in each of these three aspects.
ClickHouse is fast, esProc SPL is faster
ClickHouse, an open-source analytical database, is known for being fast. Is that true? Let's verify it with a comparative test.
ClickHouse vs Oracle
First, we run a comparative test of ClickHouse (CH for short) and Oracle (ORA for short) in the same hardware and software environment, using the internationally recognized TPC-H benchmark. The test executes the computing requirements defined by 22 SQL statements (Q1 to Q22) over 8 tables, on a single machine with 12 threads and about 100GB of total data. Since the TPC-H SQL statements are relatively long, we do not list them here.