Data processing engine embedding in Java: esproc SPL, a Competitor of SQLite

Many free open-source data processing engines can be embedded in Java application, among which, SQLite has been used for a long time, and used by many users, and esProc SPL, as a rising star, is also strong in functionality. This article will compare them in many aspects.

Basic features

Language style

SQLite uses the traditional SQL code (the two are equivalent in this article). SQL is popular in this industry, so there is no need to give a more introduction to it. SPL is a modern data computing language, whose style belongs to the simplified object-oriented language. SPL has the concept of objects, and can use the dot to access attributes and perform multi-step calculation. However, SPL cannot be regarded as a complete object-oriented language for it does not have related contents like inheritance and overloading.

Operation mode

SQLite is a small cross-platform database developed in C language, which can be either embedded in other development languages or executed on a single machine. SPL is a cross-platform data computing language developed in Java, which can be embedded in Java, executed on a single machine, or called remotely in the form of data computing service. Both SQLite and SPL are the interpreted-execution.


SQLite officially only provides the command-line tools, and its graphical tools need to be provided by third parties. However, due to the characteristics of SQL, most of such graphical tools do not have basic IDE functions such as breakpoint debugging and variable observation. SPL offers the graphical IDE, including complete debugging function, variable observation function in the form of table.

Learning difficulty

SQL has a long history and rich materials, and is relatively easy to learn its fundamentals, but difficult to learn when encountering complex operations. Moreover, SQL lacks the flow processing capabilities (branch and loop structures), and only with the aid of corresponding functions in Java can a complete business logic be achieved, so it is usually necessary to learn Java.

Since the objective of SPL is to simplify the codes of both Java and SQL, and many concepts are intentionally simplified, the difficulty of learning SPL is lower. SPL has the structured data computing capability and flow processing capability, and can achieve a complete business logic, without having to use other languages.

Amount of code

The code amount is small when using SQL to perform simple calculations, but when it is used to perform more complex calculations, the code amount will increase dramatically. For SQL, it often needs to use Java to achieve a complete business logic. However, since Java's flow processing function is not optimized for structured data objects, the code is usually cumbersome.

As a modern computing language, SPL avoids many weaknesses of SQL, and the code amount is very small whether it is used to perform a simple or complex calculation. Moreover, the flow processing function of SPL is optimized for structured data objects, the related code is simpler.

Reading of data source

Private data format

SQLite has its own (private) data format, i.e., the database file, and a database file contains multiple tables. The database file can be created with the command of command line or the API (JDBC URL), but not be created with SQL code directly. For example, the following code is to create a new SQLite database file in Java. If there is already a database file, open it directly.

Connection conn = DriverManager.getConnection("jdbc:sqlite: d:/ex1.db");

SPL is provided with two private data formats, one is the composite table file dedicated to big data high-performance computing, this data format is not the one we focus on in this article; the other is the bin file, which is dedicated to small- and medium-data amount and embedded calculation. SPL does not need to build a database, nor does it have the concept of database, and each bin file corresponds to one table.

External data source

SQLite supports only the text data files, including the txt file separated by TAB, the csv file separated by comma, and it can also self-define other separators.

SPL supports multiple data sources, including:

  • JDBC (i.e., all RDBs)
  • csv, TXT, JSON, XML, Excel
  • HBase, HDFS, Hive, Spark
  • Restful, WebService, Webcrawl
  • Elasticsearch, MongoDB, Kafka, R2dbc, FTP
  • Cassandra, DynamoDB, influxDB, Redis, SAP

All these data sources can be accessed directly, so it is very convenient. For other data sources that are not listed above, SPL provides interface specification, and thus, as long as such data sources are exported as the structured data objects of SPL according to the specification, subsequent calculations can be performed.

Access the private data format

SQLite accesses the tables in database file through SQL and generates the SQL result set, i.e., the structured data objects in memory. Generally, reading the table is accompanied by the query:

select * from Orders where Amount>2000 and Amount<=3000

SPL reads the bin file through its function T or import, and generates a table sequence (equivalent to SQL result set). The equivalent code:

T("d:/Orders.btx").select(Amount>2000 && Amount<=3000)

Read the csv file

Three steps are required for SQLite to read the csv file. Step 1: enter the management command line.

#On the Windows command line, open or create a new database named dbname
sqlite3.exe new.db

Step 2: create a new database table on the management command line and import the data:

#When the field types are all strings, there is no need to manually create a table.
create table Orders(
OrderID int,
Client varchar(100),
SellerID int,
Amount float,
OrderDate date
.import --csv d:/Orders.csv Orders

Step 3: embed SQL query code in Java code:

select * from Orders where Amount>2000 and Amount<=3000 

The above method requires manual operation, and is rather cumbersome. Alternatively, reading csv file can be achieved entirely in Java code. Likewise, three steps are required. Step 1: create a text file (such as importOrders.txt) in the operating system, with the same flow with step 2 described above.

Step 2: call the operating system command line in Java code, that is, use Runtime.getRuntime().exec(…) to execute the command:

sqlite3.exe dbname.db < importOrders.txt

Step 3 is the same with step 3 described above.

However, this method requires additional execution permissions from the operating system, and hence the security risk is relatively high. If you want to reduce security risk, you have to use the loop statement to execute SQL, and insert the records one by one, resulting in long code, which is troublesome to modify.

In contrast, SPL only needs one step to read csv file. Embed the following code in Java:

T("d:/Orders.csv").select(Amount>2000 && Amount<=3000)

The function T can read both the bin file and csv file, and generate a table sequence. The table sequence is the structured data object of SPL, which is equivalent to SQL’s result set. When SPL imports the data, the data type can be parsed automatically, and there is no need to specify manually. The whole process does not require manual operation, and the permission requirements are low, and the code is short. Therefore, SPL is much more convenient than SQLite.

If the format of csv file is not standardized, you can also use the function import to specify the separator, field type, number of skipped lines, and deal with the escape character, quotation marks, parentheses, etc. Thus, SPL has much richer functions than that of SQLite.

For data sources other than csv, SQLite does not provide any convenient import method, the data-fetching process is thus very cumbersome, while SPL supports a variety of data sources, and the fetching process is simple and convenient, which can significantly improve development efficiency.

Read the multi-layer structured data

Json and XML are commonly used multi-layer structured data. SQLite is simple in framework, and able to calculate Json, and sometimes assumes the function for computing Json file/RESTful. However, SQLite cannot directly parse Json file/RESTful, and it needs to use Java to hard code, or use the library of third-party to piece together an insert statement to insert into the data table, and hence the code is very cumbersome, which will not be shown here.

Likewise, SPL is simple in framework, and can directly parse Json files/RESTful, the code is therefore greatly simplified, for example:

json(file("d:/xml/emp_orders.json").read()).select(Amount>2000 && Amount<=3000)
json(httpfile("").read()).select(Amount>2000 && Amount<=3000)

SQLite does not have the ability to calculate XML file, nor can it directly parse XML file or WebService, and instead, it can only parse and calculate with the help of external Java code, which is very cumbersome.

SPL can read XML files directly:

2=xml(A1,"xml/row")>1000 && Amount<=2000 && like@c(Client,"*business*"))

SPL can also easily read WebService:

2=ws_call(A1,"RQWebService":"RQWebServiceSoap":"getEmp_orders")>1000 && Amount<=2000 && like@c(Client,"*business*"))

SPL’s table sequence supports multi-layer structured data, and is easier to express Json/XML than the two-dimensional structure of SQL database table, and the calculation code is also simpler. This part of content is not what we focus on in this article, so we don’t discuss it here.

Cross-source computing

Since SQLite only supports one external data source, i.e., the csv file, the cross-source computing is actually the calculation between csv file and database table such as the association, intersection, and subquery. SQL is a closed computing language, and cannot directly calculate the data outside the database, and hence there is a need to import the data into the database, and only turning the csv file into a database table can the cross-source computing be performed. From the previous code, it can be seen that the process of importing data into database is more troublesome, for the reason that using SQL alone does not work, you have to use Java or command line.

SPL has good openness and can directly calculate multiple data sources, and cross-source computing can be easily performed between different data sources. For example, the following code is to left join between the csv file and RESTful:

=join@1(json(httpfile("").read()):o,SellerId; T("d:/Emp.csv"):e,EId)

It is easier to read in multi-step form:


In short, the cross-source computing can be implemented by using SPL alone, without the help of Java or command line. Compared with SQL, SPL code is shorter and easier to understand, and the development efficiency of SPL is much higher.

Data calculation

Basic calculations

Common basic calculations supported in SQLite:

#Select some of fields
select Client,Amount from Orders
#Fuzzy query
select * from Orders where Amount>1000 and Client like '%s%'
select * from Orders order by Client, Amount desc
select distinct Client from Orders
#Grouping and aggregating
select strftime('%Y',OrderDate) as y, Client, sum(Amount) as amt from Orders group by strftime('%Y', OrderDate), Client having amt>3000
select * from Orders9 where Amount>3000
select * from Orders9 where strftime('%Y',OrderDate)='2009';
select * from (select strftime('%Y',OrderDate) as y, Client, sum(Amount) as amt from Orders group by strftime('%Y',OrderDate), Client) where  Client like '%s%';

Common basic calculations implemented in SPL:

AB,Amount)// Select some of fields>1000 && like(Client,\"*s*\"))// Fuzzy query
3= Orders.sort(Client,-Amount)// Sorting
4= Deduplication
5=Orders.groups(year(OrderDate):y,Client;sum(Amount):amt).select(amt>3000)// Grouping and aggregating
6=[>3000),].union()// Union
7=Orders.groups(year(OrderDate):y,Client;sum(Amount):amt).select(like(Client,\"*s*\"))// Subquery

For basic calculations, both SQLite and SPL have rich functions, and it is not difficult to learn and master each of them.

Comprehensive calculation

Let's start with two simple examples. Example 1: calculate the TopN, SQLite code:

select * from Orders order by Amount limit 3

This code is short, but since SQLite does not support the top function, nor does it support the pseudo sequence number column (like Oracle's rownum), it has to use the limit function instead to achieve the calculation objective. Consequently, it is not intuitive to understand this code.
SPL code:;Amount)

SPL supports the real row number, as well as the top function, making the code shorter and easier to understand.

Example 2: calculate the in-group TopN, SQLite code:

select * from (select *, row_number() over (partition by Client order by Amount) as row_number from Orders) where row_number<=3

This SQL code is slightly complex, mainly because SQLite does not support the top function, yet the limit function limits only the total number of records, not the number of records in each group. In this case, the in-group row number must be used. Moreover, SQLite does not have the real row number field, and hence it needs to generate the pseudo column using window function first, the code is thus complex. In fact, the lack of row number is not a problem of SQLite itself, but a problem of entire SQL system, for other databases do not have row number as well.

SPL code for calculating in-group TopN:;Amount)).conj()

We can see that the SPL code is much simpler and easier to understand. First, group by Client, then calculate the Top3 for each group (represented by the symbol ~), and finally concatenate the calculation results of every group. The reason why SPL code is simpler is apparently because that it directly supports the Top function, but in essence, SPL has the real row number field, in other words, SPL supports the ordered set. Another reason is that the set orientation operations in SPL is more thorough, which can implement real grouping (grouping only, not aggregating). In this way, the in-group data can be calculated intuitively. In contrast, SQL cannot support a thorough set orientation, and hence the aggregating operation must be attached when grouping. As a result, the in-group data cannot be calculated intuitively, SQL has to use the window function instead.

Let's look at some more complicated calculation examples. Example 1: calculate the maximum number of days that a stock keeps rising, SQLite code:

select max(continuousdays)
from (
select count(*) continuousdays
from (
select sum(risingflag) over (order by day) norisingdays
from (
select day, case when price>
lag(price) over (order by day) then 0 else 1 end risingflag from tbl
) group by norisingdays

As you can see, this code is long and complex, for the reason that it is difficult for SQL to directly express the concept of consecutive rising, it has to achieve the calculation objective in an indirect way, that is, first calculate the cumulative number of days that a stock does not rise, and then use the cumulative result to calculate the number of days we want. This method requires strong skill, and it is difficult to write and understand the code. Moreover, it is difficult to debug SQL, resulting in a difficulty in maintenance.

SPL code:


This code is much simpler, because it is easy for SPL to express the concept of consecutive rising, which takes only two steps: sort by date; traverse the records, if it is found that the stock rises, the counter is incremented by 1. This code uses both the loop function max and the ordered set. In this code, [-1] represents the previous record, which is the representation method of the relative position, and price[-1] represents the stock price of the previous trading day, which is more intuitive than moving the whole row (lag function).

Example 2: find out the top n customers whose cumulative sales account for half of the total sales, and sort them by sales in descending order. SQLite code:

with A as
(select client,amount,row_number() over (order by amount) ranknumber
from sales)
select client,amount
from (select client,amount,sum(amount) over (order by ranknumber) acc
from A)
where acc>(select sum(amount)/2 from sales)
order by amount des

This code is complex, for the reason that it is difficult for SQL to handle the customer who is exactly in the critical position. To solve this problem, it has to use an indirect way, that is, first calculate the cumulative sales in ascending order, and then find out the customers whose cumulative sales is not in the second half of calculation result. This method has the disadvantages of requiring strong skill, long code and difficult to debug.
SPL code:

2=sales.sort(amount:-1)/Sort the sales in descending order, which can be done in SQL
3=A2.cumulate(amount)/Get the sequence of cumulative amounts
4=A3.m(-1)/2/The final accumulative amount is the total
5=A3.pselect(~>=A4)/Get the position of the records where the amount takes up at least half of the total
6=A2(to(A5))/Get the records by A5’s position

We can see that this code is relatively simple. The set orientation in SPL is more thorough, as it can use the variables to express the set conveniently, and continue to use variables to reference the set in the subsequent calculation, therefore, SPL is particularly suitable for multi-step calculation. Dividing a big problem into multiple small steps makes it easy to achieve complex calculation objective, and the code is not only short, but easy to understand. In addition, multi-step calculation naturally supports debugging, which virtually improves development efficiency.

The above examples indicate that SQL is only suitable for relatively simple calculations, while SPL supports the ordered set, supports the set orientation more thorough, and is able to implement both the simple and complex calculations very well. Moreover, SPL supports discrete records, and can use the dot notation to reference the associated table intuitively, thereby simplifying complex association calculations.

Date and string functions. SQLite provides the date and string functions, such as: get the date before or after a specified date, truncate the string, but the number of functions provided in SQLite is not rich, and it does not provide many commonly used functions like getting the date before or after a specified number of quarters, getting the date after N workdays.

SPL offers richer date and string functions, far exceeding SOLite in both number and functionality. For example:

Get the date before or after quarters: elapse@q("2020-02-27",-3)            //Return 2019-05-27
Get the date after N workdays: workday(date("2022-01-01"),25)	          //Return 2022-02-04
Check whether a string all consists of numbers: isdigit("12345")	          //Return true
Get a string before a specified substring: substr@l("abCDcdef","cd")	  //Return abCD
Split a string into an array of substrings by vertical bar: "aa|bb|cc".split("|")  //Return ["aa","bb","cc"]

SPL also supports many other functions, for example, get a date before or after years, get which quarter the date belongs to, split a string according to a regular expression, get the where or select part of a SQL statement, get words from a string, split HTML by the specific marker, etc.

SPL function options and cascaded parameter

It is worth mentioning that in order to further improve the development efficiency, SPL also provides the unique function syntax.

When there are a large number of functions with similar functionality, SQL can only use different names or parameters to distinguish, which is not convenient in use, whereas SPL provides very unique function option, which allow the functions with similar functionality to share one function name, and their difference can be distinguished just by an option. For example, the basic function of the select function is to filter, if you only want to filter out the first record that meets the conditions, you can use the option @1:>1000)

When using the binary search to quickly filter the sorted data, you can use the option @b:>1000)

The options of function can also be used in a combined way, for example:>1000)

Some parameters of structured operation function are very complex. SQL needs to use various keywords to separate the parameters of one statement into multiple groups, but this will involve many keywords and also make the statement structure inconsistent. SPL uses the cascaded parameter, which simplifies the expression of complex parameter. It can divide the parameters into three layers from high to low by semicolons, commas and colons:

join(Orders:o,SellerId ; Employees:e,EId)

Data persistence

SQLite implements the data persistence by directly processing the database table, there are three types of processing methods: addition, modification, and deletion:

insert into Orders values(201,'DSL',10,2000.0,'2019-01-01')
update Orders set Client='IBM' where orderID=201
delete from Orders where orderID=201

Batch adding of the new data is a common requirement. SQLite code:

insert into Orders(OrderID,Client,SellerID,Amount,OrderDate) 
select 201,'DSL',10,2000.0,'2019-01-01'
union all
select 202,'IBM',10,3000.0,'2019-01-01'

There are two methods to implement the persistence of SPL’s bin file. One is to process the bin file directly, which can achieve the addition of records (or batch addition). For example:


The other is to process the in-memory table sequence first, and then overwrite it to the original bin file. This method can achieve the addition, modification and deletion of records, and the code is not much different from the above code, the only thing you need to do is to modify the export@ab in A3 to export@b, @a means appending, @b means bin file format. Although the performance of this method is not as good as that of SQLite, the amount of data in an embedded calculation is generally small, and the overwriting speed is usually acceptable.

SPL composite table supports high-performance addition, deletion and modification of records, and is suitable for high-performance computing with large data amount, which is not the focus of this article.

SQLite supports only the persistence of database table, and does not support the persistence of other data sources including csv file. If you insist on persisting other data sources using SQLite, you have to hardcode in Java, or use the library of third-party, which will make the code very cumbersome.

In addition to supporting the persistence of bin file, SPL also supports the persistence of other data sources. Likewise, SPL uses the table sequence as the medium.

file("d:/Orders.csv").export@t(A2)		//csv file
file("d:/Orders.xlsx").xlsexport@t(A2)		//xls file
file("d:/Orders.json").write(json(A2))		//json file

In particular, SPL supports the persistence of any database. For example:

1=connect("orcl")/Connect to external oracle
2=T=A1.query("select * from salesR where SellerID=?",10)/Batch query table sequence T
3=NT=T.derive()/Copy and generate the new table sequence NT
4=NT.field("SELLERID",9)/Batch modify new table sequence
5=A1.update(NT:T,sales;ORDERID)/Persist the modification into the database

The persistence of database still uses the table sequence as medium. The advantages of this approach are very obvious: the function update automatically compares the table sequence before and after modification (addition, modification and deletion), which can easily implement the persistence of batch data.

Flow processing

SQLite lacks the flow processing capability and cannot achieve a complete business logic. To solve this problem, it has to adopt an indirect way, that is, first convert SQL’s data objects to Java’s resultSet/List<EntityBean>, then use the for/if statements to process the flow, and finally convert back to SQL’s data objects. As a result, the code is very cumbersome. When the business logic is complicated, it needs to convert between SQL objects and Java objects many times, which is more troublesome.

SPL provides the flow control statements. Using such statements together with built-in structured data objects makes it easy to achieve various business logics.

Branch structure:

3if T.AMOUNT>10000=T.BONUS=T.AMOUNT*0.05
4else if T.AMOUNT>=5000 && T.AMOUNT<10000=T.BONUS=T.AMOUNT*0.03
5else if T.AMOUNT>=2000 && T.AMOUNT<5000=T.BONUS=T.AMOUNT*0.02

Loop structure:

2=T=db.query@x("select * from sales where SellerID=? order by OrderDate",9)
4=A3.CLIENT=CONCAT(LEFT(A3.CLIENT,4), "co.,ltd.")

In addition to the above code, SPL offers more flow processing functions for the structured data, which can further improve the development efficiency, such as: taking a batch of records rather than one record in each round of loop; looping one round when the value of a certain field changes.

Application framework

Java integration

SQLite provides the JDBC interface, which can be easily integrated in Java code:

Connection conn = DriverManager.getConnection("jdbc:sqlite: d:/ex1.db");
Statement statement = conn.createStatement();
ResultSet results = statement.executeQuery("select * from Orders where Amount>1000 and Client like '%s%'");

SQLite's core is written in C language. Although it can be integrated in Java applications, it cannot be seamlessly integrated. Since it will take extra time to exchange the data with Java main program, the performance will be significantly decreased when the data amount is large or the interaction is frequent. Also, since the core is the C program, SQLite will damage the consistency and robustness of the Java framework to a certain extent.

SPL also provides the JDBC interface, and the integration method is similar to SQLite:

Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = conn.createStatement();
ResultSet result = statement.executeQuery("=T(\"D:/Orders.csv\").select(Amount>1000 && like(Client,\"*s*\"))");

Unlike SQLite, SPL is written entirely in Java and can be seamlessly integrated in Java application, which is characterized by strong consistency in framework, more stable system, without having to take extra time to exchange the data, and its performance is more secure.

External placement of business logic

The general RDB supports the stored procedure, and can place the business logic outside a Java program, but SQLite does not support the stored procedure, and usually uses Java's flow processing capabilities to achieve a complete business logic, so it cannot place the business logic outside the Java program. Failure to place the business logic outside the Java code will result in excessive coupling between them.

SPL can achieve a complete business logic, and the business logic (or complex, frequently changed computing code) can be saved as script file and placed outside the Java program. The Java program references the name of script file in the form of a stored procedure:

Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
CallableStatement statement = conn.prepareCall("{call queryOrders()}");

The external SPL script can not only effectively reduce the system coupling, but boasts the characteristics of hot swap. SPL code is an interpreted code, and can run directly after modification without the need to compile, nor does it need to restart the Java application. Thus, the maintenance cost can be effectively reduced.

In-memory calculation

SQLite can be used for in-memory calculation. Generally, the data is loaded into memory when the application is started (URL changes):

Connection conn= DriverManager.getConnection("jdbc:sqlite:file::memory:?cache=shared");
Statement st =conn.createStatement();
st.execute("restore from d:/ex1");

When there is a need to perform business calculation, the previously loaded in-memory data can be used directly:

Connection conn= DriverManager.getConnection("jdbc:sqlite:file::memory:?cache=shared");
Statement statement = conn.createStatement();
ResultSet results = statement.executeQuery("select OrderID,Client,Amount,Name,Gender Dept from Orders left join Employees on Orders.SellerId=Empoyees.EId");

Likewise, SPL can be used for in-memory calculation. When the application starts, execute the script, loading the data into memory (URL doesn't change):

1= connect("orcl").query@x("select OrderID,Client,SellerID,OrderDate,Amount from orders order by OrderID")

When there is a need to perform business calculation, the previously loaded in-memory data can be used directly:

Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = conn.createStatement();
ResultSet result = statement.executeQuery("=join@1(Orders:O,SellerId; Employees:E,EId).new(O.OrderID, O.Client,O.Amount,E.Name,E.Gender,E.Dept)");

If the association calculation occurs frequently, the performance will inevitably decline, but if we use a wide table, it will take up too much memory space, and there is no good way for SQL to solve this problem. Fortunately, SPL has the unique pre-association technology, and can greatly improve the performance of association calculation without occupying extra memory space. Moreover, SPL is provided with more in-memory computing technologies and generally performs much better than SQLite, including the parallel computing, pointer-style reuse, in-memory compression etc.

In short, SQLite can be easily embedded in Java, but the loading of data source is cumbersome, the computing capability is limited, the business logic cannot be achieved independently, and there are many weaknesses in the framework. In contrast, SPL is also easy to be embedded in Java, and directly supports more data sources, and is more powerful in computing capability, convenient in flow processing, and able to achieve the business logic independently. Moreover, SPL provides a variety of means to optimize the framework, and its code can be placed either outside the Java or inside the Java. Furthermore, SPL supports interpreted-execution and hot swap, and has the ability to perform high-performance in-memory calculation.

Leave a Reply