SPL concepts for beginners

SPL concepts for beginners

As a programming language, SPL has some inherent characteristics. If you do not understand them, you may not have a problem in hands-on learning, but it is difficult to fully utilize its functionality in architecture design or coding. Here we attempt to establish a basic SPL conceptual framework for beginners, and you can refer to relevant documents for more information based on actual needs.

These contents are not too few and require several hours of careful reading and understanding.

I Application Structure

Firstly, it is necessary to understand the operating environment and application structure of SPL.

1. Script and Development

SPL is an interpreted executing programming language, and the code is usually not large and is commonly referred to as a script.

SPL scripts are generally submitted as files with the extension of .splx, not in text format. SPL also supports script files in text format with the extension .spl.

SPL provides a specialized IDE for writing SPL scripts, which supports regular debugging functions. IDE is a client program that can run on Windows, Linux, and Mac.

2. Environment and Integration

SPL software can be divided into three parts: IDE, JDBC, and Server. All are developed in pure Java and can run in any JVM environment with JDK1.8 or higher versions.

SPL JDBC is not an independent process, it consists of jars and can be imported into Java applications and provide standard JDBC interfaces to applications.

Comparison of JDBC drivers between SPL JDBC and traditional RDB JDBC:

1) SPL JDBC includes a complete computing library, which can provide computing services independently and no longer rely on a server to work.

2) The accepted statement is SPL instead of SQL, and executing an .splx script is similar to using SQL to execute a stored procedure in RDB.

SPL Server is an independent process that can provide external services based on the HTTP protocol, which can be invoked by non-java applications. It can also be accessed and invoked by SPL JDBC.

Multiple SPL servers can form a cluster. Many big data solutions emphasize the role of clusters, but SPL's multiple practices have shown that a single machine can solve the vast majority of tasks. Unless there is a high concurrency requirement, clusters are rarely used, and beginners do not need to pay attention to clusters.

3. Data sources and external libraries

SPL supports diverse data sources, and any data source with a Java interface can be accessed and computed.

SPL has built-in text files, Excel files, JDBC driven relational databases, and HTTP/RESTful access interfaces, allowing direct access to these data sources.

SPL provides various external libraries for other data sources. After importing the corresponding external libraries, the corresponding data sources can be accessed in SPL scripts, and the specific read and write capabilities are determined by the data source itself. SPL has already provided corresponding external libraries for common data sources in the industry .

SPL abstracts data sources into two forms: table sequences and cursors, table sequences will be read into memory at one time for processing; If the data source supports it, SPL can also use cursors to gradually read in data and process it.

SPL does not require all external data to be read into memory for calculation. Don't understand esProc as an in-memory computing engine. Of course, it can maintain a certain amount of memory data for use as an in-memory computing engine.

SPL does not have the concept of a "(data)base", and any accessible data source is logically equivalent, with only differences in the functionality and performance of the data source itself. esProc is not responsible for the management and security of the data itself, only for computation.

II Script and syntax

The basic logic and concepts of SPL code are similar to most programming languages and are very simple for experienced programmers, but there are still some SPL specific contents that need attention.

1. Variable

SPL can use regular variable names, but it is more recommended to directly use cells as variable names.

The cell names referenced in the expression will automatically transition during editing , but those in the string will not. When using $[...] to represent the string, the cells involved will also transition during editing.

SPL variables have genericity and do not need to declare data types, so what is calculated is what it is.

Variables can also be used temporarily in SPL expressions.

2. Function

SPL has a large number of syntax for function options to distinguish between slightly different running states of a function. Options can be used in combination, and special attention should be paid to the relevant options when understanding function functionalities.

replace (s, s1, s2) Replace substring
replace@1 (s, s1, s2) Replace only the first one
replace@c (s, s1, s2) None case sensitive during matching
replace@c1( s, s1, s2 ) None case sensitive during matching and only the first one is replaced

SPL uses cascade parameters to describe complex structured parameters, with parameters at each level separated by colons, commas, and semicolons. This is significantly different from using only commas to separate parameters in regular languages. When understanding function definitions, it is important to pay attention to the cascade parameter hierarchy.

if (a, b, c) 	Regular one layer parameters, comma
between (x, a: b)	Two-layer parameters, comma and colon
hash (xi,...; n) 	Two-layer parameters, comma and semicolon
case (x1: y1, x2: y2,...; y) 	Three-layer parameters

3. Object

SPL has the concept of object, and many functions are encapsulated on objects.

SPL does not have important object-oriented mechanisms for inheritance and overloading. Strictly speaking, SPL is not an object-oriented language. In this sense, SPL is simpler than Java or even Python. The main function of SPL objects is to encapsulate a series of related functions.

4. Lambda Syntax

SPL supports Lambda syntax, and can be said to be a functional language in a sense (it also supports process).

SPL does not support defining variable names for Lambda syntax, but instead uses fixed symbols to represent variables in Lambda syntax. ~ represents the current member, # represents the sequence number of the current member, and ~[i] and ~[a: b] can also be used to represent adjacent members and subsets.

A.sum( ~*~ ) 		Calculate the sum of squares
A.sum( if(#%2==0,~) )	Calculates the sum of members in even positions
A.max( ~-~[-1] )  		Calculate the maximum difference value 
A.( ~[-1,1].avg() ) 		Calculate moving average

Whether a certain parameter can be written in Lamdba syntax depends on the function itself, and attention should be paid when understanding the function definition.

Refer to Lambda syntax in SPL

5. Macro

SPL is a dynamic language that can use macros to temporarily change statements during execution.

Like SQL, when it comes to table and field names in SPL statements, they can mostly be written directly without the need to be written in strings, such as:

file("employee.txt").import@t( ID, Name, Gender, Department )

The ID, Name, Gender, and Department fields can be read from employee.txt to form a data table, making coding more convenient and readable.

However, if you want the parameters of the import function to change (possibly changing the fields to be read based on the code context), it is not very convenient to describe it. Write it as

file("employee.txt").impport@t( "ID, Name, Gender, Department" )

The import function will process the entire 'ID, Name, Gender, Department' as a field name, which is obviously not what we want.

Using macros can solve this problem:


The essence of a macro is a string, which can be concatenated into an executable statement to obtain a new executable statement.

Refer to Macro in SPL

III Data object

There are three most important data objects in SPL, which are the foundation for processing data with SPL.

1. Sequence

A sequence is an ordered set in memory.

The sets in SPL are all ordered, which means a sequence member can be accessed by its sequence number. The members have a clear order in the set, which is consistent with arrays in Java.

The sequence of SPL has generalization, and it is not required that the sequence members have the same data type. The members of the sequence can be of any data type, and the following are legal sequences:


Specifically, when looping through set members, you can directly write

for A
Members of A can be directly referenced in the loop body

It is not recommended to write it as

for i,1,A.len()
Using A(i) to refer to the members of A within the loop

Don’t write it as

for i<=A.len()
Using A(i) to refer to the members of A within the loop

Unlike SQL, SPL provides a data object without data structures, i.e., sequence, which can only be represented as a single field table in SQL.

The aggregation functions of SPL are often designed for sequences, such as filtering, aggregation, grouping, and other operations, without necessarily targeting structured data tables.

Sequences are more fundamental data objects than data tables, and learning SPL requires getting used to using lighter sequences instead of always generating structured data tables.

2. Table sequence

A table sequence is a data table in memory.

SPL inherits the concepts of records, fields, and data tables in SQL, and a table sequence can be understood as a set of records with the same data structure.

The sets in SPL are always ordered, and the table sequence is no exception as a set of records, hence it is called a table sequence.

A table sequence is also a sequence, and members can also be accessed by sequence numbers. It is recommended to use the aforementioned syntax for loops of table sequence members.

The table sequence of SPL also has generalization, and the data structure of records in the table sequence requires to be the same, but there are no restrictions on the values of fields. The values of the same field for different records in the table sequence can be of different data types (although it is not common).

Specifically, the field values can even be another table sequence. This way, it is easy to represent JSON style multi-layer data structures in SPL.

A SPL record is also a type of data object that can be extracted from the table sequence `and used for operations. This is different from the concept of SQL, which does not have a record data type. A single record is actually a data table with only one record.

SPL also allows records in the table sequence to form a new set, and the sequence composed of records is called a record sequence.

Specifically, a subset filtered out by certain criteria from a certain table sequence (the WHERE operation in SQL, the select function in SPL) is a record sequence. The member records that make up this record sequence are the same object as the records in the original table sequence. Changing the field values in these records will cause the corresponding record values in the original table sequence to also change. SPL will not copy these records, which can achieve better computational performance and occupy less memory space. This is very different from SQL, as the result set of the WHERE operation in SQL has nothing to do with the original data table and is a new record that has been replicated.

Record sequence also support genericity, and the records that make up a record sequence can come from different table sequences. That is, records with different data structures may be put together for operation, however this situation is not common.

In SPL, it is common for records to exist separately or form a set after being separated from the table sequence and participate in operations, known as discreteness. This is one of the essential differences between SPL and SQL. This is actually consistent with Java's object referencing mechanism, and Java programmers can easily understand this data type.

In SPL, the field values of a record are allowed to be records from another table sequence, which can achieve association and reference between multiple tables. such as

E.switch( DeptID, D:ID) 	Change the DeptID field value in Table E to the corresponding record in Table D
D.switch( Manager, E:ID ) 	Change the value of the Manager field in Table D to the corresponding record in Table E
E.select( DeptId.Manager.Gender=="Female" )	 Now it is possible to directly reference the field values of records stored in the field values

Similarly, field values can also be another record sequence, which can achieve the structure of the primary and sub tables.

D.derive( E.select( DeptID==D.ID ):Employee )  	Add the Employee field to the D table, the value is a record sequence composed of the department’s employee records
D.select( Employee.len()>10 ) 		Use this field whose value is a record sequence

This type of inter table reference relationship achieved through records and record sequences is common in SPL and not found in SQL, but it is easy to achieve this effect in Java.

3. Cursor

External storage is not as suitable for random access as memory, and often can only provide streaming access mechanisms. Therefore, calculation schemes need to be designed under this premise. There are also many operations that do not require all source data to be loaded into memory, as long as the data is gradually read in for cumulative calculations, such as summation. This allows for processing larger amounts of data with smaller memory.

In SPL, data objects that can be streamed in are abstracted as cursors, and the basic conceptual features of SPL cursors are similar to database cursors.
Like database cursors, SPL cursors are also unidirectional and can only read data forwards until all data is retrieved, and cannot be rolled back.

The data read by SPL cursors is usually organized into table sequences or sequences (table sequences are more common), unlike SQL, which can only read one record at a time, SPL provides a method for batch reading data from cursors.

cs.fetch( 100 ) Read 100 records to form a table sequence
cs.fetch( ;TradeDate<=date("2022-12-31") ) Read records that consistently satisfy the conditions (note that is not to filter all records by the critera)
cs.fetch( ;UserID ) Read the records until the UserID field change

The latter reading method is very useful in many scenarios and is a unique mechanism of SPL.

For debugging convenience, SPL also provides options

cs.fetch@0( 1 ) reads one record, but the cursor position does not move. You can still read this record again next time.

This way, you can first check if the data in the cursor is normal without disrupting the overall code execution.

The SPL cursor not only has a reading function, but can also be used for additional operations, such as

cs.select( ... ) 	Filter
cs.groups( ... ) 	Group

The operation functions on a cursor are very similar in syntax to the corresponding functions on a table sequence (record sequence), ensuring that the codes for in-memory and external storage operations are as consistent as possible. But it should still be noted that these two are not equal.

The operation functions on the cursor can be divided into two categories. One is called delayed calculation. When executing such functions, it is just a mark and does not actually traverse the cursor. Another type is called immediate computation, which essentially traverses the cursor and calculates the result.

cs2=cs.select( ... )	 Delayed calculation, only registering on the cursor that there is a filtering action, and the cursor will not be immediately traversed, returning a new cursor
cs2.groups(...) 	 Immediate computation, performs traversal of the cursor and returns the calculation result. Delayed calculations previously registered on the cursor will also be executed.

This is different from table sequence (record sequence), where the operation functions on the table sequence (record sequence) will immediately execute and return the corresponding results.

Cursors can only traverse once and become invalid after completion, unlike in-memory table sequences that can be calculated over and over. Some cursor-based operations require ordered data, such as intersection, union, and difference operations of sets.

SPL provides a mechanism for multi-purpose traversal, which can calculate multiple results in a single traversal process.

SPL also provides a unique multi-cursor that can split data into multiple segments (related to storage schemes) and traverse them in parallel, fully utilizing the concurrency capabilities of multi-core CPUs and SSDs. Database cursors do not have direct parallel capabilities, making writing parallel code very difficult.

IV Understanding operations

SPL provides a rich class library for structured data computing, including conventional set operations such as intersection, union, and subtraction, as well as filtering, grouping, and join of structured data. Some of these operations have a unique understanding and style of SPL.

1. Loop function

Most operations and even processing actions on sets (sequences/table sequences/record sequences) can be solved using loop functions without the need for complex loop statements. This way, the writing is simple and the performance is also better

p=directory("*.csv") 	List all CSV files in the directory
p.conj( file(~).import@tc() ) 	Read these files and concatenate them

Develop this good habit.

Using the formula e=1+1/1!+1/2!+1/3!+… to calculate the base e of natural logarithm (use the first 20 items)


Understanding this statement is of great help in understanding the loop calculation logic of SPL.

The Lambda syntax mentioned earlier can be used in loop functions. In the Lamdba syntax of table sequences and record sequences, field names can be directly referenced without the need to write ~.

2. Position and alignment

SPL provides many position-related functions.

T.pmax(age)  returns the sequence number of the record with the highest age field value in the table sequence T.
T.pselect@a( age>50 ) returns the sequence numbers of all records with age>50 in table sequence T.

Making good use of it can make the code more concise.

S=T.ptop( 10; -Price )  The positions of the highest 10 days stock price
T.calc( S, Price-Pirce[-1] )  The price increase in those 10 days

And the function of alignment by position, which is equivalent to sorting data in a specified order

T.align( 31, Day ) assigns transaction records to 31 days of the month, and fills in null for dates without transactions to ensure that the result is 31 members.

T.align( ["Sun","Mon","Tue","Web","Thu","Fri","Sat"], WeekDay )

3. Understanding aggregation

Unlike SQL, aggregation operations in SPL are not limited to operations such as SUM/COUNT/MAX/MIN that return a single numerical value. Any smaller scale values calculated from a set can be considered an aggregation calculation.

The result of an aggregation operation may be a small set, such as top, which returns the top N members.

The result of an aggregation operation may also be an object, such as maxp, that returns the record where the maximum value is located.

All aggregation calculations can be used for grouping.

Some examples:

T.top( -3, salary )	 Top three highest salaries
T.top( -3; salary )	 Top three employees with the highest salary
T.maxp( age ) 	The oldest employee
T.groups( Dept; maxp( age ) ) 	The oldest employee in each department

These aggregation operations can also be applied to grouping and aggregation on cursors.

Programmers can also use the iterate function to customize aggregation operations.

4. Grouping

SQL grouping is accompanied by mandatory aggregation for the subsets of the grouping, while SPL does not have this requirement. Grouping and aggregation can be divided into two steps, and the subsets of the grouping can be retained to continue participating in the calculation.

T.group( Birthday ).select( ~.len()>1 ).conj()	Someone with the same birthday as others

SQL only has one equivalence grouping, which is to group members with the same key value into the same group. SPL provides more grouping methods besides equivalence grouping:

Ordered grouping

T.sort( Date ).group@i( Price<Price[-1] ).max( ~.len() ) 	 Scan data in order, group when prices fall, and calculate the number of max consecutive rising days

Sequence number grouping

T.groups@n( Day; sum( Amount) )	 Calculate the daily sales revenue 

Alignment grouping

T.align@a( ["Male","Female"], Gender ) 		Group by gender to ensure two results

Equivalence grouping is a complete partition, where all original members are divided and only assigned to one grouping subset, without the appearance of empty grouping subsets. SPL can support incomplete grouping, allowing some original members to be discarded, allowing empty subsets of the grouping, and even achieving repeatable grouping, where certain original members are divided into multiple grouping subsets.

The de duplication operation (DISTINCT) is essentially grouping (without aggregation operation), which needs to be solved using grouping thinking.

5. Join

SQL defines JOIN as Cartesian products and then filtering, without distinguishing between equivalence and non-equivalence JOINs, and does not require correlation with primary keys. SPL also provides this free and flexible JOIN operation, but it is rarely used.

Equivalence join, which is the most common form of join operations, is a join that filters based on the equivalence of the corresponding fields in the associated tables. SPL further divides equivalence join into two types:

1) Foreign key join: Equivalence join between a field in the fact table and the primary key in the dimension table

2) Primary key join: Equivalence join between the primary key of Table A and the primary key or partial primary key of Table B

Both joins have primary keys involved. Equivalence joins without primary key participation are mostly caused by errors in business logic and data.

The two types of join operations need to be implemented using completely different functions. When applying join operations, it is necessary to clearly distinguish the type of join and find the participating (logical) primary key.

T.switch( C, D:K )	Perform a foreign key join between the C field of Table T and the primary key K of Table D, and convert K into a reference of the record in Table D

T.join( C, D:K, x:F )		Perform a foreign key join between the C field of Table T and the primary key K of Table D, calculate x to the associated record of Table D as F field and join to table T. 

join(T1,K1; T2,K2 )		Perform a primary key join between the K1 field of Table T1 and the K2 field of Table T2, and return a table sequence with the associated T1 and T2 records as fields

The dimension table during foreign key join needs to be randomly accessed and can only be stored in memory in principle. SPL provides specialized functions for huge dimension tables that cannot be loaded in memory.

Cursor based primary key join operations require all participating cursors to be ordered for the primary key.

Reference: Explanation of JOIN Operations Series

V Storage and High Performance

High performance is strongly related to storage, and the first step in achieving high-performance computing is usually to design an appropriate storage schema.

1. Bin file

Bin file is a simple file format that uses the same function as accessing text files, but with different option.

When creating a bin file, there is no need to specify the data structure.

Bin file is equivalent to binarizing text files, eliminating the task of parsing data types, avoiding ambiguity, and being faster than text files.

Small data can be written as bin files, and temporarily generated data can also be written as bin files.

Bin files can support segmentation and create multiple cursors on them for parallel traversal.

Bin files can be appended, cannot be updated, and can only be overwritten.

SPL is not a transactional database. In order to ensure performance, all write operations are not subject to shared conflict management, and cannot be written concurrently or read while writing. Read operations can be concurrent.

2. Composite table

Composite table is a high-performance big data file format of SPL, which will be stored in columnar format by default. You can also specify the use of row-wise format with options. It is recommended for beginners to use the default columnar format.

A composite table is physically divided into data blocks, and each field in the columnar composite table occupies at least one block, making it unsuitable for storing too small amounts of data. The size of one block is usually 1M.

A composite table is a slightly complex file format that requires specifying a data structure when creating it. Specifically, a composite table is usually ordered for some key fields. When creating a composite table, it is necessary to specify the ordered field, called dimension field. For beginners, dimension field can be simply understood as the primary key of the table. The dimension fields must be listed first.

Beginners can simply understand composite tables as database tables that add the requirement for primary key to be ordered and are stored in columns, with each composite table file corresponding to a database table.

Composite tables can also support segmentation, and establish multiple cursors on them for parallel traversal.

The composite table can append data, but after appending, it is still necessary to maintain the order of the dimension fields. If the appending data cannot naturally meet the premise of being orderly to the dimension fields, it may lead to the reordering of new and existing data, and the time required for appending data will be higher. In order to avoid reordering a large amount of data, the composite table provides a complementary table mechanism, which can reduce the amount of data that needs to be sorted and rewritten during daily appending, and only perform a whole reordering after a long period of time. SPL also provides a multi-zone composite table mechanism to segment and store huge amounts of data, reducing the size of each reorder.

After maintaining the order of the dimension field, many high-performance algorithms can be implemented, which is equivalent to exchanging a one-time inefficient sorting for multiple subsequent efficient query calculations.

The composite table can be inserted and modified in small amounts, but it also greatly affects performance, so it is recommended that beginners can simply consider the composite table as unmodifiable Only used to store historical data that will no longer change, do not treat composite tables as tables in the database that can be inserted, deleted, or updated at will. Even if for appending, do not randomly append a small amount of data multiple times in application. Instead, add a large amount of data at one time in a specialized ETL process to maintain storage compactness.

Similarly, there is no shared conflict management for the write operation of composite tables. It is impossible to write concurrently or read while writing. Of course, reading and calculation can be concurrent.

3. Ordered storage for large grouping and association

Ordered storage is the key to composite tables, and selecting appropriate dimension field (ordered field) is the key to subsequent high-performance computation.

Conventional filtering and small result set grouping and aggregation do not require ordered storage, and a suitable ordered solution may improve filtering performance, but it is not critical.

The operations that rely on ordered storage mainly include DISTINCT statistics and primary key JOIN.

In principle, DISTINCT operation needs to keep all key values that have already been traversed in order to determine whether the newly traversed key values are new. When the number of key values is large, it will occupy a huge amount of memory and also consume a lot of comparison time. If the key values are ordered, it is not necessary to maintain the key values that have been traversed in history, but only to compare them with adjacent key values, which greatly reduces both space and time complexity.

The two(or more) tables joined on the primary key are usually very large. If they are unordered, they either occupy a huge amount of memory to read all the data in, or use the bilateral HASH heap algorithm to generate external storage buffers and process them in batches, which cannot obtain algorithms with good space and time complexity. If the data is ordered to the join keys, a simple merge algorithm works, which not only occupies very little memory but also has much less computational complexity than HASH JOIN.

In most cases, the key values involved in DISTINCT statistics and primary key joins are fields similar to user accounts, we won’t randomly specify some fields for such operations. Hence, it is sufficient to maintain one orderly copy of the account data, without the need to maintain multiple redundant copies.

The complex analysis of behavior events within an account can be seen as an extension of DISTINCT operation and primary key joins. These analysis tasks have several characteristics: 1. The account volume is very large, 2. The data volume of each account is not large, and 3. There is no correlation between accounts. For this type of operation, if the data is in order for the account, the data of one account can be subsequently read in using a cursor (which may be associated with multiple tables) (reviewing the cursor read mechanism mentioned earlier) for complex operations, and then read in and process the next account after completion. A typical scenario is the funnel analysis of e-commerce. SPL is very good at this type of operation.

These types of account analysis (including basic DISTINCT statistics and primary key joins) can also utilize multi cursor parallel computation, which requires data to be correctly segmented and ensuing that data from the same account is processed under the same cursor. SPL can use the option to specify that the first field is used as a segmentation field when creating a composite table, ensuring that records with the same segmented field value are not split into different segments when segmenting the composite table.


4. In-memory dimension table and index

Foreign key joins involve random access, and generally cannot be pre sorted and can only be read into memory. However, except for the account table (if regarded as a dimension table), most dimension tables are relatively small. The joins on the account table can be regarded as primary key join and solved through an orderly storage schema.

So, when designing a hardware environment, it is necessary to consider the dimensional table size involved in the computation. The hardware memory should be large enough to load the dimensional table, otherwise the computation will be difficult to perform efficiently. This method can be used to estimate the required memory capacity.

SPL can establish multiple indexes on in-memory dimension tables for quick positioning.

If memory allows, the required dimension tables can be loaded into memory and indexed at once when the machine starts. In the future, there will be no need to instant read and index during each operation, which is very effective for high-performance concurrent query business.

But in some scenarios, it is necessary to consider changes in the dimension table. The conventional approach is to generate multiple dimension tables by date in the database. After having multiple copies of a dimension table, it is usually impossible to read all of them into memory. In this case, it is necessary to read and index them separately during each operation, which cannot be pre loaded and prepared, and performance will be affected. SPL provides a mechanism for time dimension tables, which can only save the original dimension table and changed records of subsequent dimension tables, as well as the time of changes. This greatly reduces memory usage, as changes in the dimension table are rare events. A time dimension table can be like a regular dimension table, and can correctly find associated dimension table records based on the time recorded in the fact table when making foreign key joins.

5. External storage search and index

External storage indexes (including database indexes) are usually only effective for fixed point searches, which return a small amount of data for searches. They are not effective for most operations that require traversing the entire data, so do not expect too much from the index.

SPL provides external storage indexes for composite tables, enabling high-speed search.

Unlike traversal tasks, using row storage for search tasks has more advantages. The records that are found are read together at once, rather than being physically divided into multiple fields and read multiple times like columnar storage.

For account queries with high concurrency requirements, it is also necessary to sort the data by account to ensure that the data of the same account is stored together on the hard disk, so that the data of one account can be read in at one time without wasting the read content. However, indexes are usually logically ordered, and if they are physically out of order, it can still cause the hard drive to jump around and fail to achieve high performance.

Leave a Reply