# New association calculation methods of SPL

Association calculation in SPL - In-memory join” presents the classification of association calculations in SPL and the programming methods for in-memory join.

Association calculation in SPL - external storage join” presents the programming methods for external storage join.

This article will continue to present new association calculation methods of SPL, including the fjoin function and composite table cursor association & filtering mechanism for foreign key join, as well as the pjoin and new/news functions for primary key join.

When used in appropriate scenarios, these new methods can achieve better performance than those introduced in the previous two articles. However, the previous methods are simpler in concept and easier to understand, so beginners can start by learning and mastering the previous methods, and then learning new methods so as to reduce the amount of calculation and further improve performance when facing appropriate scenarios in practice.

Before reading this article, it is recommended to read the previous two articles to understand the classification of joins in SPL and familiarize yourself with the pattern of selecting algorithm for different join types. All such knowledge will be involved in this article and will not be further explained.

## Foreign key join

1. Unification of foreign key addressization and foreign key sequence-numberization

In the previous two articles, we introduced the foreign key addressization, which is to convert the foreign key field of fact table to the in-memory address of dimension table record, while the foreign key sequence-numberization is to convert the foreign key field of fact table to the position sequence number of dimension table record and obtain the dimension table record through sequence number during associating.

However, the two functions switch/join introduced previously are not unified in syntax in implementing foreign key addressization and foreign key sequence-numberization. Let’s review it through an example.

Take an order table (fact table) and an employee table (dimension table) as an example, and the two tables are associated on the employee number. Now we want to filter the order data by employee name and order date.

In order to clearly distinguish, we set the employee number of the order table as o_eid, and the employee number of the employee table as eid.

The code of using the switch function to implement foreign key addressization is roughly as follows:

If the field ‘num_eid’ of order table stores the position sequence number of employee table, the code of using switch to implement foreign key sequence-numberization is:

A2: the switch function uses a symbol ‘#’, which stands for sequence number association. In other words, the switch function needs to rely on special symbol to implement sequence number association, resulting in inconsistent syntax between foreign key sequence-numberization and foreign key addressization.

Now let’s take a look at the code of using the join function to implement foreign key addressization:

A2: join creates a new table. Besides the fact table field, a new field ‘fk_eid’ is joined to store the address of dimension table record, hereby implementing foreign key addressification. The symbol ‘~’ represents the dimension table record that can be associated.

In fact, the most common usage of join is to join the dimension table field ‘name’ onto the fact table, and the ‘name’ can be used directly in subsequent calculations, and ‘~:fk_eid’ can be omitted.

The code of using the join function to implement foreign key sequence-numberization is as follows:

The join function in A2 also uses a special symbol ‘#’, which means that join uses different syntax to implement foreign key addressization and foreign key sequence-numberization. This code omits ~:fk_eid and join the name field directly.

The fjoin function can implement syntax unification of foreign key addressization and foreign key sequence-numberization. Let’s first look at the code to implement foreign key addressization:

A1: read the order table into memory. B1: read the employee table into memory and define the primary key with index as eid;

A2: create a new table based on A1, and execute the expression B1.find(o_eid) on each row of A1, returning the record address of employee table; ~:fk_eid,name means to join employee table’s record address and name field onto the new table; the symbol ‘~’ represents the return value of expression.

Like the join function, fjoin recommends joining the dimension table data onto the fact table, so ~:fk_eid can also be omitted.

Now let’s look at the code of using fjoin to implement foreign key sequence-numberization:

In A2, the syntax is almost exactly the same as that for addressization except the expression is modified as taking the dimension table record by sequence number position.

From the examples above, it can be seen that fjoin covers the functionality of join function and can directly join dimension table record onto the fact table, which is a common usage of fjoin.

In addition, fjoin also covers the functionality of switch function, and implements foreign key addressization and foreign key sequence-numberization by directly assigning value to foreign key field. Let’s first look at the code to implement addressization:

A2: the expression finds the address of dimension table record by primary key and assigns it to the foreign key field of fact table, hereby implementing foreign key addressization.

Now let’s see the code to implement sequence-numberization:

A2: the expression finds the address of dimension table record by sequence number and assigns it to the foreign key field of fact table, hereby implementing sequence-numberization.

From the examples above, it can be seen that fjoin not only unifies foreign key addressization and foreign key sequence-numberization in syntax, but also covers the functionality of switch/join function.

However, there are still some differences between fjoin and switch/join.

First, switch will automatically judge whether there is an index on primary key of dimension table. If there is, the index will be used directly; if not, switch will create an index and manage by itself. By contrast, instead of providing this mechanism, fjoin essentially uses the find function to do assignment calculation, so the dimension table needs to create index on the primary key itself.

Second, switch directly modifies the original table, while fjoin generates a new table.

Third, fjoin supports multi-thread parallel computing, which can be implemented by just adding add @m to fjoin in the examples above, while switch and join don’t support multi-thread parallel computing.

2. Flexible use of expression

The biggest advantage of fjoin is to use expression to implement association, and the use of expression is very flexible.

For example, using fjoin expression can easily implement multi-field foreign key association, and the only thing we need to do is to change the expression to multi-field association.

Assume the primary key of a class table includes two fields: major number ‘mid’ and class number ‘cid’, now we want to associate the primary key of the class table with the ‘s_mid’ and ‘s_cid’ fields of the student table. The code is:

Using expressions can also implement extremely flexible foreign key association.

Assume the primary key ‘classid’ of the class table is composed of two integers: major number ‘mid’ and class number ‘cid’; the first integer is mid and the second two-digit integer is cid, that is classid=mid*100+cid. The student table uses two fields to store s_mid and s_cid respectively.

Now we want to implement foreign key association between the two tables, the code is:

In A2, using fjoin expression can easily implement such complex association.

3. The @i option of fjoin

The previous two articles present the @i option of switch/join function, which is used to retain only the record that the fact table can associate. The @d option, on the contrary, is used to discard the record that the fact table can associate.

fjoin also provides the @i option. If the return value of expression is false/null, the @i option will filter the corresponding record. For example, associate the order table with two dimension tables (employee table and customer table), and discard the order that cannot be associated. Now we want to group the orders by customer city ‘city’ and employee department ‘dept’ and aggregate the number and amount of orders of each group, the code is roughly as follows:

A3: fjoin is appended with @i, and there are two expressions separated by semicolon;

The first expression is to search for the record corresponding to the customer number in customer table; the return value will be null if a record cannot be associated.

The second expression is to search for the record corresponding to the employee number in employee table, and the return value will be null if a record cannot be associated.

The two expressions are in “and” relationship, which means that when none of the return values is null, the final value is true, or it is false.

fjoin retains the order record whose final value is true, and joins the city and dept fields of two tables at the same time.

A4: group and aggregate as required.

If we want to obtain result similar to @d, just add ‘!’ before the expression, so fjoin does not need @d option.

Still, let’s take the association between the order table and the two dimension tables as an example, now we only want to retain the order that cannot be associated with the two dimension tables, the code is:

A1, B1: since the two dimension tables are only used to filter the order table, irrelevant fields are no longer read;

A3: fjoin still uses two expressions separated by semicolon;

The first expression is to search for the record corresponding to the customer number in customer table. After adding ‘!’, the return value will be null if a record is associated.

The second expression is to search for the record corresponding to the employee number in employee table. After adding ‘!’, the return value will be null if a record is associated.

The two expressions are in “and” relationship, which means that when none of the return values is null, the final value is true, or it is false.

fjoin retains the order record whose final value is true.

A4: only the order that cannot be associated is retained at this point, so subsequent calculations can only use the field of the order table itself.

When a fact table is associated with multiple dimension tables, there may be a situation where some dimension tables need to use @i, while others need to use @d. Since the switch/join function can only use one option @i or @d at the same time, it is impossible to meet this requirement.

fjoin expression is very flexible in coding, making it easy to implement this requirement.

We still take the association between an order table and the two dimension tables as an example, now we want to find the order that can be associated with the customer table but cannot be associated with the employee table.

A3: fjoin still uses two expressions separated by semicolon;

The first expression is to search for the record corresponding to the customer number in customer table; the return value will be null if a record cannot be associated.

The second expression is to search for the record corresponding to the employee number in employee table. After adding ‘!’, the return value will be null if a record is associated.

The two expressions are in “and” relationship, which means that when none of the return values is null, the final value is true, or it is false.

fjoin retains the order record whose final value is true.

In the example above, multiple association and filtering conditions are in ‘and’ relationship. If a record does not meet any of the conditions, it will be filtered out. But sometimes, the ‘or’ relationship is also needed.

Suppose we want to associate three fields of an order table with three different customer tables. The primary key field of the three tables is, in turn, customer number ‘cid’, customer name ‘cname’ and contact ‘ccontact’. If a certain order record cannot be associated with any of the three customer tables, it will be discarded.

The code of using fjoin to implement this requirement is:

A5: the first three expressions are responsible for assigning value for association; the last expression is the filter condition; the ‘or’ calculation implements this requirement.

If using switch or join, it would need to associate all three foreign keys and then use the select function to filter out the order record that cannot be associated with any dimension table.

4. Composite table cursor association and filtering

fjoin@i means that the fact table not only needs to be associated but also needs to be filtered. If the fact table is a composite table cursor, association can be implemented during pre-cursor filtering. This method can achieve better performance and is called the composite table cursor association and filtering mechanism.

The code of using this mechanism to implement the situation described in the previous section is:

A2: the expression in the original fjoin@i is now executed during pre-cursor filtering.

The pre-cursor association and filtering mechanism is equivalent to forcibly adding @i to fjoin. If we only want to associate and don’t want to filter, the fjoin without option is still used in general.

The example of three customer tables in the previous section can also be implemented using this mechanism:

A4: write the expression of the original fjoin@i to the pre-cursor filtering condition.

The composite table cursor association and filtering mechanism is the same as fjoin@i in syntax, but it can achieve better computing performance. This is because the expression written in the pre-cursor filter condition of composite table is equivalent to forcibly adding the @i option, and its calculation is performed before the generation of cursor record. If the return value is null/false, the order record will not be generated, which can reduce the time of cursor to fetch data and generate record, and improve the performance.

In contrast, if using fjoin directly, the association and filtering will be performed after the generation of cursor record. Even if one order record does not meet condition, it is already generated.

Therefore, we can say that the composite table cursor association and filtering mechanism is the most common and recommended method for foreign key association of big data, while fjoin@i on the cursor is less used.

Note that the said mechanism is only applicable to composite table. Other types of cursors (bin file, text file, database, etc.) do not support pre-cursor filtering and cannot implement this mechanism.

## Primary key join

1. The pjoin function can implement the basic functionality of join/joinx.

For example, the employee table and manager table are associated on their respective primary keys, the code to calculate the sum of salary and allowance is roughly as follows:

A2: pjoin must take one table as the base table. Here it takes the table in A1 as base table.

If there is cursor in tables involved in pjoin association, it requires that any table participating in calculation should be in order by respective association fields, which is consistent with the requirement of joinx.

Similar to join/joinx, pjoin also supports inner join, left join, and full join.

The result obtained in the above example is the result of inner join. In addition, there are often many scenarios that require left join, which is coded roughly as follows:

In A2, B1:null means that any employee record that cannot be associated with manager table will be filled with null in the manager table field of association result.

If we only want to retain the employee record that cannot be associated, the code is:

In A2, only the primary key field is written after B1:null. In this case, the result will retain only the employee record that cannot be associated, and the fields of manager table will not appear in result set.

Since pjoin does not directly support right join, we need to change the right join to left join. For example, assume the employee and the manager are associated through right join, we need to change it to left join between the manager and the employee and then use pjoin to implement association.

The code of using pjoin to implement full join is roughly as follows:

The records of both tables will appear in the association result, regardless of whether they can be associated. For record that can be associated, the result will contain fields of the two tables; for record that cannot be associated, the correspond field of the other table will be filled with null.

In this case, even if we write B1:null, null will be ignored and we can still obtain the result of full join.

There are two main differences between pjoin and join/joinx.

One difference is that join/joinx will take the first table as base table in the case of left join, yet it doesn’t need a base table for inner join and full join, and all tables involved in association have the same status.

In contrast, pjoin requires a base table for any of these three joins.

The other difference is that the result of join/joinx is the records composed of records corresponding to each table.

On the contrary, pjoin advocates joining the field of each table to form result. For example, the calculation in A2 will obtain a result set consisting of four fields: id, name, salary, and allowance. The first three are the fields of employee table, and the last one comes from manager table.

Designed like this, pjoin can effectively improve performance, which will be introduced in detail later.

2. Associate multiple tables on primary key, involving left join and inner join

Assume we add one salary table to store the employee number ‘sid’ and the salary besides the employee table and manager table. Now we want to calculate the sum of salary and allowance, then only the records that can be associated between the employee table and salary table can participate in calculation, and the employee table should be taken as the base table when associating the employee table with manager table.

In other words, the association between employee and salary is inner join, and the association between employee and manager is left join. The code is roughly as follows:

The join/joinx function cannot write operations like those in A4.

3. Use the new/news function of composite table to implement primary key join

The new function can implement homo-dimension join. Take the employee table and manager table as an example:

A1: open the composite table (note that it is not a composite table cursor); B1: open the manager table and create a cursor.

A2: associate the primary key ‘eid’ of employee table with the first field ‘mid’ of manager table. Both tables should be ordered by respective association fields.

In addition, the new function can also be used for primary-sub join. Take the order table and order detail table as an example. The primary key of order table is the order number ‘oid’ and the primary keys of order detail table are the ‘d_oid’ and product number ‘pid’. The code to associate two tables is:

A2: associate the primary key ‘oid’ in A1 with the first field ‘d_oid’ in B1. Both tables should be ordered by respective association fields.

Since one order in A2 will correspond to multiple detail records, the new function requires performing an aggregation operation on these detail records.

When B1 is a cursor, the new function returns cursor; when B1 is a table sequence, the new function returns table sequence.

The new function can be appended with the @r option. In this case, there is no need to aggregate the detail records, and the order record will be duplicated multiple pieces according to the number of the corresponding detail records.

For the situation that sub table associates with primary table, we can use the news function, the code is:

A2: the news function uses the primary key ‘d_oid’ in B1 to correspond to the first field ‘oid’ of order table. Both tables should be ordered by respective association fields.

When multiple records of the detail table correspond to one record of order table, news will duplicate the order record.

The news function can be appended with the @r option. In this case, the detail table needs to be aggregated before associating, and the order record will not be duplicated.

When A1 is a cursor, the news function returns cursor; when A1 is a table sequence, the new function returns table sequence.

Using the new/news function to implement primary key join has advantages in performance because the composite table will automatically skip some data blocks along with the cursor (or table sequence) when calculating with the two functions.

For example, in the above example of new function, the function will fetch a batch of detail records first, and then fetch the order record based on the value range of d_oid of these detail records, which is equivalent to adding a filter condition to the order table and, since the composite table is ordered by primary key, some data blocks will be automatically skipped when involving filtering by primary key (if a certain data block does not have the primary key that meets the filter condition, the entire block will be skipped), thereby improving computing performance. Especially when the order detail records (after filtering) become very few, the block skipping effect is stronger and the performance improvement is more obvious.

4. pjoin supports aggregating the records of sub table and duplicating the record of primary table

pjoin covers and expands the application scope of new/news function of composite table. In addition to the composite table, pjoin can also be used for table sequence, or other types of cursors.

Let’s first take a look at the situation that primary table associates with sub table. In this case, there is a need to use the aggregation function to aggregate multiple records of sub table before associating.

For example, when associating the order table with order detail table, the code of using pjoin to calculate the amount of each order is roughly as follows:

A2: pjoin uses the primary key ‘oid’ of order table in A1 to associate part of primary keys ‘d_oid’ of order detail table. As one of the tables is cursor, any involved table needs to be ordered by respective association fields.

First multiply the price of each record of order detail table by quantity, then group and aggregate the products by d_oid, and finally associate the grouped and aggregated result with order table and join the fields.

pjoin also supports associating the sub table with primary table. For example, we want to find the corresponding customer number for each record of order detail table, the code is roughly as follows:

A2: use part of primary keys ‘d_oid’ of detail table to correspond to the first field ‘oid’ of order table.

Since one order record corresponds to multiple detail records, pjoin will duplicate the order record multiple times and then join them onto the field of detail records.

5. pjoin’s automatic block skipping mechanism

If there is composite table cursor in tables involved in pjoin calculation, pjoin can also implement the block skipping mechanism similar to new/news, hereby effectively improving performance. Assume both the order table and order detail table are composite table cursor:

When calculating a join with pjoin, the detail table cursor in B1 will automatically skip some data blocks along with the order table cursor in A1, which is the same as the block skipping mechanism of new/news described earlier.

The order table in A1 can also be a table sequence. When calculating with pjoin, the detail table cursor can still implement automatic data block skipping.

In some cases, the order detail table (after filtering) may be smaller than order table, which requires the order composite table cursor to skip data block along with the detail table. The pjoin code is roughly as follows:

In A2, the @r option is added, which means the cursor in A1 will skip data block along with B1. In this example, the table in B1 is a cursor, which can be a table sequence in practice.

Note that when adding the @f option to perform full join, pjoin cannot implement automatic block skipping mechanism.

6. Columnar cursor

pjoin supports the columnar cursor mechanism for pure table sequence of SPL Enterprise Edition, the code is:

Adding the @v option to cursors in A1 and B1 is to use columnar cursor.

Since join/joinx does not necessarily use the base table, it is difficult to determine whether the result is columnar. As mentioned earlier, pjoin must use a base table, so we can decide whether to adopt column-wise computing based on the base table when calculating.

The practice that join/joinx utilizes the records of each table to form result has performance advantages in row-wise calculation, but will affect the performance in column-wise calc