# Comparison of SQL & SPL: Set-oriented Operations

Set-oriented operations include a set of basic operations, including concatenation, intersection, union, difference, XOR, and inclusion. As structured data often exists in the form of sets, both SQL and SPL, the two commonly used programming languages, provide a wealth of set-oriented methods and functions to handle the data. This essay explores solutions and basic principles of the two languages for handling set-oriented operations through examples and corresponding sample programs and tries to introduce the more convenient and more efficient way for you.

**1. ****C****oncatenation**

Concatenation of set A and set B is a set including all members of set A and all members of set B. It contains duplicate members.

【Example 1】A vendor has two sales channels, online and store. The sales data for them is stored in two separate tables, SALES_ONLINE and SALES_STORE. The task is to calculate the total sales amount of each product in the year 2014. Below are parts of the two source tables:

SALES_ONLINE：

ID | CUSTOMERID | ORDERDATE | PRODUCTID | AMOUNT |

1 | HANAR | 2014/01/01 | 17 | 3868.6 |

2 | ERNSH | 2014/01/03 | 42 | 1005.9 |

3 | LINOD | 2014/01/06 | 64 | 400.0 |

4 | OTTIK | 2014/01/07 | 10 | 1194.0 |

5 | OCEAN | 2014/01/09 | 41 | 319.2 |

… | … | … | … | … |

SALES_STORE：

ID | CUSTOMERID | ORDERDATE | SELLERID | PRODUCTID | AMOUNT |

1 | EASTC | 2014/01/01 | 1 | 27 | 3063.0 |

2 | ERNSH | 2014/01/02 | 8 | 70 | 2713.5 |

3 | MAGAA | 2014/01/03 | 2 | 74 | 1675.0 |

4 | SAVEA | 2014/01/07 | 7 | 45 | 2018.2 |

5 | FOLIG | 2014/01/08 | 8 | 30 | 1622.4 |

… | … | … | … | … | … |

The two tables have different structures. SALES_STORE has SELLERID field for storing IDs of salespeople. But both have ID, CUSTOMERID, ORDERDATE, PRODUCTID and AMOUNT fields.

**SQL solution:**

select PRODUCTID, sum(AMOUNT) AMOUNT from (select PRODUCTID, AMOUNT from SALES_ONLINE where extract (year from ORDERDATE)=2014 union all select PRODUCTID, AMOUNT from SALES_STORE where extract (year from ORDERDATE)=2014) group by PRODUCTID order by PRODUCTID

SQL UNION operation is used to combine two or more sub result sets of same structure. They need to have same number of columns and each column should have the same data type. UNION ALL combines all records, including the duplicates.

Here we use ORCALE to do the SQL calculations. Since ORACLE does not have YEAR function, we use extract (year from date) to get the year.

**SPL solution:**

SPL uses the vertical line “|” to calculate concatenation of sets. A|B represents the concatenation of set A and set B.

A | |

1 | =T("SalesOnline.csv").select(year(ORDERDATE)==2014) |

2 | =T("SalesStore.csv").select(year(ORDERDATE)==2014) |

3 | =A1|A2 |

4 | =A3.groups(PRODUCTID; sum(AMOUNT):AMOUNT) |

**A1**: Import SalesOnline table from the source file and select sales records of 2014.**A2**: Import SalesStore table from the source file and select sales records of 2014.**A3**: Use the sign “|” to calculate concatenation of the two sets.**A4**: Group A3’s result set by product and calculate the total sales amount.

The SPL sequence supports members of different structures, so it does not require that the involved sets have same structure when concatenate them. It can directly access their common fields, like PRODUCTID and AMOUNT in this case, as it accesses an ordinary data table. It’s superbly convenient.

SPL supports retrieving a data table from the database, we can change A1 in the above script as follows:

A | |

1 | =connect("db").query("select * from SALES_STORE where extract (year from ORDERDATE)=2014") |

【Example 2】Based on the following scores table, find the top 4 math scores, top 3 English scores and top 2 PE scores. Below is part of the source table:

CLASS | STUDENTID | SUBJECT | SCORE |

1 | 1 | English | 84 |

1 | 1 | Math | 77 |

1 | 1 | PE | 69 |

1 | 2 | English | 81 |

1 | 2 | Math | 80 |

… | … | … | … |

**SQL solution:**

SQL lacks a special method of calculating concatenation of two or more sets. It uses UNION ALL operator to do this. Oracle, for instance, defines an open table, uses ROW_NUMBER() OVER to calculate the rankings of all subjects ordered by scores in descending order, and then get top N for each subject:

with cte1 as (select CLASS, STUDENTID, SUBJECT, SCORE, ROW_NUMBER() OVER(PARTITION BY SUBJECT ORDER BY SCORE DESC) grouprank from SCORES) select CLASS, STUDENTID, SUBJECT, SCORE from cte1 where grouprank <= 4 and SUBJECT='Math' UNION ALL select CLASS, STUDENTID, SUBJECT, SCORE from cte1 where grouprank <= 3 and SUBJECT='English' UNION ALL select CLASS, STUDENTID, SUBJECT, SCORE from cte1 where grouprank <= 2 and SUBJECT='PE'

**SPL solution:**

SPL uses A.conj() function to calculate the concatenation of member sets when A is a set of sets.

A | |

1 | =T("Scores.csv") |

2 | [Math,English,PE] |

3 | [4,3,2] |

4 | =A3.conj(A1.select(SUBJECT==A2(A3.#)).top(-~;SCORE)) |

**A1**: Import student scores table.**A2**: Define a set of subjects.**A3**: Define a set of ranks corresponding to the set of subjects.**A4**: Get top N scores for each specified subject, and use A.conj() function to concatenate multiple sets.

As there isn’t a special method of calculating concatenation of sets, we use UNION ALL to combine two sets each time. The code becomes more and more complicated when the number of sets involved increases. SPL has the special function A.conj() to do the job. It can calculate concatenation of any number of sets.

**2. ****I****ntersection**

The intersection of set A and set B is a set including all members belonging to both A and B.

【Example 3】A vendor has two sales channels, online and store. The sales data for them is stored in two separate tables, SALES_ONLINE and SALES_STORE. The task is to find the customers who purchase through both sales channels in the year 2014. Below is part of the source table:

SALES_ONLINE：

ID | CUSTOMERID | ORDERDATE | PRODUCTID | AMOUNT |

1 | HANAR | 2014/01/01 | 17 | 3868.6 |

2 | ERNSH | 2014/01/03 | 42 | 1005.9 |

3 | LINOD | 2014/01/06 | 64 | 400.0 |

4 | OTTIK | 2014/01/07 | 10 | 1194.0 |

5 | OCEAN | 2014/01/09 | 41 | 319.2 |

… | … | … | … | … |

SALES_STORE：

ID | CUSTOMERID | ORDERDATE | SELLERID | PRODUCTID | AMOUNT |

1 | EASTC | 2014/01/01 | 1 | 27 | 3063.0 |

2 | ERNSH | 2014/01/02 | 8 | 70 | 2713.5 |

3 | MAGAA | 2014/01/03 | 2 | 74 | 1675.0 |

4 | SAVEA | 2014/01/07 | 7 | 45 | 2018.2 |

5 | FOLIG | 2014/01/08 | 8 | 30 | 1622.4 |

… | … | … | … | … | … |

**SQL solution:**

SQL INTERSECT operator is used to calculate the intersection of two or more sub result sets. The sub result sets must have same data structure and same number of columns, and their columns should contain data of same types. Below is SQL query:

select DISTINCT CUSTOMERID from SALES_ONLINE where extract (year from ORDERDATE)=2014 INTERSECT select DISTINCT CUSTOMERID from SALES_STORE where extract (year from ORDERDATE)=2014

Early databases probably do not support INTERSECT operator. So we can first calculate concatenation (UNION ALL) and then perform a filtering to get the target through grouping operation and count operation:

select CUSTOMERID, COUNT(*) from (select DISTINCT CUSTOMERID from SALES_ONLINE where extract (year from ORDERDATE)=2014 UNION ALL select DISTINCT CUSTOMERID from SALES_STORE where extract (year from ORDERDATE)=2014) group by CUSTOMERID HAVING COUNT(*)>1 order by CUSTOMERID

The additional layer of grouping and aggregation makes the SQL query extremely complicated.

**SPL solution:**

SPL uses the sign “^” to calculate intersection of sets. A^B represents the intersection of set A and set B.

A | |

1 | =T("SalesOnline.csv").select(year(ORDERDATE)==2014) |

2 | =T("SalesStore.csv").select(year(ORDERDATE)==2014) |

3 | =A1.id(CUSTOMERID)^A2.id(CUSTOMERID) |

**A1**: Import SalesOnline table from the source file and select sales records of 2014.**A2**: Import SalesStore table from the source file and select sales records of 2014.**A3**: A.id() function gets the set of unique customer IDs. The sign “^” is used to calculate intersection of the two sets, which are customers who purchase products in both in online and at stores.

【Example 4】Based on the following sales data table, find top 10 customers in terms of monthly sales amount in the year 2014. Below is part of the source table:

ID | CUSTOMERID | ORDERDATE | SELLERID | PRODUCTID | AMOUNT |

10400 | EASTC | 2014/01/01 | 1 | 27 | 3063.0 |

10401 | HANAR | 2014/01/01 | 1 | 17 | 3868.6 |

10402 | ERNSH | 2014/01/02 | 8 | 70 | 2713.5 |

10403 | ERNSH | 2014/01/03 | 4 | 42 | 1005.9 |

10404 | MAGAA | 2014/01/03 | 2 | 74 | 1675.0 |

… | … | … | … | … | … |

**SQL solution:**

with cte1 as (select extract (month from ORDERDATE) ORDERMONTH, CUSTOMERID, SUM(AMOUNT) AMOUNT from SALES where extract (year from ORDERDATE)=2014 group by extract (month from ORDERDATE),CUSTOMERID order by ORDERMONTH ASC, AMOUNT DESC), cte2 as (select ORDERMONTH,CUSTOMERID,AMOUNT, ROW_NUMBER()OVER(PARTITION BY ORDERMONTH ORDER BY AMOUNT DESC) grouprank from cte1) select CUSTOMERID,count(*) from cte2 where grouprank<=10 group by CUSTOMERID having count(*)=12

SQL has not a method of calculating intersection of sets. The language uses INTERSECT operator to do the job. It is unsuitable to get top 10 customers in each month and then use INTERSECT operator to get the combination. Here we group records and perform a count operation. If a customer ranking in monthly top 10 for 12 times, it enters in the top 10 in each month.

**SPL solution:**

SPL uses A.isect() function to calculate intersection of all member sets if A is a set of sets.

A | |

1 | =T("Sales.csv").select(year(ORDERDATE)==2014) |

2 | =A1.group(month(ORDERDATE)) |

3 | =A2.(~.groups(CUSTOMERID;sum(AMOUNT):AMOUNT)) |

4 | =A3.(~.top(-10;AMOUNT).(CUSTOMERID)) |

5 | =A4.isect() |

**A1**: Import Sales table from the source file and select records of the year 2014.**A2**: Group A1’s table by month.**A3**: Group records in each month by customer.**A4**: Get customers whose monthly sales amounts rank in top 10 in each month.**A5**: A.isect() function calculates intersection of all selected customers in all months.

Though we employ a trick to calculate the intersection, the SQL query is still complex. One reason is that SQL does not have a special method to calculate intersection of sets, the other is that SQL cannot retain the post-grouping subsets for a further grouping operation or other computations. The grouping operation will be explained separately later.

**3. ****U****nion**

The union of set A and set B is a set including all members of A and members of B, except for the duplicates.

【Example 5】A vendor has two sales channels, online and store. The sales data for them is stored in two separate tables, SALES_ONLINE and SALES_STORE. The task is to find the products whose total online sales amounts in the year 2014 is above 10,000 or that are purchased over 5 times at stores. Below are parts of the source tables:

SALES_ONLINE：

ID | CUSTOMERID | ORDERDATE | PRODUCTID | AMOUNT |

1 | HANAR | 2014/01/01 | 17 | 3868.6 |

2 | ERNSH | 2014/01/03 | 42 | 1005.9 |

3 | LINOD | 2014/01/06 | 64 | 400.0 |

4 | OTTIK | 2014/01/07 | 10 | 1194.0 |

5 | OCEAN | 2014/01/09 | 41 | 319.2 |

… | … | … | … | … |

SALES_STORE：

ID | CUSTOMERID | ORDERDATE | SELLERID | PRODUCTID | AMOUNT |

1 | EASTC | 2014/01/01 | 1 | 27 | 3063.0 |

2 | ERNSH | 2014/01/02 | 8 | 70 | 2713.5 |

3 | MAGAA | 2014/01/03 | 2 | 74 | 1675.0 |

4 | SAVEA | 2014/01/07 | 7 | 45 | 2018.2 |

5 | FOLIG | 2014/01/08 | 8 | 30 | 1622.4 |

… | … | … | … | … | … |

**SQL solution:**

select PRODUCTID from (select PRODUCTID, sum(AMOUNT) from SALES_ONLINE where extract (year from ORDERDATE)=2014 group by PRODUCTID having sum(AMOUNT)>10000) union select PRODUCTID from (select PRODUCTID, count(*) from SALES_STORE where extract (year from ORDERDATE)=2014 group by PRODUCTID having count(*)>5)

As we have mentioned, SQL UNION operator is used to combine results sets of two or more SELECT statements. When UNION is used without ALL, the operator removes duplicate records while combining records.

**SPL solution:**

SPL offers the and sign “&” to calculate the union of sets. A&B represents the union of set A and set B.

A | |

1 | =T("SalesOnline.csv").select(year(ORDERDATE)==2014) |

2 | =T("SalesStore.csv").select(year(ORDERDATE)==2014) |

3 | =A1.groups(PRODUCTID; sum(AMOUNT):AMOUNT).select(AMOUNT>10000) |

4 | =A2.groups(PRODUCTID; count(~):COUNT).select(COUNT>5) |

5 | =A3.(PRODUCTID)&A4.(PRODUCTID) |

**A1**: Import SalesOnline table from the source file and select records of the year 2014.**A2**: Import SalesStore table from the source file and select records of the year 2014.**A3**: Group A1’s records by product ID, calculate the total sales amount for each product, and select records where the totals are above 10,000.**A4**: Group A2’s records by product ID, count purchase frequencies for each product, and select records where the frequency count is greater than 5.**A5**: Use “&” to calculate the union of records purchased both through online and at stores.

【Example 6】Based on the following sales table, find the products whose yearly purchase frequency ranks in top 10. Below is part of the source table:

ID | CUSTOMERID | ORDERDATE | SELLERID | PRODUCTID | AMOUNT |

10400 | EASTC | 2014/01/01 | 1 | 27 | 3063.0 |

10401 | HANAR | 2014/01/01 | 1 | 17 | 3868.6 |

10402 | ERNSH | 2014/01/02 | 8 | 70 | 2713.5 |

10403 | ERNSH | 2014/01/03 | 4 | 42 | 1005.9 |

10404 | MAGAA | 2014/01/03 | 2 | 74 | 1675.0 |

… | … | … | … | … | … |

**SQL solution:**

with cte1 as (select extract (year from ORDERDATE) ORDERYEAR, PRODUCTID, COUNT(*) ORDERCOUNT from SALES group by extract (year from ORDERDATE),PRODUCTID order by ORDERYEAR ASC, ORDERCOUNT DESC), cte2 as (select ORDERYEAR,PRODUCTID,ORDERCOUNT, ROW_NUMBER()OVER(PARTITION BY ORDERYEAR ORDER BY ORDERCOUNT DESC) grouprank from cte1) select DISTINCT PRODUCTID from cte2 where grouprank<=10

SQL does not offer a special method of calculating union of sets, so we group data by year, calculate the total purchase frequency of each product per year, sort the result records, get rankings through row numbers after sorting, then select IDs of products that rank in top 10 in each year in terms of purchase frequency, and finally, use DISTINCT operator to remove the duplicates to get the union.

**SPL solution:**

SPL uses A.union() function to calculate union of all member sets when A is a set of sets.

A | |

1 | =T("Sales.csv") |

2 | =A1.group(year(ORDERDATE)) |

3 | =A2.(~.groups(PRODUCTID;count(~):COUNT)) |

4 | =A3.(~.top(-5;COUNT).(PRODUCTID)) |

5 | =A4.union() |

**A1**: Import Sales table from the source file.**A2**: Group A1’s table by year.**A3**: Group records of each year by product and calculate their purchase frequency.**A4**: Get IDs of products whose yearly purchase frequencies rank in top 10.**A5**: A.union() function calculates union of desired records of products in all years.

**4. ****D****ifference**

The difference of set A and set B is a set including all members of A that do not belong to set B.

【Example 7】A vendor has two sales channels, online and store. The sales data for them is stored in two separate tables, SALES_ONLINE and SALES_STORE. The task is to find customers whose total purchase amount at stores is above 1000 but who have not any online purchase records. Below is part of the source table:

SALES_ONLINE：

ID | CUSTOMERID | ORDERDATE | PRODUCTID | AMOUNT |

1 | HANAR | 2014/01/01 | 17 | 3868.6 |

2 | ERNSH | 2014/01/03 | 42 | 1005.9 |

3 | LINOD | 2014/01/06 | 64 | 400.0 |

4 | OTTIK | 2014/01/07 | 10 | 1194.0 |

5 | OCEAN | 2014/01/09 | 41 | 319.2 |

… | … | … | … | … |

SALES_STORE：

ID | CUSTOMERID | ORDERDATE | SELLERID | PRODUCTID | AMOUNT |

1 | EASTC | 2014/01/01 | 1 | 27 | 3063.0 |

2 | ERNSH | 2014/01/02 | 8 | 70 | 2713.5 |

3 | MAGAA | 2014/01/03 | 2 | 74 | 1675.0 |

4 | SAVEA | 2014/01/07 | 7 | 45 | 2018.2 |

5 | FOLIG | 2014/01/08 | 8 | 30 | 1622.4 |

… | … | … | … | … | … |

**SQL solution:**

select CUSTOMERID from (select CUSTOMERID,count(*) from SALES_STORE where extract (year from ORDERDATE)=2014 group by CUSTOMERID having count(*)>3) MINUS select DISTINCT CUSTOMERID from SALES_ONLINE where extract (year from ORDERDATE)=2014

Take Oracle SQL as an example. The MINUS operator (some databases use EXCEPT) is used to calculate difference. MINUS requires that the involved sub result sets must have same data structures and same number of columns, and that corresponding columns should have same or similar data types.

**SPL solution:**

SPL uses the backslash “\” to calculate difference of sets. A\B represents the difference of set A and set B.

A | |

1 | =T("SalesOnline.csv").select(year(ORDERDATE)==2014) |

2 | =T("SalesStore.csv").select(year(ORDERDATE)==2014) |

3 | =A2.groups(CUSTOMERID; count(~):COUNT).select(COUNT>3) |

4 | =A3.id(CUSTOMERID)\A1.id(CUSTOMERID) |

**A1**: Import SalesOnline table from the source file and select records of the year 2014.**A2**: Import SalesStore table from the source file and select records of the year 2014.**A3**: Group A2’s records by customer IDs, calculate the total purchase frequency for each customer, and select records where the total frequency is greater than 3.**A4**: Use “\” to calculate difference of customers who purchase products online and those who buy things at stores.

【Example 8】Based on the following sales table, find customers whose total sales amounts rank in top 10 only in January in the year 2014. Below is part of the source table:

ID | CUSTOMERID | ORDERDATE | SELLERID | PRODUCTID | AMOUNT |

10400 | EASTC | 2014/01/01 | 1 | 27 | 3063.0 |

10401 | HANAR | 2014/01/01 | 1 | 17 | 3868.6 |

10402 | ERNSH | 2014/01/02 | 8 | 70 | 2713.5 |

10403 | ERNSH | 2014/01/03 | 4 | 42 | 1005.9 |

10404 | MAGAA | 2014/01/03 | 2 | 74 | 1675.0 |

… | … | … | … | … | … |

**SQL solution:**

with cte1 as (select extract (month from ORDERDATE) ORDERMONTH, CUSTOMERID, SUM(AMOUNT) AMOUNT from SALES where extract (year from ORDERDATE)=2014 group by extract (month from ORDERDATE),CUSTOMERID order by ORDERMONTH ASC, AMOUNT DESC), cte2 as (select ORDERMONTH,CUSTOMERID,AMOUNT, ROW_NUMBER()OVER(PARTITION BY ORDERMONTH ORDER BY AMOUNT DESC) grouprank from cte1) select CUSTOMERID from cte2 where grouprank<=10 and ORDERMONTH=1 MINUS select CUSTOMERID from cte2 where grouprank<=10 and ORDERMONTH<>1

SQL does not have a particular method for calculating difference of sets, so we group data by month, calculate the total sales amount of each customer per month, sort the result records, get rankings through row numbers after sorting, select customers of January, and use MINUS operator to calculate difference January’s customers and the set of customers in other months.

**SPL solution:**

SPL offers A.diff() function to calculate difference of the first member set and all the other member sets when A is a set of sets.

A | |

1 | =T("Sales.csv").select(year(ORDERDATE)==2014) |

2 | =A1.group(month(ORDERDATE)) |

3 | =A2.(~.groups(CUSTOMERID;sum(AMOUNT):AMOUNT)) |

4 | =A3.(~.top(-10;AMOUNT).(CUSTOMERID)) |

5 | =A4.diff() |

**A1**: Import Sales table from the source file and select records of the year 2014.**A2**: Group A1’s records by month.**A3**: Group records of each month by customer IDs and calculate total sales of each customer.**A4**: Get customers in each month whose total sales amounts rank in top 10.**A5**: Use A.diff() function to calculate difference between customers of January and those in the other months.

**5. ****X****OR**

XOR of set A and set B is a set made up members that belong to A or B but not belong to both.

【Example 9】Students’ scores in two semesters are stored in two tables respectively. The task is to find students in class 1 whose total scores rank in top 10 only once in both semesters. Below is part of the source table:

SCORES1_SEMESTER1:

CLASS | STUDENTID | SUBJECT | SCORE |

1 | 1 | English | 84 |

1 | 1 | Math | 77 |

1 | 1 | PE | 69 |

1 | 2 | English | 81 |

1 | 2 | Math | 80 |

… | … | … | … |

SCORES2_SEMESTER2:

CLASS | STUDENTID | SUBJECT | SCORE |

1 | 1 | English | 97 |

1 | 1 | Math | 64 |

1 | 1 | PE | 97 |

1 | 2 | English | 56 |

1 | 2 | Math | 82 |

… | … | … | … |

**SQL solution:**

SQL does not define an operator for performing XOR. There are two methods if we choose to use set operators to do this:

1. (A UNION B) EXCEPT (A INTERSECT B);

2. (A EXCEPT B) UNION (B EXCEPT A);

Both methods are not convenient because they use multiple views and thus drag performance down. In the following query, we use FULL OUTER JOIN to counter the disadvantage:

with cte1 as (select STUDENTID,TOTALSCORE from (select STUDENTID, sum(SCORE) TOTALSCORE from SCORES1 group by STUDENTID order by TOTALSCORE DESC) where rownum <= 10), cte2 as (select STUDENTID,TOTALSCORE from (select STUDENTID, sum(SCORE) TOTALSCORE from SCORES2 group by STUDENTID order by TOTALSCORE DESC) where rownum <= 10) select COALESCE(cte1.STUDENTID, cte2.STUDENTID) STUDENTID, cte1.TOTALSCORE, cte2.TOTALSCORE from cte1 FULL OUTER JOIN cte2 ON cte1.STUDENTID=cte2.STUDENTID where cte1.TOTALSCORE IS NULL OR cte2.TOTALSCORE IS NULL

With Oracle SQL, we can use LEFT JOIN UNION RIGHT JOIND to implement FULL JOIN if the involved database is one that does not support FULL JOIN, such as MySQL (Detail query is skipped here).

**SPL solution:**

SPL uses the percent sign “%” to calculate XOR. A%B represents XOR of set A and set B.

A | |

1 | =T("Scores1.csv") |

2 | =T("Scores2.csv") |

3 | =A1.groups(STUDENTID; sum(SCORE):SCORE) |

4 | =A2.groups(STUDENTID; sum(SCORE):SCORE) |

5 | =A3.top(-10;SCORE).(STUDENTID) |

6 | =A4.top(-10;SCORE).(STUDENTID) |

7 | =A5%A6 |

**A1**: Import Scores1 table of from source file.**A2**: Import Scores2 table of from source file.**A3**: Group A1’s table by student ID and sum scores of each student.**A4**: Group A2’s table by student ID and sum scores of each student.**A5**: Get IDs of students whose total scores rank in top 10 in semester 1.**A6**: Get IDs of students whose total scores rank in top 10 in semester 2.**A7**: Use “%” to calculate XOR of student total scores in semester 1 and those in semester 2.

The SQL query is enormously complicated because there isn’t a particular operator in SQL to calculate XOR. SPL, however, is convenient by offering the percent sign “%” to do it.

【Example 10】Based on the following table, find whether customer RATTC ranked in top 3 in the year 2014 in terms of sales amount in a single month. Below is part of the source data:

ID | CUSTOMERID | ORDERDATE | SELLERID | PRODUCTID | AMOUNT |

10400 | EASTC | 2014/01/01 | 1 | 27 | 3063.0 |

10401 | HANAR | 2014/01/01 | 1 | 17 | 3868.6 |

10402 | ERNSH | 2014/01/02 | 8 | 70 | 2713.5 |

10403 | ERNSH | 2014/01/03 | 4 | 42 | 1005.9 |

10404 | MAGAA | 2014/01/03 | 2 | 74 | 1675.0 |

… | … | … | … | … | … |

**SQL solution:**

with cte1 as (select extract (month from ORDERDATE) ORDERMONTH, CUSTOMERID, SUM(AMOUNT) AMOUNT from SALES where extract (year from ORDERDATE)=2014 group by extract (month from ORDERDATE),CUSTOMERID order by ORDERMONTH ASC, AMOUNT DESC), cte2 as (select ORDERMONTH,CUSTOMERID,AMOUNT, ROW_NUMBER()OVER(PARTITION BY ORDERMONTH ORDER BY AMOUNT DESC) grouprank from cte1) select count(*) CUSTOMER_EXISTS from cte2 where grouprank<=3 and CUSTOMERID='RATTC'

SQL does not have a specific method for getting XOR, so we group data by month, calculate the total sales amount of each customer per month, sort the result records, get rankings through row numbers after sorting, select customers entering top 3 in each month, and count records of customer "RATTC". If the count result is 1, it means true; and if it is 0, it means false.

**SPL solution:**

SPL offers A.cor() function to calculate XOR of member sets when A is a set of sets.

A | |

1 | =T("Sales.csv").select(year(ORDERDATE)==2014) |

2 | =A1.group(month(ORDERDATE)) |

3 | =A2.(~.groups(CUSTOMERID; sum(AMOUNT):AMOUNT)) |

4 | =A3.new(~.top(-3; AMOUNT):TOP3) |

5 | =A4.(TOP3.(CUSTOMERID).pos("RATTC")>0) |

6 | =A5.cor() |

**A1**: Import Sales table from the source file and select records of the year 2014.**A2**: Group Sales table by month.**A3**: Group records of each month by customer and calculate total sales amount of each customer in each group.**A4**: Get customers whose sales amounts rank in top 3 in each month.**A5**: Find if there is customer "RATTC" in each group.**A6**: Use A.cor() function to find XOR, that is, whether customer "RATTC" is included in each month. The customer exists if the result is true, and it does not exist if the result is false.

**6. ****B****elong to & include**

Suppose there are two sets A and B. If all members of set A are members of set B, we call that B includes A. The “belong to” relationship is between a member and a set. When there is a member x in set A, we call that x belongs to A.

【Example 11】Based on the following employee table, calculate the average salary in each department in the states of California, New York, Texas and Washington. Below is part of the source table:

ID | NAME | SURNAME | STATE | DEPT | SALARY |

1 | Rebecca | Moore | California | R&D | 7000 |

2 | Ashley | Wilson | New York | Finance | 11000 |

3 | Rachel | Johnson | New Mexico | Sales | 9000 |

4 | Emily | Smith | Texas | HR | 7000 |

5 | Ashley | Smith | Texas | R&D | 16000 |

… | … | … | … | … | … |

**SQL solution:**

SQL IN operator is used to define a “belong to” relationship in WHERE sub statement. The SQL query is as follows:

select DEPT, avg(SALARY) AVGSALARY from EMPLOYEE where STATE in ('California','New York','Texas','Washington') group by DEPT

**SPL solution:**

SPL supplies A.contain(x) function to check whether member x belongs to set A.

A | |

1 | =T("Employee.csv") |

2 | [California,New York,Texas,Washington] |

3 | =A1.select(A2.contain(STATE)) |

4 | =A3.groups(DEPT; avg(SALARY):SALARY) |

**A1**: Import Employee table from the source file.**A2**: Define a constant set of states.**A3**: Select records from A1’s table where the states belong to A2’s set.**A4**: Group the selected records in A3 by department and calculate the average salary in each department.

【Example 12】Based on COURSE table and SELECT_COURSE table, find students who select both Matlab and Modern wireless communication system. Below is part of the s source table:

COURSE:

ID | NAME | TEACHERID |

1 | Environmental protection and sustainable development | 5 |

2 | Mental health of College Students | 1 |

3 | Matlab | 8 |

4 | Electromechanical basic practice | 7 |

5 | Introduction to modern life science | 3 |

… | … | … |

SELECT_COURSE:

ID | STUDENT_NAME | COURSE |

1 | Rebecca Moore | 2,7 |

2 | Ashley Wilson | 1,8 |

3 | Rachel Johnson | 2,7,10 |

4 | Emily Smith | 1,10 |

5 | Ashley Smith | 5,6 |

… | … | … |

The task can be described in another way. It checks whether the COURSE field value in SELECT_COURSE table contains the set [3,6], which are IDs of Matlab and Modern wireless communication system.

**SQL solution:**

The SQL field does not support set data type, so we cannot use the set include relationship to get this task done. With Oracle database here, we use REGEXP_SUBSTR function to split each COURSE value string according to a specific regular expression, left join SELECT_COURSE table and COURSE table to get records selecting both courses, group these records by IDs, get groups containing at least two records, that is, those selecting both courses, and then locate corresponding records from SELECT_COURSE table according to the selected IDs. Below is SQL query:

with cte1 as (SELECT ID,REGEXP_SUBSTR(t1.COURSE ,'[^,]+',1,l) COURSE FROM SELECT_COURSE t1, (SELECT LEVEL l FROM DUAL CONNECT BY LEVEL<=10) t2 WHERE l<=LENGTH(t1.COURSE) - LENGTH(REPLACE(COURSE,','))+1) select * from SELECT_COURSE t5 inner join ( select ID, count(*) from ( select t3.ID, COURSE from cte1 t3 inner join ( select ID from COURSE where NAME='Matlab' or NAME='Modern wireless communication system' ) t4 on t3.COURSE=t4.ID ) group by ID having count(*)>=2 ) t6 on t5.ID=t6.ID

**SPL solution:**

SPL uses A.pos(B) function locate the position of a member of set B in set A and returns null if the member does not exist in set A.

A | |

1 | =T("Course.txt") |

2 | =T("SelectCourse.txt") |

3 | =A1.select(NAME=="Matlab" || NAME=="Modern wireless communication system").(ID) |

4 | =A2.run(COURSE=COURSE.split@cp()) |

5 | =A4.select(COURSE.pos(A3)!=null) |

**A1**: Import Course table from the source file.**A2**: Import SelectCourse table from the source file.**A3**: Get the set of IDs of the target two courses.**A4**: Split each Course value by comma and parse the numbers into a set.**A5**: Use A.pos() function to locate IDs of the target courses in COURSE value of SELECT_COURSE table, and a record that does not make it return null is a desired one.

As SQL lacks support of set type field values, it is a little complicated to handle this case. The language does snot provide a method of checking a set include relationship, so it handles the job using the filtering join. SPL, however, supports set type field values and offers a rich library of functions to locate members of a set, which facilitates the handling of set include relationship judgment cases.

【Example 13】Based on the following weather data in a certain area, find the dates when west wind occurs and when north wind visits in the previous dates. Below is part of the source table:

WEATHER_DATE | RAIN_FALL | WIND_GUST_DIR | WIND_GUST_SPEED | RAIN_TODAY | RAIN_TOMORROW |

2008/12/01 | 0.6 | W | 44 | No | No |

2008/12/02 | 0.0 | WNW | 44 | No | No |

2008/12/03 | 0.0 | WSW | 46 | No | No |

2008/12/04 | 0.0 | NE | 24 | No | No |

2008/12/05 | 1.0 | W | 41 | No | No |

… | … | … | … | … | … |

**SQL solution:**

The task is simple. It aims to find an ordered subset [N,N,W] in WIND_GUST_DIR set. SQL has a weak support for order-based calculations due to its unordered-set-based theoretic foundation (which is explained in Comparison of SQL & SPL: Order-based calculation). When the SQL you are using does not support window functions, you can only do this through table joins. Below is SQL query:

select curr.WEATHER_DATE, RAIN_FALL,curr.WIND_GUST_DIR, WIND_GUST_SPEED,RAIN_TODAY,RAIN_TOMORROW from weather curr inner join (select pre1.WEATHER_DATE,pre1.WIND_GUST_DIR from weather pre1 left join (select WEATHER_DATE,WIND_GUST_DIR from weather) pre2 on pre1.WEATHER_DATE=pre2.WEATHER_DATE+1 where pre1.WIND_GUST_DIR='N' and pre2.WIND_GUST_DIR='N') yest on curr.WEATHER_DATE=yest.WEATHER_DATE+1 where curr.WIND_GUST_DIR='W' order by WEATHER_DATE

The SQL query is roundabout. Each inter-row access requires a self-join. It is inefficient. SQL introduced window functions in the year 2003 and brought in the concept of order. That has made the order-based calculations slightly easier:

select WEATHER_DATE,RAIN_FALL,WIND_GUST_DIR, WIND_GUST_SPEED,RAIN_TODAY,RAIN_TOMORROW from (select WEATHER_DATE,RAIN_FALL,WIND_GUST_DIR, WIND_GUST_SPEED,RAIN_TODAY,RAIN_TOMORROW, LAG(WIND_GUST_DIR,1) OVER ( ORDER BY WEATHER_DATE ASC ) YESTERDAY_DIR, LAG(WIND_GUST_DIR,2) OVER ( ORDER BY WEATHER_DATE ASC ) BEFORE_YESTERDAY_DIR from WEATHER) where WIND_GUST_DIR='W' and YESTERDAY_DIR='N' and BEFORE_YESTERDAY_DIR='N' order by WEATHER_DATE

**SPL solution:**

SPL supports accessing a record previous to the current one or after it for order-based calculations.

A | |

1 | =T("weather.csv") |

2 | =A1.select(WIND_GUST_DIR[-2]=="N" && WIND_GUST_DIR[-1]=="N" && WIND_GUST_DIR=="W") |

**A1**: Import weather data from the source file.**A2**: Select records where WIND_GUST_DIR values are west and the previous two values are north.

**Summary**

SQL gives a good support for concatenation, intersection, union, and difference of two sets but a not good one for XOR. The language does not support theses operations on more sets, and it resorts to roundabout ways to get do jobs. SPL provides special functions for each type of set-oriented operations. This makes SPL code concise, efficient, and easy to understand.

SQL does not support set type field values. When a field value is separated by a certain identifier, SQL cannot perform set include operation on it. The language is awkward and produces complicated code in handling the “belong to” relationship on an ordered subset even with the window function. SPL designs a lot of overall location functions to deal with the set “belong to” relationship. It also supplies a complete set of supports for order-based calculations, which makes it easy to handle inter-row operations.

When the query is complicated, the complexity of SQL query increases by multiple times. It involves the use of temporary table and nested query, etc, which makes it harder to write and maintain the SQL query. SPL, however, can compose succinct code step by step according to the natural way of thinking.

The SPL-driven esProc is the professional data computation engine. It is ordered-set-based and offers a complete set of set-oriented functions, which combines the advantages of both Java and SQL. With SPL, a set-oriented operation becomes simple and easy.