# SPL computing performance test series: funnel analysis

Funnel analysis is a common statistical requirement in e-commerce business. When a user uses the smart device to shop, the system will establish a connection to form a session. Each session contains a number of events, such as visiting a website, browsing a product page, placing an order. The events of each user occur in a certain order; the later the event occurs, the fewer the number of users involved in the event, just like a funnel. The conversion funnel analysis is to count the number of distinct users of each event first, and then perform further calculations (such as conversion rate) based on the statistical results.

Here below are the simplified data structures of an e-commerce system:

sessions table

events table:

eventtype table:

Devicetype table

Description of inter-table relationship:

The sessions table (s for short) and the events table (e for short) are in one-to-many relationship, and are associated through the userid and id of s and the userid and sessionid of e.

Each userid in s corresponds to unique id, and each session corresponds to unique event id.

The e and the eventtype table (et for short) are in many-to-one relationship, and are associated through the eventtype of e and the id of et.

The s and the devicetype table (dt for short) are in many-to-one relationship, and are associated through the devicetype of s and the id of dt.

The input parameter of the test is a specified date like 2022-07-21. This test is to calculate the three- and seven-step funnel conversion rates for events happened within 30 days prior to the specified date.

Let’s take a three-step funnel analysis as an example, and the given time period is from 2022-06-21 00:00 to 2022-07-21 00:00, calculate:

the number of distinct users who performed the visit-type event, represented by step1_count;

the number of distinct users who performed the visit- and view-type events in turn in a day, represented by step2_count;

the number of distinct users who performed the visit-, view- and confirm-type events in turn in a day, represented by step3_count;

Then, based on the calculation results, group and aggregate the numbers of distinct users by the devicetype of sessions table to which these events correspond, and finally divide step3_count by step1_count to calculate the conversion rate of three-step funnel.

The calculation result of three-step funnel is roughly as follows:

The computing process of seven-step funnel is similar to that of three-step funnel, except that there are seven types of events, namely: visit, view, detail, login, cart, confirm, and pay. The calculation result of seven-step funnel is roughly as follows:

This test is divided into two types (A and B) based on whether or not to include the string in the filter condition. Based on type A test, each step of type B test adds a filter condition that contains the substring. For example, in the first step of the funnel calculation, one more condition needs to be added to events: eventmessage contains the substring “check_point”. In the filter conditions of each step of funnel calculation, the substrings of the eventmessage field are different.

## II. Technologies to be compared

SPL Enterprise Edition (May 2023 version) is to be tested. Under the same test environment, we select StarRocks and Oracle to make a comparison with SPL in computing funnel conversion rate.

1. Starrocks, which is claimed to be the fastest SQL-based database. Starrocks 3.0 and 2.5.3 are to be tested.
2. Oracle, which is widely used, and often serves as a benchmark in database performance test. Oracle 19c is to be tested.

Since Oracle is not a professional OLAP database, the performance indicators are usually weak and are for reference only.

## III. Test environment

One physical server with the following configuration:

2 x Intel3014 CPUs, main frequency 1.7G, 12 cores in total
64G memory
SSD (Solid State Drive)

In order to be able to test the computing ability of these products on external storage as well as their sensitivity to memory, we use virtual machines to limit the capacity of memory, and design several test environments with different memory capacities, and each environment has an 8-core CPU.

VM1: 32GB memory; VM2: 16GB memory; VM3: 8GB memory; VM4: 6GB memory; VM5: 4GB memory

For StarRocks, at least two nodes, BE and FE, need to be installed. The BE that undertakes computing task is installed on the VM, and the FE that undertakes management task is installed on the physical machine to avoid affecting test result.

For SPL and Oracle, they only need to be installed on VM.

We test SPL and StarRocks on any VM that can obtain result, yet we test Oracle only on VM1 because its test results are for reference only.

## IV. Test data

The time period of all test data is 2022-06-28 to 07-28.

sessions: 3.5 million rows
events: 280 million rows
eventtype: 678 rows
devicetype: 8 rows

After downloading, unzip it to a directory, configure this directory as the main directory in esProc, and execute dataInit.splx to generate text files.

The test data are stored in four tab-separated text files (without field name):

sessions.txt: the data are sorted by userid and id (in ascending order, the same below).
events.txt: the data are sorted by userid, sessionid, eventtime and id. For the filter conditions on substrings (such as “check_point”) in type B test, the judgment results for events table are all “true”.
eventtype.txt: the data are sorted by id.
devicetype.txt: the data are sorted by id.

The order of the fields is the same as that of the fields in the above table structure.

Taking devicetype.txt as an example, the file memory is:

1 phone
2 other
3 unkown
4 laptop
6 PC
7 TV
8 spider

## V. Test SQL

We test Oracle with the three- and seven-step funnel calculations of type A test on VM1.

We test StarRocks with the three- and seven-step funnel calculations of both type A and B tests on all VMs that can obtain results.

During the test, we find that an error is reported when StarRocks calculates step_count of two or more steps:

Therefore, only step_count of one step can be calculated at a time during the test.

For example, if any two or three out of step1_count, step2_count, and step3_count is calculated in the 3-step funnel, this error will definitely occur. So only one of the three can be calculated during the test.

This error occurs in both the latest version 3.0 and the stable version 2.5.3.

SQL statement for Oracle (type A test, three-step funnel):

```WITH e1 AS (
SELECT
userid,
eventtime AS step1_time,
MIN(sessionid) AS sessionid,
1 AS step1
FROM events e1
JOIN eventtype ON eventtype.id = e1.eventtype
WHERE eventtime>= to_date('2022-07-21','yyyy-mm-dd')-30
AND eventtime< to_date('2022-07-21','yyyy-mm-dd')
AND (eventtype.name = 'visit')
GROUP BY userid,eventtime
), e2 AS (
SELECT
e2.userid,
MIN(e2.sessionid) AS sessionid,
1 AS step2,
MIN(eventtime) AS step2_time,
MIN(e1.step1_time) AS step1_time
FROM events e2
JOIN e1 ON e1.sessionid = e2.sessionid  AND eventtime > step1_time
JOIN eventtype ON eventtype.id = e2.eventtype
WHERE eventtime < step1_time +1
AND (eventtype.name = 'view')
GROUP BY  e2.userid
), e3 AS (
SELECT
e3.userid,
MIN(e3.sessionid) AS sessionid,
1 AS step3,
MIN(eventtime) AS step3_time,
MIN(e2.step1_time) AS step1_time
FROM events e3
JOIN e2 ON e2.sessionid = e3.sessionid  AND eventtime > step2_time
JOIN eventtype ON eventtype.id = e3.eventtype
WHERE eventtime < step1_time+1
AND (eventtype.name = 'detail')
GROUP BY  e3.userid
)
SELECT
dt.name AS devicetype,
COUNT(DISTINCT CASE WHEN funnel_conversions.step1 IS NOT NULL THEN funnel_conversions.step1_userid  ELSE NULL END) AS step1_count,
COUNT(DISTINCT CASE WHEN funnel_conversions.step2 IS NOT NULL THEN funnel_conversions.step2_userid  ELSE NULL END) AS step2_count,
COUNT(DISTINCT CASE WHEN funnel_conversions.step3 IS NOT NULL THEN funnel_conversions.step3_userid  ELSE NULL END) AS step3_count
COUNT(DISTINCT CASE WHEN funnel_conversions.step3 IS NOT NULL THEN funnel_conversions.step3_userid  ELSE NULL END)
/ COUNT(DISTINCT CASE WHEN funnel_conversions.step1 IS NOT NULL THEN funnel_conversions.step1_userid  ELSE NULL END) AS step3_rate
FROM (
SELECT
e1.step1_time AS step1_time,
e1.userid AS userid,
e1.userid AS step1_userid,
e2.userid AS step2_userid,
e3.userid AS step3_userid,
e1.sessionid AS step1_sessionid,
step1, step2, step3
FROM e1
LEFT JOIN e2 ON e1.userid=e2.userid
LEFT JOIN e3 ON e2.userid=e3.userid
) funnel_conversions
LEFT JOIN sessions s ON funnel_conversions.step1_sessionid = s.id
left join devicetype dt on s.devicetype=dt.id
GROUP BY dt.name```

Where:

1. Subquery e1: find out the visit-type events occurred from to_date(‘2022-07-21’,‘yyyy-mm-dd’) to to_date(‘2022-07-21’,‘yyyy-mm-dd’)-30; group the results by userid and eventtime; take the min sessionid in each group to serve as e1’s sessionid, and take eventtime as e1’s step1_time. The constant 1 is the step1 field of the result set e2.
2. Subquery e2: inner join of events table and e1. The conditions to join include that the sessionid of e1 is equal to the sessionid of events table, and the eventtime is greater than e1.step1_time; filter the join results. The conditions to filter include that the eventtime is less than e1.step1_time + 1 day, and the eventtype is view; group by the userid of events table, and take the min sessionid, min eventtime, and min step1_time, to serve as sessionid, step2_time, and step1_time of the result set e2 respectively; the constant 1 is the step2 field of the result set e2.
3. Subquery e3 is similar to e2, except for the eventtype.
4. Subquery funnel_conversions: left join of e1 and e2, e3. The condition to join is that the userids are the same. In the join result, there are userids of e1 to e3, and there are step1 to step3.
5. The outermost layer query: join on the id of sessions table and the sessionid of funnel_conversions, and join on the id of devicetype table and the devicetype of sessions table; group by the name of devicetype table; calculate the number of distinct users of each device type and each step; divide the result of the third step by the result of the first step to get the overall conversion rate.

SQL statement for StarRocks (type B test, seven-step funnel):

```WITH e1 AS (
SELECT
userid,
eventtime AS step1_time,
MIN(sessionid) AS sessionid,
1 AS step1
FROM events e1
JOIN eventtype ON eventtype.id = e1.eventtype
WHERE  eventtime>= DATE_ADD(str_to_date('2022-07-21','%Y-%m-%d') ,INTERVAL -30 day) AND eventtime< str_to_date('2022-07-21','%Y-%m-%d')
AND (eventtype.name = 'visit')
AND eventmessage like '%check%'
GROUP BY userid,eventtime
), e2 AS (
SELECT
e2.userid,
MIN(e2.sessionid) AS sessionid,
1 AS step2,
MIN(eventtime) AS step2_time,
MIN(e1.step1_time) AS step1_time
FROM events e2
JOIN e1 ON e1.sessionid = e2.sessionid  AND eventtime > step1_time
JOIN eventtype ON eventtype.id = e2.eventtype
WHERE eventtime < DATE_ADD(step1_time ,INTERVAL +1 day)
AND (eventtype.name = 'view')
AND eventmessage like '%point%'
GROUP BY  e2.userid
), e3 AS (
SELECT
e3.userid,
MIN(e3.sessionid) AS sessionid,
1 AS step3,
MIN(eventtime) AS step3_time,
MIN(e2.step1_time) AS step1_time
FROM events e3
JOIN e2 ON e2.sessionid = e3.sessionid  AND eventtime > step2_time
JOIN eventtype ON eventtype.id = e3.eventtype
WHERE eventtime < DATE_ADD(step1_time ,INTERVAL +1 day)
AND (eventtype.name = 'detail')
AND eventmessage like '%check_point%'
GROUP BY  e3.userid
), e4 AS (
SELECT
e4.userid,
MIN(e4.sessionid) AS sessionid,
1 AS step4,
MIN(eventtime) AS step4_time,
MIN(e3.step1_time) AS step1_time
FROM events e4
JOIN e3 ON e3.sessionid = e4.sessionid  AND eventtime > step3_time
JOIN eventtype ON eventtype.id = e4.eventtype
WHERE eventtime < DATE_ADD(step1_time ,INTERVAL +1 day)
AND eventmessage like '%check_p%'
GROUP BY  e4.userid
), e5 AS (
SELECT
e5.userid,
MIN(e5.sessionid) AS sessionid,
1 AS step5,
MIN(eventtime) AS step5_time,
MIN(e4.step1_time) AS step1_time
FROM events e5
JOIN e4 ON e4.sessionid = e5.sessionid  AND eventtime > step4_time
JOIN eventtype ON eventtype.id = e5.eventtype
WHERE eventtime < DATE_ADD(step1_time ,INTERVAL +1 day)
AND (eventtype.name = 'cart')
AND eventmessage like '%k_point%'
GROUP BY  e5.userid
), e6 AS (
SELECT
MIN(e6.sessionid) AS sessionid,
1 AS step6,
MIN(eventtime) AS step6_time,
MIN(e5.step1_time) AS step1_time
FROM events e6
JOIN e5 ON e5.sessionid = e6.sessionid  AND eventtime > step5_time
JOIN eventtype ON eventtype.id = e6.eventtype
WHERE eventtime < DATE_ADD(step1_time ,INTERVAL +1 day)
AND (eventtype.name = 'confirm')
AND eventmessage like '%ck_point%'
GROUP BY  e6.userid
), e7 AS (
SELECT
e7.userid,
MIN(e7.sessionid) AS sessionid,
1 AS step7,
MIN(eventtime) AS step7_time,
MIN(e6.step1_time) AS step1_time
FROM events e7
JOIN e6 ON e6.sessionid = e7.sessionid  AND eventtime > step6_time
JOIN eventtype ON eventtype.id = e7.eventtype
WHERE eventtime < DATE_ADD(step1_time ,INTERVAL +1 day)
AND (eventtype.name = 'pay')
AND eventmessage like '%check_poi%'
GROUP BY  e7.userid
)
SELECT
dt.name AS devicetype,
COUNT(DISTINCT funnel_conversions.step7_userid) AS step7_count
FROM (
SELECT
e1.step1_time AS step1_time,
e1.userid AS userid,
e1.userid AS step1_userid,
e2.userid AS step2_userid,
e3.userid AS step3_userid,
e4.userid AS step4_userid,
e5.userid AS step5_userid,
e6.userid AS step6_userid,
e7.userid AS step7_userid,
e1.sessionid AS step1_sessionid,
step1,
step2,
step3,
step4,
step5,
step6,
step7
FROM e1
LEFT JOIN e2 ON e1.userid=e2.userid
LEFT JOIN e3 ON e2.userid=e3.userid
LEFT JOIN e4 ON e3.userid=e4.userid
LEFT JOIN e5 ON e4.userid=e5.userid
LEFT JOIN e6 ON e5.userid=e6.userid
LEFT JOIN e7 ON e6.userid=e7.userid
) funnel_conversions
LEFT JOIN sessions s ON funnel_conversions.step1_sessionid = s.id
left join devicetype dt on s.devicetype=dt.id
GROUP BY dt.name```

The SQL code of StarRocks is roughly the same as that of Oracle. Since StartRocks will have the error mentioned above, only one step of the seven steps is calculated in the outermost layer: COUNT(DISTINCT funnel_conversions.step7_userid) AS step7_count.

When conducting type A test, it only needs to remove the filter condition “eventmessage” of each step.

## VI. Test SPL

SPL code:

1. Data conversion code

The following code is to save the devicetype and eventtype tables as bin file:

For sessions table, save the data as columnar composite table and segment by the first field.

SPL code:

For events table, convert the eventtime to long, save the data as columnar composite table and segment by the first field.

SPL code:

2. Funnel calculation code

Let’s take seven-step funnel calculation of type B test as an example, SPL code:

The given parameter arg_date is a date, such as 2023-03-21.

A3-A5: calculate the date 30 days before the parameter and the date 1 day after the parameter, and convert the three dates to long.

A6: set the seven event types in the table sequence eventtypes, which are used for calculation, as 1-7 in turn, and the rest are all null.

A8: filter out seven event types from the events table, retrieve one more day of data, and establish multiple columnar cursors.

A10-A11: merge the event and session columnar cursors in order, and group by userid.

A12: for each userid group, align and group the eventtype by the order of 1, 2, 3, … ,7, making them correspond to 7 event types respectively. The filter condition in e1 through e7 includes the judgment of substring.

A13: merge e1 and e2 in order by sessionid, calculate the min eventtime that satisfies condition from the merged result, and filter out min values that are not null.

A14: perform distinct calculation on devicetype in e1, and calculate the min eventtime and the min sessionid from the result of merging e1 and e2; starting from e3, filter out the records that meet sessionid and time conditions, i.e., the events in e3 to e7 that meet conditions.

A15: calculate the distinct devicetype based on the cursor in A14; calculate the min time in e2, e3…e7, and merge into the original cursor.

A16: group and aggregate the small result sets in A15; the grouping field is devicetype, calculate the count value of each step in each group.

A17: convert the devicetype in the table sequence in A16 from sequence number to name.

When conducting type A test, we only need to remove the filter condition of eventmessage in A12.

## VII. Test results

1. Type A test (without eventmessage condition):

2. Type B test (with eventmessage condition):

3. Notes

i). The values in the tables are the calculation time, measured in seconds.

ii). Due to long computing time, Oracle is tested only on VM1 in type A test.

iii). Since an error will occur to StarRocks when calculating two or more steps, only one step can be calculated.

## VIII. Comments on test results

1. For funnel calculation, SPL performs the best, StarRocks is worse, and Oracle is the worst. SPL is 2-5 times faster than StarRocks and 15-60 times faster than Oracle in computing speed.
2. In type A test, when the number of funnel calculation steps increases from 3 to 7, the computing speed of StarRocks reduces by nearly 2 times, while that of SPL changes little. The reason is that between SQL steps, the JOIN calculation needs to be performed on sessionid or userid, and the more steps, the greater number of JOIN calculations, resulting in a corresponding decrease in performance. In contrast, SPL combines the same userid and sessionid together for calculation, avoiding repeated JOIN operations.
3. When other test conditions are the same, the computing speed of type B test is generally slower than that of type A test. In comparison, however, SPL changes less than StarRocks in computing speed, for the reason that although StarRocks’ vector-based calculation can make use of CPU characteristics to speed up, this kind of calculation works only for simple data types like int/long and doesn’t work for complex data like string, and will result in a significant performance decrease when involving string computation.
4. StarRocks limits the upper limit of memory (80% capacity of physical memory). Once the memory capacity of VM decreases, OOM will occur, resulting in a failure to obtain result. Because the JOIN operation of SQL is executed in the memory, it will consume more memory space. Especially for calculation involving string in type B test, the memory will bear a greater load, and as a result, OOM is more likely to occur. By contrast, SPL avoids repeated JOIN, making it possible to obtain calculation result with less memory space.
5. Conclusions:

SPL can utilize ordered storage, making it an ideal tool to perform user behavior analysis calculations such as funnel statistics: the total data amount is large, yet the data amount of each userid is not. Storing the data in order by userid and sessionid can not only achieve extreme performance, but also make code more concise.

Funnel statistics is a quite complicated calculation. Although this calculation can be written in SQL, repeated join has to be employed. If this calculation is executed in the old database Oracle, it can get correct results, but the performance is poor. As a database with a short history, when StarRocks faces with such complex SQL code, errors would be reported directly. When only one calculation step is involved, StarRocks runs normally, but its performance is still far behind SPL; when multiple steps are involved, its performance will be poorer.