SPL practice: data flow during speeding up batch job

Speeding up batch jobs is one of the major optimization scenarios of SPL, and storing the data of batch job into SPL’s high-performance file is an important step in the optimization process. The data that needs to be dumped usually involves two parts: historical cold data and periodic incremental data (added, deleted or modified data). This article will present how to dump and calculate these two parts of data, as well as how to perform periodic update and regular reorganization.

I. Dump the historical data

Composite table is a high-performance storage format provided by SPL; its principle is to sort the data in advance and then store the data compactly in a compressed manner. The advantage of composite table is that it occupies less space and allows us to quickly locate data record by means of ordered data characteristics.

Let’s take the ORDERS of TPC-H as an example. The code to dump the historical data from database to composite table is as follows:

AB
1fork 4.()=connect@l("oracle12c")
2=B1.cursor@x("SELECT O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT FROM ORDERS WHERE MOD(O_ORDERKEY,4)="/(A1-1)/"ORDER BY 1")
3=file("orders"/A1/".btx").export@b(B2)
4=directory("orders?.btx")
5=A4.(file(~).cursor@b()).merge(#1)
6=file("hisdata.ctx").create(#o_orderkey,o_custkey,o_orderstatus,o_totalprice,o_orderdate,o_orderpriority,o_clerk,o_shippriority,o_comment)
7>A6.append@i(A5)
8=A4.(movefile(~))

Executing this code will generate a composite table file ‘hisdata.ctx’ ordered by the primary key O_ORDERKEY.

II. The appending of incremental data only

2.1 Incremental data and historical data ordered as a whole

When the primary key value of incremental data is just after the maximum primary key value of historical data, the incremental data can be directly appended to the historical data.

For example, the maximum value of the primary key ‘O_ORDERKEY’ of the historical data composite table ‘hisdata.ctx’ is 6000000, and the primary key values of the incremental data ‘newdata.csv’ increase incrementally from 6000001:

O_ORDERKEYO_CUSTKEYO_ORDERSTATUS
136901O
278002O
32130057O

hisdata.ctx

O_ORDERKEYO_CUSTKEYO_ORDERSTATUS
6000001666O
600001154321F
600001212345O

newdata.csv

SPL code:

A
1=file("newdata.csv").cursor@ct()
2=file("hisdata.ctx").open().append@i(A1)

2.2 Incremental data and historical data ordered separately and not as a whole

2.2.1 Small data amount scenario

When the amount of historical data is small, we just need to merge the incremental data with the original table.

For example, the maximum value of the primary key ‘O_ORDERKEY’ of the historical data composite table ‘hisdata.ctx’ is 6000000, and there are primary key values smaller than 6000000 in the incremental data ‘newdata.csv’:

O_ORDERKEYO_CUSTKEYO_ORDERSTATUS
136901O
278002O
32130057O

hisdata.ctx

O_ORDERKEYO_CUSTKEYO_ORDERSTATUS
3444F
50066666O
600001212345O

newdata.csv

SPL code:

A
1=file("newdata.csv").cursor@ct()
2=file("hisdata.ctx").reset(;A1)

2.2.2 Large data amount scenario

As the amount of historical data increases over time, the time to merge with the original table becomes longer and longer. To solve this, we can divide the data composite table file into two parts: the historical data composite table ‘hisdata.ctx’ and the incremental data composite table ‘newdata.ctx’. In this way, we only need to append data to the incremental data composite table, and merge the incremental data with historical data after a period of time. For example, the incremental data needs to be updated daily and reorganized monthly:

AB
1if day(now())==1=file("hisdata.ctx").reset(;file("newdata.ctx").open().cursor())
2=file("newsdata.ctx").create@y(#o_orderkey,o_custkey,…)
3=file("newdata.ctx").reset(;file("newdata.csv").cursor@ct())

When fetching data, it needs to merge the historical data composite table with the incremental data composite table, for example:
=[file("newdata.ctx").open().cursor(),file("hisdata.ctx").open().cursor()].merge(#1)

It should be noted that the cursor sequence should be in the same order as data files, that is, [incremental data, historical data].

If the amount of data is larger, it may also involve dividing data into zones and storing as multi-zone composite table. Refer to: Regular maintenance routine for multi-zone composite table for details.

III. Modification and deletion of incremental data

The modification and deletion of incremental data usually do not involve very large amounts of data, so we only introduce the update method on a single composite table. If the amount of data is huge and multiple composite tables are needed, the multi-zone composite table can be used. For details, visit: Regular maintenance routine for multi-zone composite table.

3.1 Addition and modification only, no deletion involved

When involving the modification of incremental data, the historical data needs to be updated by primary key.

For example, the maximum value of the primary key ‘O_ORDERKEY’ of the historical data composite table ‘hisdata.ctx’ is 6000000, and there is an incremental data ‘newdata.csv’:

O_ORDERKEYO_CUSTKEYO_ORDERSTATUS
136901O
278002O
32130057O

hisdata.ctx

O_ORDERKEYO_CUSTKEYO_ORDERSTATUS
11111F
9999999O
600001212345O

newdata.csv

SPL code:

A
1=file("newdata.csv").cursor@ct()
2=file("hisdata.ctx").reset@w(;A1)

3.2 The deletion of data involved

The historical data is the same as above. The difference is that the incremental data ‘newdata.csv’ adds a status column ‘STATUS’ after its primary key ‘O_ORDERKEY’. This column records four different statuses: B (primary key change), D (delete), A (update), I (insert), and B and I always appear in pairs, for example:

O_ORDERKEYSTATUSO_CUSTKEYO_ORDERSTATUS
1B36901O
6000001I36902F
2A123314F
6000001A36902O
33D136777O
6000002I88888O
6000002A88888F

In this case, it needs to add a deletion flag column (the field can be named arbitrarily, we name it DEL here) after the dimension field when designing the composite table. The values in this column include false and true, which represent the valid record and the invalid record respectively, and the value is set as false by default. The following table is the historical data composite table ‘hisdata.ctx’:

C_CUSTKEYDELC_NAMEC_ADDRESS
1false36901O
2false78002O
32false130057O
33false66958F
34false61001O

It should be noted that when creating a composite table with the deletion flag, the create@d parameter needs to be used so as to make the first column after the dimension be the deletion flag column.

To describe the incremental data ‘newdata.csv’ by status, it is not difficult to convert it to the following form:

C_CUSTKEYDELC_NAMEC_ADDRESS
1true36901O
2false123314F
33true136777O
6000001false36902O
6000002false88888F

SPL code:

A
1=file("newdata.csv").cursor@ct()
2=file("hisdata.ctx").reset@w(;A1)

After merging the composite table that has a deletion flag column, the record that is true in the deletion flag column will be deleted from the composite table. When reading a composite table, we can choose not to read the deletion flag column in the cursor, for example: =file("hisdata.ctx").open().cursor(o_orderkey,o_custkey,o_orderstatus,…)

Leave a Reply