SPL Practice: integerization during data dump

Using SPL for performance optimization, such as converting data types like string to integer during data dump, can reduce storage space and improve computing performance. This article will present how to implement integerization through a practical example.

Problem description

The following table is the data structure of a certain space-time collision problem:

Field nameField typeField meaningRemarksSample data
noStringObject flagUnique flag of object100000000009
ctIntTimestampUnix timestamp (seconds)1690819200
lacStringSpace flag 140000
ciStringSpace flag 266000000

After understanding the business, we know that all the values of ‘no’ are composed of numbers and can be directly converted to Long integer; ‘lac’ and ‘ci’ always appear in pairs and can be merged as one field called ‘loc’, and the count range of ‘loc’ after de-duplication is approximately 270,000, making it possible to use ‘Int’ integer.

Since the field ‘no’ only involves comparison in subsequent calculations and does not participate in positioning calculation, it does not need to be sequence-numberized, and we just need to convert its value to integer. On the contrary, the field ‘loc’ will involve positioning calculation in subsequent calculations, so it needs to be sequence-numberized, that is, converting its values to natural numbers starting from 1.

Data integerization and dump

The data structure after dumping is as follows:

Field nameData typeField meaningRemarksSample data
noLongObject flagUnique flag of object100000000009
ctIntTimestampUnix timestamp (seconds)1690819200
locIntSpace flagSequence number of space flag10282

Compared with the original data structure, the following two changes are made during data dump:

1. merge the ‘lac’ and ‘ci’ fields as one ‘loc’ field and convert their values to Int sequence number. The original ‘lac’ and ‘ci’ fields, as dimension table, are stored separately;

2. convert the data type (number string) of the ‘no’ field to Long integer.

SPL code:

2=loc_list=A1.query("SELECT DISTINCT lac,ci FROM testdata").keys@i(lac,ci)
4=A1.cursor@x("SELECT no,ct,lac,ci FROM testdata").new(long(no):no,int(ct):ct,loc_list.pfind(lac,ci):loc)

A1: connect to the database; the @l option represents that the returned field name and table name are in lowercase;

A2: fetch the de-duplicated space flags from the database and create an index;

A3: save the dimension table information with space flags to a bin file;

A4: while fetching data from the database, convert the values of ‘no’ to long integer, and utilize the index to find out the sequence number corresponding to ‘loc’ and ‘ci’ from loc_list;

A5 - A9: save the ordered results of ‘no’ and ‘ct’ to a composite table.

Restore through sequence number

Since the dumped fact table loses the information of original ‘lac’ and ‘ci’ fields, we need to restore such information by utilizing the above-mentioned bin file saved when dumping. For example, the following code is to restore the ‘lac’ and ‘ci’ values of a certain calculation result - table sequence T(no, ct, loc):


A1: read the dimension table;

A2: since the value of loc in T correspond to the sequence number of record of dimension table in A1, we can directly take the field corresponding to A1(loc).

New data processing

The data will continue to increase, and the space flag on new data may also be new, so we need to append the data with new space flag to the original dimension table. SPL code:

3=A2.query@x("SELECT DISTINCT lac,ci FROM testdata where ct >="/(long(date(now()))\1000))

In this way, when dumping new data for a day, we just need to store new data into the dimension table bin file ‘loc_list.btx’ where the data has been appended, which will not affect the corresponding relationship of original space flag sequence number. SPL code:

3=A1.cursor@x("SELECT no,ct,lac,ci FROM testdata where ct >="/(long(date(now()))\1000)).new(long(no):no,int(ct):ct,loc_list.pfind(lac,ci):loc)

Leave a Reply