# SPL practice: query massive and flexible structured data

## Problem description

A flexible data structure means that each record of a data table may have a different structure. Usually, the fields of such a table are divided into two parts: fields common to all records (common fields for short), and fields that vary from record to record (flexible fields for short). There may be hundreds of flexible fields, yet each record may contain only a few of them. In complex situations, the flexible fields are further classified.
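As an illustration (in Python rather than SPL, with hypothetical field names), such records can be pictured as dictionaries sharing a few common keys while the flexible keys differ from record to record:

```python
# Hypothetical records: "id", "name", "gender" are common fields;
# the remaining keys are flexible fields that differ per record.
students = [
    {"id": 1, "name": "A", "gender": "male", "Physics": 80},
    {"id": 2, "name": "B", "gender": "female", "Physics": 99, "Biology": 70},
    {"id": 3, "name": "C", "gender": "male", "Chemistry": 55},
]

common = {"id", "name", "gender"}
# Collect every flexible field that appears in any record:
# a wide table would need a column for each, mostly left blank.
flexible = sorted({k for r in students for k in r} - common)
print(flexible)  # → ['Biology', 'Chemistry', 'Physics']
```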

Here is a student score table, and this table does not involve classification:

Some students do not take all subjects as elective courses, so they have no score for the subjects they did not elect; these are represented by a slash.

The following is a sports student score table, and this table involves classification:

The common fields contain the basic attributes of each student. The flexible fields are classified into different attributes (basketball and football in this example), and each attribute has the same classes (comments [string], national player or not [boolean], and score [value] in this example).

Now we will focus on the complex situation involving classification; the situation without classification is a special case of it.

Let’s take the following query criteria as an example:

Find out the records whose gender is “male” in the common field (primary table)

and,

find out, in the flexible fields (sub table), the records whose class value “national player or not” under the attribute “basketball” is “yes”, and whose class value “score” under the attribute “football” is less than 60.

In essence, the data structure described in the two examples can be logically regarded as one big wide table. However, when there are many attributes and classes, this table will have thousands of fields (common fields + all possible attributes × classes). Materializing it as a physical wide table results in far too many fields, most of which are blank.

In general, SQL describes such data structure through the following two tables:

Main table (common field)

Sub table (flexible field)

In the sub table, the classes correspond to the fields comment, nt, and rate. To query the records according to the above-mentioned criteria, the SQL code will be:

```sql
SELECT *
FROM main m, sub s
WHERE m.id = s.id AND
  (SELECT count(*)
   FROM sub ss
   WHERE ss.id = m.id
     AND ((ss.atr = 1 AND ss.nt = 1) OR (ss.atr = 2 AND ss.rate < 60))
  ) = 2
```
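To make the SQL concrete, here is a self-contained sketch using SQLite in Python. The table and column names (`atr`, `nt`, `rate`) follow the example above, the sample data is made up, and the outer query is simplified to return only the main-table rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE main (id INTEGER PRIMARY KEY, name TEXT, gender TEXT)")
c.execute("CREATE TABLE sub (id INTEGER, atr INTEGER, comment TEXT, nt INTEGER, rate REAL)")
# Student 1 satisfies both conditions; student 2 satisfies only one.
c.executemany("INSERT INTO main VALUES (?,?,?)",
              [(1, "C", "male"), (2, "D", "male")])
c.executemany("INSERT INTO sub VALUES (?,?,?,?,?)",
              [(1, 1, "good at...", 1, 20),    # basketball, national player
               (1, 2, "skillful...", 1, 10),   # football, rate < 60
               (2, 1, "fast...", 0, 90),
               (2, 2, "strong...", 1, 30)])
rows = c.execute("""
    SELECT m.id, m.name
    FROM main m
    WHERE m.gender = 'male' AND
      (SELECT count(*) FROM sub ss
       WHERE ss.id = m.id
         AND ((ss.atr = 1 AND ss.nt = 1) OR (ss.atr = 2 AND ss.rate < 60))
      ) = 2
""").fetchall()
print(rows)  # only student 1 matches both sub-table conditions
```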

If this query task is handled in SPL, there are usually three methods:

## I. Use json-style field

For such a flexible data structure, a json-style field is usually used.

When no classification is involved, the json string is a record. For example, the scores of student B in the student score table:

```json
{
  "Physics": 99,
  "Biology": 70
}
```

When classification is involved, the json string is a table sequence. For example, the scores of student C in the sports student score table (written here as valid json, with the boolean class as true/false):

```json
[
  {
    "Attribute": "basketball",
    "Comments": "the student is good at...",
    "National player or not": true,
    "Score": 20
  },
  {
    "Attribute": "football",
    "Comments": "the student is skillful at...",
    "National player or not": true,
    "Score": 10
  }
]
```

Using the json-style field increases the storage amount (each record or table sequence stores the field names again), and the json must always be parsed into a record or table sequence before calculation, which results in poor performance. Still, SPL processes json-style fields very conveniently, and the corresponding code is fairly conventional, so we won’t go into details here.
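The parse-before-filter cost can be sketched in Python: every query must first deserialize the json-style field before any condition can be evaluated. Field names follow the example above; the rows are made up:

```python
import json

# Each row: common fields plus a json-style flexible field (a string).
rows = [
    {"id": 1, "gender": "male", "flex": json.dumps([
        {"Attribute": "basketball", "National player or not": True, "Score": 20},
        {"Attribute": "football",   "National player or not": True, "Score": 10},
    ])},
    {"id": 2, "gender": "male", "flex": json.dumps([
        {"Attribute": "football", "National player or not": False, "Score": 80},
    ])},
]

def matches(row):
    # The json string must be parsed on every query --
    # exactly the overhead described in the text.
    recs = json.loads(row["flex"])
    byattr = {r["Attribute"]: r for r in recs}
    return (row["gender"] == "male"
            and byattr.get("basketball", {}).get("National player or not") is True
            and byattr.get("football", {}).get("Score", 999) < 60)

print([r["id"] for r in rows if matches(r)])  # → [1]
```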

Now let’s look at two other methods.

## II. Use sequence type field

Taking the aforementioned “sports student score table” as an example, using the sequence type field means regarding each attribute of the sub table as a sequence field, with each class converted to the corresponding sequence. See the table below for details:

### 2.1 Data structure

Define the test data structure as follows:

Main table (common field)

Sub table (flexible fields)

### 2.2 Retrieve the data from the database and store it

The code for retrieving the data from the database:

### 2.3 Generate test data

For the convenience of testing, we write the following SPL code to directly generate a composite table file seq_all.ctx:

### 2.4 Query

Case 1: the age is greater than or equal to 20 and less than or equal to 25, and the classes include 1, 3, 6

Case 2: the class value “national player or not” under the attribute “basketball” is “yes”, and the class value “score” under the attribute “football” is less than 60

A5 is a conditional expression in which j joins the multiple sequence fields into a table sequence, and then counts how many of its records satisfy the condition. If the count equals the number required by the condition, the record should be taken.
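The sequence-field idea can be mimicked in Python: each record keeps the sub-table columns as parallel sequences, which are zipped back into rows only when a condition is checked, and a record qualifies when the number of matching rows equals the number of conditions (attribute codes 1 = basketball, 2 = football, as in the SQL example; the data is made up):

```python
# Each record keeps the sub-table columns as parallel sequences.
records = [
    {"id": 1, "gender": "male",
     "atr": [1, 2], "nt": [True, True], "rate": [20, 10]},
    {"id": 2, "gender": "male",
     "atr": [1, 2], "nt": [False, True], "rate": [90, 30]},
]

def qualifies(rec):
    # "Join" the sequences into rows and count rows satisfying either
    # condition; take the record when the count equals the number of
    # conditions (2), i.e. both conditions are met.
    rows = zip(rec["atr"], rec["nt"], rec["rate"])
    hits = sum(1 for atr, nt, rate in rows
               if (atr == 1 and nt) or (atr == 2 and rate < 60))
    return hits == 2

print([r["id"] for r in records if qualifies(r)])  # → [1]
```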

## III. Use attached table

The data structure in 2.1 can be regarded as a primary-sub table relationship. When two tables are in a primary-sub relationship, we can regard the common field part as the base table, and the flexible field part as an attached table. For more information about attached tables, visit: SPL Attached Table

### 3.2 Generate test data

The script for generating the attached table data based on the file generated in 2.3:

### 3.3 Query

The query criteria are the same as those described in 2.4.

### 3.4 Search twice

When using the sequence type field, we can use pre-cursor filtering to improve performance. With an attached table, however, evaluating conditions on the sub table requires reading all columns first, even though some of the fields are not needed. If conditional filtering yields few results, many useless fields will have been read.

Therefore, when using an attached table, it is often faster to first retrieve the primary keys using only the condition-related fields, and then search for the full records. For the sequence type field, in contrast, this is unnecessary.

First, find out the primary key sequence of the records that meet the conditions, and then fetch the results through this sequence. SPL script:
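The two-step idea can be sketched in Python with a hypothetical columnar layout: phase one scans only the condition-related columns to collect primary keys, and phase two fetches the remaining (wide) columns just for those keys:

```python
# Hypothetical columnar layout: one list per column, aligned by position.
ids     = [1, 2, 3]
gender  = ["male", "male", "female"]
nt_bb   = [True, False, True]   # "national player" under basketball
rate_fb = [10, 30, 5]           # "score" under football
comment = ["long text 1...", "long text 2...", "long text 3..."]  # wide column

# Phase 1: touch only the condition-related columns, collect the keys.
keys = [ids[i] for i in range(len(ids))
        if gender[i] == "male" and nt_bb[i] and rate_fb[i] < 60]

# Phase 2: fetch all columns, but only for the matching keys.
pos = {k: i for i, k in enumerate(ids)}
result = [(k, comment[pos[k]]) for k in keys]
print(result)  # the wide "comment" column was read only for the hits
```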

## IV. Use row-based storage

SPL’s columnar storage adopts data blocks and a compression algorithm, which reduces the volume of data accessed and gives better performance for traversal operations. But for index-based random data fetching, the cost is much higher due to the extra decompression and the fact that each fetch reads a whole block. For such scenarios, columnar storage therefore performs worse than row-based storage in principle.

For the same data, it is impossible to achieve optimal performance in both traversal and random data fetching. To obtain the best performance for both, we can store a redundant row-based composite table file, and build an index on the primary keys of this composite table.

The script for converting to row-based storage and creating an index is as follows:

First, find out the primary keys using the attached table, and then fetch the results from the row-based storage through the primary-key index. The script is as follows:
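The redundant row store plus primary-key index can be sketched in Python: the row file keeps each record whole, and a hash index maps primary key → byte offset, so a random fetch reads exactly one record instead of scanning or decompressing blocks:

```python
import io
import json

# Build a row-based store in a byte buffer; the index maps key -> offset.
rows = [{"id": 1, "name": "C", "score": 20},
        {"id": 2, "name": "D", "score": 30}]
buf, index = io.BytesIO(), {}
for r in rows:
    index[r["id"]] = buf.tell()          # remember where the row starts
    buf.write((json.dumps(r) + "\n").encode())

# Random fetch: one seek + one read per key -- no full scan, and no
# decompression of unrelated rows (the advantage over columnar storage).
def fetch(key):
    buf.seek(index[key])
    return json.loads(buf.readline())

print(fetch(2)["name"])  # → D
```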