Python vs. SPL 4 – Selection and Positioning

Selecting subset from a set is a very common operation, for example, selecting members who are more than 40 years old from all the members in the company. In this article, we’ll compare the selection operations between Python and SPL.

Select the member at a specified position

Based on the employee information table of the company, select the employees in even positions.

Python

The index of Python starts from 0, so the row indexes (i.e., sequence number) in even positions are odd numbers. We use the query(...) function to select the members whose row indexes are odd numbers (i.e., the even positions). But we’ll have to create a list of even positions and then select the required members using the iloc function if the row indexes are not numbers or ordered indexes.

SPL

In A.select(...) function, we can use symbols “~” and “#” to indicate the current member and its position respectively. Here we use #%2==0 to indicate the elements in even positions.

Select the eligible members

Selecting the members that satisfy the conditions is the most common selection operation, and here we still illustrate with the employee information table.

To query the full name and salary of the employees of the R&D department in New York State.

Python

In simple logic, Python uses the query(...) function to select the members that satisfy the conditions. But it is a bit troublesome to generate a target table based on the selection results.

SPL

In SPL, we use the same A.select() function to select by position or by condition, and the operation of generating the target table is also very easy.

Select the positions/indexes of eligible members

When performing the selection, we want to get the positions or indexes of the eligible members and then use the information to execute other operations.

For example, to select the positions or indexes of the employees who are over 50 years old in the employee information table.

Python

In Python, there is no ready-made functions for calculating age, which can only be completed with the help of numpy and datetime libraries, besides, math library should also be used to calculate integers. These requirements are already very messy, what’s worse, Pandas library is based on numpy and some functions are not fully inherited. For example, Pandas have the pd.Timedelta()function which calculates the time difference, but only the options of month, day, hour, minute, and second are available; the year option is not included. In addition, the anywhere() function can only be used on the converted data structure of “array”. Therefore, the simple conditional filtering requires multiple transition steps, making it very burdensome.

SPL

The age()function in SPL is used to calculate the age, which is much more convenient. Also, in accordance to the select function, SPL provides the pselect@a() function to return the positions of the eligible members, which can be achieved with just one line of code. Many selection functions in SPL have their corresponding functions of returning positions, for example, pmax(), pmin(), ptop(), etc.

Select the member with maximum/minimum value

Sometimes we also want to query the full information of the members with extreme values. For example:

To query the information of the employee with the highest salary.

Python

When selecting a member with maximum/minimum value, Python uses the argmax()or argmin() function to get the index of the target value and then select the member according to the index, which is more efficient. However, the argmax()and argmin() functions do not return all the indexes of the same value, so we need to traverse the data twice (the first traversal calculates the maximum and minimum values; the second one fetches the values) if more than one members with maximum/minimum values need to be returned. In this way, the efficiency is degraded by half, and the logic of coding is different from the previous one, making it hard to memorize.

SPL

SPL provides maxp()and minp() functions to return the member with maximum/minimum value, and @a option is used to return all the members with target value. Both functions traverse the data once to return the results and are very easy to memorize in terms of the syntax.

Many functions in SPL have three forms: selecting the value itself such as max/min; selecting the members corresponding to target values such as maxp/minp; selecting the positions corresponding to target values such as pmax/pmin. All of them are in the same style and very easy to memorize.

For instance, to query the position of the employee with highest salary in the employee information table.

=A2.pmax@a(SALARY). This simple code is actually competent to get the result.

Select topN members

Selecting topN members is also a common operation, for example:

To query the information of the first 10 employees joining the company.

Python

The topN operation in Python first sorts the data and selects the top N members, which is acceptable for small data amount. But this method will be less efficient with large data amount.

SPL

The top()function in SPL is an aggregation function, which is similar to the max() function, but max()function keeps one maximum value while top() function keeps a sequence of topN. The later function dose not sort the whole data, thus decreasing the extra consumption of sorting.

Similarly, we can use the ptop() function to query the positions of topN members.

Summary

Some basic selections in Python are relatively easy, such as position selection and conditional selection; some selection operations are a bit convoluted, such as selecting the eligible positions and maximum/minimum values; and some are too ineffective such as topN selection. It is also quite complicated to switch between Pandas and Numpy libraries frequently.

However, SPL provides abundant selection and positioning functions of an uniform style, making it very easy to complete the selection and positioning operations whether they are simple or complex.