# Python vs. SPL 12 – Big Data Processing

In data analysis, we often encounter data that is too big to fit in memory and has to be processed from the hard disk. In this article, we compare the abilities of Python and SPL to compute on data of this scale. Even larger volumes, such as PBs of data, require a distributed system to analyze and are out of the scope of this article.

## Aggregation

A simple aggregation only needs to traverse the data once, computing the aggregation column according to the aggregation target. For example: sum accumulates each value as it is read; count records the number of rows traversed; mean records both the accumulated total and the row count, then divides one by the other. Here we illustrate the sum calculation.

Based on the following file, calculate the total amount of orders.

Some of the data are:

```
orderkey  orderdate   state          quantity  amount
1         2008-01-01  Wyoming        10        282.7
2         2008-01-01  Indiana        3         84.81
3         2008-01-01  Nebraska       82        2318.14
4         2008-01-01  West Virginia  9         254.43
```

### Python

Python can read the file line by line, pick out the order amount field of each line, and accumulate it to get the sum. This needs no special skill, just hard-coding. Python can also use Pandas to read the file in blocks and sum block by block.
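A minimal sketch of the block-wise sum (assuming a tab-separated version of the file above, since state names contain spaces; an in-memory StringIO stands in for the big file, and the chunk size would be far larger in practice):

```python
import io

import pandas as pd

# Tab-separated stand-in for the big order file
order_file = io.StringIO(
    "orderkey\torderdate\tstate\tquantity\tamount\n"
    "1\t2008-01-01\tWyoming\t10\t282.7\n"
    "2\t2008-01-01\tIndiana\t3\t84.81\n"
    "3\t2008-01-01\tNebraska\t82\t2318.14\n"
    "4\t2008-01-01\tWest Virginia\t9\t254.43\n"
)

total = 0.0
# chunksize makes read_csv return an iterator of DataFrames instead of
# loading the whole file; a real chunk size would be millions of rows
for chunk in pd.read_csv(order_file, sep="\t", chunksize=2):
    total += chunk["amount"].sum()  # aggregate each block

print(total)  # combined sum over all blocks
```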

Pandas can read data in blocks: it is easy to aggregate each block, loop through all the blocks, and then combine the per-block results into the final sum. This method is slightly simpler to code and more efficient to compute, but it is still hard-coded and troublesome to write.

### SPL

SPL uses a cursor to compute on big data and provides abundant cursor functions. One of them is the total function, which performs multiple aggregations in a single traversal so that the cursor does not have to be created and traversed repeatedly. For example, to get the sum of the amount and the maximum quantity at the same time, the code can be written as:

```
A2.total(sum(amount),max(quantity))
```

Such a concise piece of code aggregates the total order amount and finds the maximum order quantity at the same time, which is much simpler and more efficient than the Python version.

## Filtering

The filtering operation is similar to aggregation: divide a big file into N segments, filter each segment separately, and finally union the results of all segments to get the target result.

Based on the data of the last example, extract the sales records of New York State.

### Small result set

#### Python

With Pandas, Python can segment the file, filter each segment, and collect the target result, but the code is still tedious to write. The filtering can also be done by reading the file line by line; that approach needs no special skill and follows the logic directly, so it is omitted here.
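A sketch of the chunked filter, again on a tab-separated stand-in file. Note that the sample above contains no New York rows, so a hypothetical one is added here for demonstration:

```python
import io

import pandas as pd

order_file = io.StringIO(
    "orderkey\torderdate\tstate\tquantity\tamount\n"
    "1\t2008-01-01\tWyoming\t10\t282.7\n"
    "2\t2008-01-01\tNew York\t5\t150.0\n"  # hypothetical row for the demo
    "3\t2008-01-01\tNebraska\t82\t2318.14\n"
    "4\t2008-01-01\tWest Virginia\t9\t254.43\n"
)

# Filter each block, then union the per-block results in memory
ny_orders = pd.concat(
    chunk[chunk["state"] == "New York"]
    for chunk in pd.read_csv(order_file, sep="\t", chunksize=2)
)

print(ny_orders)
```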

### Big result set

A big result set is one that does not fit in memory even after filtering.

#### Python

Since a big result set cannot be held in memory, the filtered result has to be stored on hard disk for later calculations. Pandas reads the file in blocks, filters each block, and appends the result to a file on hard disk.
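A sketch of the disk-backed variant, assuming the same tab-separated layout (the New York rows are hypothetical sample data); each filtered block is appended to a result file instead of being held in memory:

```python
import io
import os
import tempfile

import pandas as pd

order_file = io.StringIO(
    "orderkey\torderdate\tstate\tquantity\tamount\n"
    "1\t2008-01-01\tNew York\t5\t150.0\n"
    "2\t2008-01-01\tIndiana\t3\t84.81\n"
    "3\t2008-01-01\tNew York\t8\t402.5\n"
    "4\t2008-01-01\tWest Virginia\t9\t254.43\n"
)

out_path = os.path.join(tempfile.mkdtemp(), "ny_orders.csv")

write_header = True
for chunk in pd.read_csv(order_file, sep="\t", chunksize=2):
    matched = chunk[chunk["state"] == "New York"]
    # Append each filtered block to the result file on disk
    matched.to_csv(out_path, mode="a", header=write_header, index=False)
    write_header = False

# Later calculations re-open the stored result (in blocks again if needed)
print(pd.read_csv(out_path)["amount"].sum())
```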

#### SPL

SPL uses the select function to filter a cursor. It works differently from its in-memory counterpart: instead of executing the select immediately, it records the pending select action and actually executes it only when the data is fetched or exported.

Please note that A4 stores the small result set in memory while A5 stores the big result set on hard disk; the two cannot be executed at the same time.

## Sorting

Sorting big data is also a very common operation and may consume a lot of resources. For example:

Sort the orders according to the order amount.

### Python

Python provides no ready-made function for sorting on external storage, so it has to be hand-written. The code for an external merge sort is much more complicated than that for aggregation or filtering; it is close to an impossible task for many non-professional programmers, and its efficiency is not outstanding either.
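To give an idea of what that hand-written code involves, here is a minimal external merge sort sketch: cut the data into sorted runs spilled to disk, then k-way merge the runs with heapq. The sample data and run size are tiny, purely for illustration:

```python
import csv
import heapq
import io
import tempfile

# Stand-in for a big order file; only the columns needed for the sort
order_file = io.StringIO(
    "orderkey,amount\n"
    "1,282.7\n"
    "2,84.81\n"
    "3,2318.14\n"
    "4,254.43\n"
)

RUN_SIZE = 2  # rows per in-memory run; millions in a real setting


def spill(rows):
    """Sort one run in memory and write it to a temporary file on disk."""
    rows.sort(key=lambda r: float(r["amount"]))
    f = tempfile.TemporaryFile("w+", newline="")
    csv.writer(f).writerows((r["orderkey"], r["amount"]) for r in rows)
    f.seek(0)
    return f


# Phase 1: cut the input into sorted runs on disk
runs, rows = [], []
for row in csv.DictReader(order_file):
    rows.append(row)
    if len(rows) == RUN_SIZE:
        runs.append(spill(rows))
        rows = []
if rows:
    runs.append(spill(rows))


def run_rows(f):
    """Yield (amount, orderkey) pairs from one sorted run."""
    for orderkey, amount in csv.reader(f):
        yield float(amount), orderkey


# Phase 2: k-way merge; heapq.merge keeps only one row per run in memory
sorted_keys = [k for _, k in heapq.merge(*(run_rows(f) for f in runs))]
print(sorted_keys)  # orderkeys in ascending order of amount
```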

### SPL

SPL provides the sortx() function for sorting on a cursor. It is written almost the same way as the in-memory sort() function, except that it returns a cursor, from which the data can be fetched or exported to hard disk. This is an outstanding improvement over hard-coding in Python, both in writing and in computing.

## Grouping and association

Grouping and association on big data are hard to hand-code in Python, because a general solution involves hash partitioning on external storage, which is beyond most average programmers, so we won't illustrate the fully general Python code here.
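For completeness: when the grouped result itself fits in memory (e.g. one row per state), a two-pass chunked group-and-aggregate is still feasible in pandas; it is the general case, where even the grouped result overflows memory, that requires the hash-partitioning code. A sketch of the small-result case, on a tab-separated stand-in file with a hypothetical duplicate state added for demonstration:

```python
import io

import pandas as pd

order_file = io.StringIO(
    "orderkey\torderdate\tstate\tquantity\tamount\n"
    "1\t2008-01-01\tWyoming\t10\t282.7\n"
    "2\t2008-01-01\tIndiana\t3\t84.81\n"
    "3\t2008-01-01\tWyoming\t7\t100.0\n"  # hypothetical duplicate state
    "4\t2008-01-01\tWest Virginia\t9\t254.43\n"
)

# Pass 1: group-and-sum each block independently
partials = [
    chunk.groupby("state")["amount"].sum()
    for chunk in pd.read_csv(order_file, sep="\t", chunksize=2)
]

# Pass 2: the same state may appear in several blocks, so re-group the
# per-block results (which are small enough to fit in memory)
amount_by_state = pd.concat(partials).groupby(level=0).sum()
print(amount_by_state)
```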

Let’s dive into the grouping and association operations in SPL instead.

### Grouping

The calculation task is to aggregate the sale amount of orders in each state.

#### SPL

The groups function in SPL supports grouping and aggregation on a cursor and is used in the same way as its in-memory counterpart. Moreover, since it implements an efficient hash-grouping algorithm internally, it is both easy to write and efficient to compute.

### Association

Suppose there are ID fields of customers in the order table, and the customer information needs to be associated during the operation.

#### SPL

SPL provides plenty of cursor functions. Among them, the switch function is used similarly to its in-memory version, but it returns a deferred cursor and actually executes the switch action only when the code runs later. The whole program reads like its in-memory equivalent, so it is easy to write and understand, and computes fast.

## Summary

It takes Python a lot of effort to process big data, mainly because it provides no cursors or related operations for big data, so we have to hand-write the code, which is not only tedious but also inefficient.

SPL, by contrast, possesses a well-developed cursor mechanism, and most cursor functions are used in much the same way as their in-memory counterparts. They are therefore very programmer-friendly, and they execute efficiently thanks to effective internal algorithms.