SPL: order-related grouping

Sometimes the order of the data matters when grouping. We may need to group adjacent records that share the same field values or that meet a certain condition: for example, finding the nation with the longest run of consecutive first places in the Olympic gold medal table, or finding the longest run of days on which a stock's closing price rose. This is where order-related grouping comes in.

1. Grouping by consecutive identical values

When an ordered set is grouped this way, a new group is created each time the value of the grouping field changes.

[e.g. 1] Based on the Olympic medal table, find the nation with the most consecutive first places and its medal information. Some of the data are as follows:

Game  Nation  Gold  Silver  Copper
30    USA     46    29      29
30    China   38    27      23
30    UK      29    17      19
30    Russia  24    26      32
30    Korea   13    8       7

The @o option of the A.group() function in SPL creates a new group whenever the grouping field value changes.

The SPL script looks like this:

    A
1   =T("Olympic.txt")
2   =A1.sort@z(GAME,GOLD,SILVER,COPPER)
3   =A2.group@o1(GAME)
4   =A3.group@o(NATION)
5   =A4.maxp(~.len())

A1: import the Olympic medal table.

A2: sort by Games and then by the number of gold, silver, and bronze medals, all in descending order.

A3: take the first record of each Games. Because A2 is sorted in descending order of medals, that record is the first-place nation of the Games.

A4: create new groups when nations change.

A5: select the group with the most members, which is the nation's longest run of consecutive first places.
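For readers who want to see the grouping rule itself, here is a minimal Python sketch of the same idea (it is not SPL and not how SPL implements @o). It assumes the first-place nation of each Games has already been picked out, and the nation names X and Y and the Games numbers are made up for illustration. Note that X ends up in two separate groups, which is what distinguishes this from ordinary grouping by value.

```python
from itertools import groupby

# Hypothetical (game, first-place nation) pairs, already in Games order.
leaders = [(1, "X"), (2, "Y"), (3, "Y"), (4, "Y"), (5, "X"), (6, "X")]


def consecutive_groups(records, key):
    """Split records into runs of adjacent items that share the same key."""
    return [list(run) for _, run in groupby(records, key=key)]


# A new group starts whenever the nation differs from the previous record.
runs = consecutive_groups(leaders, key=lambda r: r[1])

# The longest run is the most consecutive first places.
longest = max(runs, key=len)
print(longest)
```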

2. Grouping by adjacent conditions

When an ordered set is grouped this way, a new group is created each time the grouping condition evaluates to true.

[e.g. 2] What is the longest run of consecutive trading days on which the closing price of the Shanghai Composite Index rose in 2020? (The first trading day is counted as a rise.) Some of the data are as follows:

DATE        CLOSE      OPEN       VOLUME     AMOUNT
2020/01/02  3085.1976  3066.3357  292470208  3.27197122606E11
2020/01/03  3083.7858  3089.022   261496667  2.89991708382E11
2020/01/06  3083.4083  3070.9088  312575842  3.31182549906E11
2020/01/07  3104.8015  3085.4882  276583111  2.88159227657E11
2020/01/08  3066.8925  3094.2389  297872553  3.06517394459E11

The @i option of the A.group() function in SPL creates a new group whenever the condition evaluates to true.

The SPL script looks like this:

    A
1   =T("SSEC.csv")
2   =A1.select(year(DATE)==2020).sort(DATE)
3   =A2.group@i(CLOSE<CLOSE[-1])
4   =A3.max(~.len())

A1: import the Shanghai Composite Index table.

A2: select the records of 2020 and sort them in ascending order of date.

A3: create a new group when the closing price is less than the closing price of the previous day.

A4: get the size of the largest group, which is the maximum number of consecutive rising days.
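The rule "start a new group when the condition is true" can be sketched in Python as follows. This is only an illustration of the logic, not SPL; the helper name group_on_condition is made up, and the closing prices are the five sample rows shown above rather than the full 2020 data.

```python
def group_on_condition(records, starts_new_group):
    """Start a new group whenever starts_new_group(previous, current) is true."""
    groups = []
    for i, rec in enumerate(records):
        if i == 0 or starts_new_group(records[i - 1], rec):
            groups.append([rec])    # first record, or condition true: open a new group
        else:
            groups[-1].append(rec)  # condition false: extend the current group
    return groups


# Closing prices from the sample rows, in ascending date order.
closes = [3085.1976, 3083.7858, 3083.4083, 3104.8015, 3066.8925]

# A fall in the closing price ends the current run of rising days.
groups = group_on_condition(closes, lambda prev, cur: cur < prev)
longest_rise = max(len(g) for g in groups)
print(longest_rise)
```

Because the first record always opens a group, the first trading day is counted as part of a rising run, which matches the assumption stated in the example.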

3. Grouping by sequence numbers

Sometimes the group number (that is, which group each member should be assigned to) can be obtained directly or indirectly. In that case, we can group by the group number directly.

[e.g. 3] Divide the employees into three groups by length of service (if the number of employees does not divide evenly by three, the remaining employees are assigned to particular groups), and calculate the average salary of each group. Some of the data are as follows:

ID  NAME     BIRTHDAY    ENTRYDATE   DEPT     SALARY
1   Rebecca  1974/11/20  2005/03/11  R&D      7000
2   Ashley   1980/07/19  2008/03/16  Finance  11000
3   Rachel   1970/12/17  2010/12/01  Sales    9000
4   Emily    1985/03/07  2006/08/15  HR       7000
5   Ashley   1975/05/13  2004/07/30  R&D      16000

The @n option of the A.group() function in SPL groups by sequence number: records with the same number are assigned to the same group (number N goes to group N, with N starting at 1).

The SPL script looks like this:

    A
1   =T("Employee.csv").sort(ENTRYDATE)
2   =A1.group@n((#-1)*3\A1.len()+1)
3   =A2.new(#:GROUP_NO, ~.avg(SALARY):AVG_SALARY)

A1: import the employee table and sort it by entry date.

A2: compute each record's group number from its row number after sorting, and group the records by that number.

A3: calculate the average salary of each group.
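To make the group-number formula concrete, here is a small Python sketch that applies the same arithmetic to the five sample rows above. It is an illustration only, not SPL: the function name group_number is made up, and Python's // stands in for the integer division in the A2 expression.

```python
from collections import defaultdict

# (NAME, ENTRYDATE, SALARY) taken from the sample rows above.
employees = [
    ("Rebecca", "2005/03/11", 7000),
    ("Ashley", "2008/03/16", 11000),
    ("Rachel", "2010/12/01", 9000),
    ("Emily", "2006/08/15", 7000),
    ("Ashley", "2004/07/30", 16000),
]


def group_number(position, total, parts=3):
    # Same arithmetic as (#-1)*3\A1.len()+1, with a 1-based position.
    return (position - 1) * parts // total + 1


# Dates in YYYY/MM/DD form sort correctly as plain strings.
ordered = sorted(employees, key=lambda e: e[1])

salaries_by_group = defaultdict(list)
for position, (_, _, salary) in enumerate(ordered, start=1):
    salaries_by_group[group_number(position, len(ordered))].append(salary)

avg_salary = {g: sum(s) / len(s) for g, s in sorted(salaries_by_group.items())}
print(avg_salary)
```

With five employees the formula yields group numbers 1, 1, 2, 2, 3, so the groups have two, two, and one members.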
