Python vs. SPL 11 – Many-to-One Association

Python vs. SPL

In Python vs. SPL 10 - One-to-N Association, we introduce one-to-one and one-to-N association. And this article will compare the computational abilities of Python and SPL in many-to-one association.

Foreign key association

When some fields of table A are associated with the primary key of table B, the associative fields of table A can be many, and the associative field of table B is distinct. Such scenario is a many-to-one association, also known as foreign key association, that is, table A is a fact table, and table B is a dimension table. The fields of table A associated with the primary key of table B are called the foreign keys of A to B, and table B is also called the foreign key table of A. For example:

There is a sale record table and a product information table. The calculation task is to aggregate the sale amount of each kind of product.

Some of the data in sale record table, and product information table are as follows:

sale record table (fact table):

recordidproductsale_cityamount
sr100001p1003c104380
sr100002p1005c103400
sr100003p1003c104626

Product information table (dimension table):

productidpclass
p1001A
p1002A
p1003B

Python

import pandas as pd
sr_file1="D:\data\SaleRecord.csv"
pt_file1="D:\data\Product.csv"
record1=pd.read_csv(sr_file1)
product1=pd.read_csv(pt_file1)
r_pt=pd.merge(record1,product1,left_on="product",right_on="productid")
pclass_sale=r_pt.groupby('pclass',as_index=False).amount.sum()
print(pclass_sale)
Fact table
Dimension table
Associate fact table with the foreign keys of dimension table

The merge function in Python associates two tables; sale record table “record1” is the fact table, and product information table “product1” is the dimension table. Many records in “record1” correspond to one record in “product1”, and the names of associative fields in two tables are different, so left_on and right_on mark the associative field of two tables respectively so that the two tables are associated as a wide table, then group and aggregate the records to get the final result.

SPL

AB
1D:\data\SaleRecord.csv
2D:\data\Product.csv
3=file(A1).import@tc()
4=file(A2).import@tc()
5=A3.switch(product,A4:productid)/convert foreign keys to records of dimension table
6=A5.groups(product.pclass;sum(amount):amount)

The switch function in SPL converts the foreign keys to corresponding records of the dimension table, and since they are records now, they can certainly reference to fields which can be used to perform grouping and aggregation operations during grouping.

One fact table & multiple dimension tables

One fact can be associated with multiple dimension tables, for example:

We continue to use the sale record table (fact table) and product information table (dimension table 1), and a new city information table (dimension table 2) is added. The calculation task is to count the sale amount of each kind of product in each province.

City information table (dimension table 2):

cityidnameprovince
c101BeijingBeijing
c102TianjinTianjin
c103HarbinHeilongjiang

Python

#continue to use sr_file1 and pt_file1
ct_file1="D:\data\City.csv"
ct1=pd.read_csv(ct_file1)
r_ct=pd.merge(record1,ct1,left_on="sale_city",right_on="cityid")
r_ct_pdt=pd.merge(r_ct,product1,left_on="product",right_on="productid")
ct_pdt_sale=r_ct_pdt.groupby(['province','pclass'],as_index=False).amount.sum()
print(ct_pdt_sale)
Associate fact table with dimension table 2
Associate fact table with dimension table 1
Group and aggregate

When associating multiple dimension tables, Python usually associates one table first and then the other table. After executing the merge function twice, a big wide table is generated which is used to perform the grouping and aggregate operations.

SPL

AB
/A3 is sale record table, and A4 is product information table
8D:\data\City.csv
9=file(A8).import@tc()
10=A3.switch(product,A4:productid;sale_city,A9:cityid)/set primary key
11=A10.groups(sale_city.province,product.pclass;sum(amount):amount)/group and aggregate

The switch function in SPL can create many foreign key associations simultaneously such as the ID number of product “product” and the “productid” in product information table, and ID number of city “sale_city” and “cityid” in city information table. More associations can be created if needed, and the fields of records can be used to perform grouping and aggregate operations after being associated. Different from Python, SPL can parse multiple associative relations at a time, which makes the association explicit and more efficient.

Reuse dimension table

One fact table may use the same dimension table multiple times, for example:

Based on sale record table and city information table, select the sale record whose sale city and producing city are the same one.

sale record table 2 (fact table):

recordidproductproduct_citysale_cityamount
sr100001p1006c105c103603
sr100002p1005c105c1021230
sr100003p1003c102c102885

City information table 2 (dimension table):

cityidnameprovince
c101BeijingBeijing
c102TianjinTianjin
c103HarbinHeilongjiang

Python

sr_file2="D:\data\SaleRecord2.csv"
ct_file2="D:\data\City2.csv"
record2=pd.read_csv(sr_file2)
ct2=pd.read_csv(ct_file2)
r_ct2=pd.merge(record2,ct2,left_on="sale_city",right_on="cityid")
r_ct_ct=pd.merge(r_ct2,ct2,left_on="product_city",right_on="cityid",suffixes=('_s', '_p'))
r_ct_p_ct= r_ct_ct[r_ct_ct['province_s']==r_ct_ct['province_p']].recordid
print(r_ct_p_ct)
Associate fact table with dimension table for the first time

Associate fact table with dimension table for the second time

The sale records in the example include the ID numbers of sale city and producing city, both of which can be associated with the “cityid” of city information table. Python still uses the same method, executing the merge function twice, and merge function will generate the same field names in the second time, but Python can handle such a problem successfully by adding a different suffix.

SPL

AB
13D:\data\SaleRecord2.csv
14D:\data\City2.csv
15=file(A13).import@tc()
16=file(A14).import@tc()
17=A15.switch(sale_city,A16:cityid;product_city,A16:cityid)/associate fact table with dimension table
18=A17.select(sale_city.province==product_city.province).(recordid)

SPL uses the switch function to associate the same dimension table, but with different foreign keys (sale_city and product_city), and then uses the associated record fields to select the target result.

Multi-layer dimension table

Foreign key association may involve more than one layer of dimension table. In other words, there are scenarios of multiple layers of the dimension table. For example:

Based on the sale record table, product information table, and city information table, select the sale record whose sale city and producing city are in the same province.

Sale record table (fact table):

recordidproductsale_cityamount
sr100001p1003c104380
sr100002p1005c103400
sr100003p1003c104626

City information table (dimension table 1):

cityidnameprovince
c101BeijingBeijing
c102TianjinTianjin
c103HarbinHeilongjiang

Product information table (dimension table 2):

productidproduct_city
p1001c104
p1002c103
p1003c102

Python

sr_file3="D:\data\SaleRecord3.csv"
ct_file3="D:\data\City3.csv"
pt_file3="D:\data\Product3.csv"
record3=pd.read_csv(sr_file3)
product3=pd.read_csv(pt_file3)
ct3=pd.read_csv(ct_file3)
pdt_ct=pd.merge(product3,ct3,left_on="product_city",right_on="cityid")
r_pdt_ct=pd.merge(record3,pdt_ct,left_on="product",right_on="productid")
r_pdt_ct_ct=pd.merge(r_pdt_ct,ct3,left_on="sale_city",right_on="cityid",suffixes=('_s', '_p'))
r_ct_p_ct2=r_pdt_ct_ct[r_pdt_ct_ct['province_s']==r_pdt_ct_ct['province_p']].recordid
print(r_ct_p_ct2)
Associate producing city with city

Associate sale record with produc

tAssociate sale city with city

The city information is the dimension table of both product and sale record; product information is also the dimension table of sale record, which constitutes multiple layers of dimension tables together, and there is dimension table that is associated multiple times. Python uses the merge function three times for three associations.

SPL

AB
20D:\data\SaleRecord3.csv
21D:\data\City3.csv
22D:\data\Product3.csv
23=file(A20).import@tc()
24=file(A21).import@tc()
25=file(A22).import@tc()
26=A25.switch(product_city,A24:cityid)/associate producing city with city
27=A23.switch(sale_city,A24:cityid;product,A26:productid)/associate sale record with city and product
28=A27.select(sale_city.province==product.product_city.province).(recordid)

Once an association is created, SPL can use it all the time, even when the association is created again. For example, we create associations on producing city and city in A26, and on sale record and product in A27; besides, the association between producing city and city still exists. Therefore, we can have reference of product.product_city.province in A28, which is quite convenient for multiple table association.

Self-association

Sometimes we may also encounter a scenario where a table is both a fact table and a dimension table, i.e., the table associates with itself. For example:

There is an employee information table, and the calculation task is to list names of all employees and their superiors.

Some of the employee information table are as follows:

empidnamesuperior
7902FORD7566
7788SCOTT7566
7900JAMES7698

Python

emp_file="D:\data\Employee_.csv"
emp=pd.read_csv(emp_file)
emp_s=pd.merge(emp,emp,left_on="superior",right_on="empid",suffixes=('', '_m'),how="left")
emp_s_name=emp_s[['name','name_m']]
print(emp_s_name)
Self associate

The operation of Python is still two-table association essentially.

SPL

AB
30D:\data\Employee_.csv
31=file(A30).import@tc()
32=A31.switch(superior,A31:empid)/self associate
33=A32.new(name,superior.name:s_name)

SPL also follows the same operation logic, using switch function to associate “superior” and “empid”.

Circleassociation

When associative relation is complex, circle association may occur. For example:

There is an employee information table and a department information table, and the calculation task is to select Beijing employees of Beijing manager.

Employee information tale:

empidnamedeptprovince
1Rebecca6Beijing
2Ashley2Tianjin
3Rachel7Heilongjiang

Department information table:

deptidnamemanager
1Administration20
2Finance2
3HR162

Python

emp_file2="D:\data\Employee_2.csv"
dept_file2="D:\data\Department2.csv"
emp2=pd.read_csv(emp_file2)
dept2=pd.read_csv(dept_file2)
d_emp=pd.merge(dept2,emp2,left_on="manager",right_on="empid")
emp_d_emp=pd.merge(emp2,d_emp,left_on="dept",right_on="deptid",suffixes=('', '_m'))
beijing_emp_m=emp_d_emp[(emp_d_emp['province']=="Beijing")& (emp_d_emp['province_m']=="Beijing")].name
print(beijing_emp_m)
Associate department table with employee table
Associate employee table with department table
Select

The above two associations are relatively independent from each other in Python. These two associations constitute a circle association to generate a wide table, and then the target result is selected.

SPL

AB
35D:\data\Employee_2.csv
36D:\data\Department2.csv
37=file(A35).import@tc()
38=file(A36).import@tc()
39=A38.switch(manager,A37:empid)/associate department table with employee table
40=A37.switch(dept,A38:deptid)/associate employee table with department table
41=A40.select(province=="Beijing"&&dept.manager.province=="Beijing").(name)/select

SPL handles such association in three steps: first, it creates association on department and employee; second, it creates association on employee and department; third, it directly selects the target result using the created associations. The association operations in the previous examples are all done with the switch function which possesses a feature: the original field values will be replaced with the associated records once the association is done, and the original record values will not exist any longer. If we want to keep the original record values, the join function can be used to perform the association. For example, A40 in the example can be written as:

A40=A37.join(dept,A38:deptid,~:dpt). At this time, “dept” is the associated records which can be referenced to perform the subsequent operations. And the switch function in the previous examples can all be used in this way.

Mixed association

During data analysis, we may encounter mixed associations where homo-dimension, primary-sub, and foreign key associations occur at the same time, and it is when the associative relations are very complex and need to be clearly sorted out. For example:

Based on the order table, order detail table, product information table, employee information table, travel information table, client information table, and city information table, the task is to calculate the sale amount of Heilongjiang products sold in each province by post-90s salesman who travel for more than 10 days.

The associative relations are shown below:

Python

emp4 = pd.read_csv("D:\data\Employee4.csv")

trv4 = pd.read_csv("D:\data\Travel4.csv")

emp_inf = pd.merge(emp4,trv4,on=["empid","name"])

years = pd.to_datetime(emp_inf.birthday).dt.year

emp_inf_c = emp_inf[(years>=1990) & (years<2000)&(emp_inf.time>=10)]

clt4 = pd.read_csv("D:\data\Client4.csv")

city4 = pd.read_csv("D:\data\City4.csv")

sale_location =
pd.merge(clt4,city4,left_on='city',right_on='cityid')

pdt4 = pd.read_csv("D:\data\Product4.csv")

pdt_location =
pd.merge(pdt4,city4,left_on='city',right_on='cityid')

detail4 = pd.read_csv("D:\data\Detail4.csv")

order4 = pd.read_csv("D:\data\Order4.csv")

detail_pdt =
pd.merge(detail4,pdt_location,on='productid',how="left")

order_sale_location =
pd.merge(order4,sale_location,on='clientid',how="left")

order_sale_location_emp = pd.merge(order_sale_location,emp_inf_c,left_on='saleid',right_on='empid',how="left",suffixes=('_c', '_e'))

order_inf =
order_sale_location_emp[order_sale_location_emp.empid.notnull()]

order_detail =
pd.merge(order_inf,detail_pdt,on='orderid',how="left",suffixes=('_s', '_p'))

order_detail_Hljp =
order_detail[order_detail.province_p=="Heilongjiang"]

res =
order_detail_Hljp.groupby(['empid','name_e','province_s'],as_index=False).price.sum()

print(res)
Employee table and travel table

Select the post-90s employees

Client table and city table

Product table and city table

Order detail and producing city

Order and sale city

Order and employee

Order and order detail

Select

Group and aggregate

There are many tables in this example with complex associative relations which are homo-dimension association (one-to-one), primary-sub association (one-to-many), and foreign key association (many-to-one), respectively. If there exists an association of many-to-many, it is most likely wrong, and the association needs to be rechecked, otherwise, many-to-many association probably leads to memory explosion. As for such complex associations, the best method provided in Python is to use the merge function to associate every two tables and parse each association step by step, which may be a bit troublesome but less prone to errors.

SPL

A
43=file("D:/data/Employee4.csv").import@tc()
44=file("D:/data/Travel4.csv").import@tc()
45=A44.join(empid,A43:empid,birthday)
46=A45.select((y=year(birthday),y>=1990&&y<2000&&time>=10))
47=file("D:/data/Client4.csv").import@tc()
48=file("D:/data/City4.csv").import@tc()
49=A47.join(city,A48:cityid,province)
50=file("D:/data/Product4.csv").import@tc()
51=A50.join(city,A48:cityid,province)
52=file("D:/data/Detail4.csv").import@tc()
53=file("D:/data/Order4.csv").import@tc()
54=A52.join(productid, A51:productid,province:product_province)
55=A54.group(orderid)
56=A53.switch(orderid, A55:orderid;saleid, A46:empid;clientid, A49:clientid)
57=A56.select(saleid).new(saleid.empid:empid,saleid.name:sale_name,clientid.province:sale_location,orderid.select(product_province=="Heilongjiang").sum(price):price)
58=A57.groups(empid,sale_name,sale_location;sum(price):price).select(price)

SPL is quite capable to handle such complex associations, in which the homo-dimension, primary-sub, and foreign key associations are all very clear. SPL can also associate two associations simultaneously, which is very fast and less error-prone.

Summary

When performing foreign key association, Python still copies data, and only parses one association at a time. In addition, every association is independent, so the association created previously can not be reused later, instead, it has to be re-associated, which results in low efficiency in association.

On the contrary, SPL can reuse the associations that created previously, making the operation much more effective.

Leave a Reply