User Behavior Analysis in Practice 8: The Changing Dimension Table

User Behavior Analysis in Practice 8: The Changing Dimension Table

Target task

We have a user events table T. Below is its structure and part of its data:

TimeUserIDProductIDQuantity
2022/6/1 10:20107275517
2022/6/1 12:12107803028
2022/6/1 12:36100509333
2022/6/1 13:21104865549
2022/6/1 14:46103782455
2022/6/1 15:19104962664
2022/6/1 16:00100929676
2022/6/1 16:39107071387
2022/6/1 17:40109088494

Fields in table T:

Field nameData typeDescription
TimeDatetimeTime stamp of an event, accurate to milliseconds
UserIDStringUser ID
ProductIDIntegerProduct ID
QuantityNumericQuantity

Dimension table Product:

ProductIDProductNameUnitOriginProductType
1ApplePoundShandongFruits
2TissuePacksGuangdongHome&Personalcare
3BeefPoundQingdaoMeat
4WineBottlesShanxiBeverage
5PorkPoundXizangMeat
6BreadPacksBeijingBakery
7JuiceBottlesXinjiangBeverage

Fields of dimension table Product:

Field nameData typeDescription
ProductIDStringProduct ID
ProductNameStringProduct name
UnitStringUnit
ProductTypeIDStringProduct type
OriginNumericProduct origin

Computing task:

Find the sales quantity of each type of product of each place of origin within the specified time period.

One thing we should take into consideration is that the place of origin is not fixed, and get sales quantity of a type of product according to the place of origin in the current transaction.

Techniques involved:

Employing time key on the dimension table. You can find more about SPL time key mechanism in Why Are There So Many Snapshot Tables in BI Systems?.

1. Generate a dimension table with time key according to the dimension table in the production system at regular time interval.

Add a time key field to Product.btx. Here’s the dimension table’s new structure:

Field nameData typeDescription
eTimeDatetimeEffective time for the current record
ProductIDStringProduct ID
ProductNameStringProduct name
UnitStringSales unit
ProductTypeIDStringProduct type
OriginNumericPlace of origin

Perform daily data dump on Product.btx with a time key according to the updated information in the production system’s Product table every day.

2. Summarize data using the dimension table with a time key

Define a new composite primary key using eTime and ProductID for the dimension table while specifying eTime as the time key. Table T is associated with Product through Time and ProductID fields. Programmers do not need to take care of how to achieve the time key because SPL is designed to automatically handle the related computing logic. They just use the dimension table as a regular one.

Sample code

1. Dump dimension table Product.btx.

For original data: Directly add the time key field.

A
1=connect("demo").query@x("select * from Product").derive(now():eTime)
2= file("Product.btx").export@b(A1)

When data is updated: Import the new dimension table, get the record with the latest time under each type of product from the dumped btx file with the time key, compare them with the new table, and append a new record to the btx file for the updated data and record a new effective time.

A
1=connect("demo").cursor@x("select * from Product")
2= T("Product.btx").keys@it(ProductID,eTime)
3=A1.select(cmp(~.array(),A2.find(A1.ProductID).array().to(2,))!=0)
4= A3.fetch().derive(now():eTime)
5=file("Product.btx").export@ab(A4)

A1 Import the new dimension table and set index on the primary key.

A2 Import the old dimension table Product from the btx file, get record having the latest time for each product, and set index on the primary key.

@t option means the last key field is the time key.

A3 Get records under each primary key value from the new dimension table where the latest (largest) eTime is different from that in the records under same key value in the old dimension table. The find function will automatically select the record having the latest time corresponding to same key value.

A4 Add effective time field to A3’s records.

A5 Append and export A4’s records to the bin file.

2. Join the fact table and the dimension table and perform aggregation

Suppose T.ctx is already generated as the above explains and sorted by Time:

A
1>Product=T("Product.btx").keys@ti(ProductID,eTime)
2>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
3=file("T.ctx").open().cursor(ProductID,Quantity,Time;Time>=start && Time<=end)
4=A3.switch(ProductID:Time,Product)
5=A4.groups(ProductID.ProductType,ProductID.Origin; sum(Quantity):Quantity)

A1 Import dimension table Product, and set index on the primary key while specifying the time key.

A4 Join the fact table and the dimension table using the familiar SPL join syntax. As the joining fields contains a time field, the operation will find the record having the latest time (the largest time that is smaller than eTime) before the current time field value.

Execution result:

ProductTypeOriginQuantity
FruitsShandong1241628
FruitsXinjiang546357
FruitsHainan24526
Home&PersonalcareGuangdong7411008
MeatQingdao3303230
MeatNeimeng657546
MeatXizang2456235
BakeryBeijing247673
BeverageXinjiang3526574
BeverageShanxi6090112

Leave a Reply