User Behavior Analysis in Practice 5: Using Dimension Table

User Behavior Analysis in Practice 5: Using Dimension Table

Target task:

We have a user events table T. Below is its structure and part of its data:

TimeUserIDEventTypeIDProductIDQuantity
2022/6/1 10:2010727553100001
2022/6/1 12:1210780302100002
2022/6/1 12:36100509351000033
2022/6/1 13:2110486551
2022/6/1 14:4610378246
2022/6/1 15:19104962641000044
2022/6/1 16:00100929651000056
2022/6/1 16:3910707132100006
2022/6/1 17:4010908843100007

Fields in table T:

Field nameData typeDescription
TimeDatetimeTime stamp of an event, accurate to milliseconds
UserIDStringUser ID
EventTypeIDIntegerEvent type ID
ProductIDStringProduct ID
QuantityNumericQuantity

Dimension table EventType:

EventTypeIDEventType
1Login
2Browse
3Search
4AddtoCart
5Submit
6Logout

Dimension table Product:

ProductIDProductNameUnitPriceProductTypeID
100001ApplePound5.51
100002TissuePacks162
100003BeefPound353
100004WineBottles1204
100005PorkPound253
100006BreadPacks105
100007JuiceBottles64

Fields in dimension table Product:

Field nameData typeDescription
ProductIDStringProduct ID
ProductNameStringProduct name
UnitStringSales unit
PriceNumericUnit price
ProductTypeIDIntegerProduct type ID

Dimension table ProductType:

ProductTypeIDProductType
1Fruits
2Home&Personalcare
3Meat
4Beverage
5Bakery

Relationship between tables:

undefined

Computing task:

Calculate the total sales amount, number of orders, search frequency and the number of distinct users performing search and ordering under each type of product within a specified time period.

Techniques involved:

1 Join tables through dimension table rather than generating a wide table. This can reduce the volume of data to be stored and increase retrieval speed.

2 Use a global variable to pre-load dimension tables and establish associations for later reuse.

Sample code

1. According to our previous practices, we dump data from user events table T and store it in composite table T.ctx according to the order of Time field; and then dump data in those dimension tables as bin files EventType.btx, Product.btx and ProductType.btx.

2. Import each dimension table into memory, set primary key for them, open the composite table cursor, establish associations with dimension tables, and perform grouping &aggregation.

Suppose we need to summarize data that falls in between 2022-03-15 and 2022-06-16:

A
1>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
2=file("T.ctx").open().cursor(UserID,EventTypeID,ProductID,Quantity;Time>=start && Time<=end && (EventTypeID==5 || EventTypeID==3))
3>EventType=file("EventType.btx").import@b().keys@i(EventTypeID)
4>ProductType=file("ProductType.btx").import@b().keys@i(ProductTypeID)
5>Product=file("Product.btx").import@b().keys@i(ProductID)
6>Product=Product.switch(ProductTypeID, ProductType:ProductTypeID)
7=A2.switch(ProductID,Product:ProductID;EventTypeID,EventType:EventTypeID)
8=A7.groups(EventTypeID,ProductID.ProductTypeID;EventTypeID.EventType,ProductID.ProductTypeID.ProductType,sum(Quantity):Quantity,count(1):Num, icount(UserID):iNum)

A2 Retrieve records within the specified time range and where the event type is “submit order” and “search” from the composite table file and create a cursor based on them.

A3 Load dimension data from bin file EventType.btx, and set primary key and create index on it.

A4 Load dimension data from bin file ProdcutType.btx and set ProductTypeID as its primary key.

A5 Load dimension data from bin file Prodcut.btx and set ProductID as the primary key.

A6 Establish association between Product and ProductType.

A7 Associate A2’s cursor with in-memory dimension tables Product and EventType respectively.

A8 Perform grouping calculation to generate small result sets from A7’s cursor, the joining result.

When using switch() function to join with a dimension table, you need to set primary key for the dimension table in advance and perform join operation through the key. The join amounts to adding references of dimension table records in the associated field of the original table. Then you can reference any field of the dimension table using the syntax "field of the original table.field of the dimension table".

With hierarchical dimension tables, like table T – Product table – ProductType table in this instance, you can use the dot operator (.) to reference levels of dimension tables one by one, such as "ProductID.ProductTypeID.ProductType". It means that table T’s ProductID field references ProductTypeID field in its dimension table and then ProductType field in the dimension table’s dimension table.

3. Dimension tables are often used repeatedly. As they are generally not large, we can load them into the memory, establish associations and store each of them as a global variable. Then there is no need to load dimension tables and establish associations again for summarizations and just use the global variable directly. This way the above code can be regarded as including two parts. The first part is to load dimension tables as global variables at the startup of the server. The second part is the code for performing summarization.

Part one (execute once at the startup of the server):

A
1=file("EventType.btx").import@b().keys@i(EventTypeID)
2=file("ProductType.btx").import@b().keys@i(ProductTypeID)
3=file("Product.btx").import@b().keys@i(ProductID)
4>env(EventType,A1),env(ProductType,A2),env(Product,A3)
5>Product.switch(ProductTypeID,ProductType:ProductTypeID)

A4 Store each of the in-memory dimension tables as global variables for reference by the code for performing summarization.

Part two (summarization):

A
1>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
2=file("T.ctx").open().cursor(UserID,EventTypeID,ProductID,Quantity;Time>=start && Time<=end && (EventTypeID==5 || EventTypeID==3))
3=A2.switch(ProductID,Product:ProductID;EventTypeID,EventType:EventTypeID)
4=A3.groups(EventTypeID,ProductID.ProductTypeID;EventTypeID.EventType,ProductID.ProductTypeID.ProductType,sum(Quantity):Quantity,count(1):Num, icount(UserID):iNum)

Execution result:

EventTypeIDProductTypeIDEventTypeProductTypeQuantityNumiNum
31SearchFruits049958648735
32SearchHome&Personalcare050889749872
33SearchMeat040321339923
34SearchBeverage032456729045
35SearchBakery033549830234
51SubmitFruits20693810346913523
52SubmitHome&Personalcare46318815439614656
53SubmitMeat94378933668754
54SubmitBeverage217504543765233
55SubmitBakery339480678965844

Leave a Reply