User Behavior Analysis in Practice 7: Dimension Table Filtering

User Behavior Analysis in Practice 7: Dimension Table Filtering

Target task

We have a user events table T. Below is its structure and part of its data:

TimeUserIDProductIDQuantity
2022/6/1 10:20107275517
2022/6/1 12:12107803028
2022/6/1 12:36100509333
2022/6/1 13:21104865549
2022/6/1 14:46103782455
2022/6/1 15:19104962664
2022/6/1 16:00100929676
2022/6/1 16:39107071387
2022/6/1 17:40109088494

Fields in table T:

Field nameData typeDescription
TimeDatetimeTime stamp of an event, accurate to milliseconds
UserIDStringUser ID
ProductIDIntegerProduct ID
QuantityNumericQuantity

Dimension table Product:

ProductIDProductNameUnitPriceProductTypeID
1ApplePound5.51
2TissuePacks162
3BeefPound353
4WineBottles1204
5PorkPound253
6BreadPacks105
7JuiceBottles64

Fields of dimension table Product:

Field nameData typeDescription
ProductIDStringProduct ID
ProductNameStringProduct name
UnitStringSales unit
PriceNumericUnit price
ProductTypeIDIntegerProduct type ID

Computing task:

Find the number of sold pieces for each product under type IDs 1, 2 and 3 within a specified time period.

Techniques involved:

1. Filtering dimension table before join

We first filter the dimension table, and then join it with the fact table. Thereisnoneedtojudgethespecifiedconditionondimensiontableforfacttablerecordsthatdonotmatchthefiltereddimensiontable. In this way the number of comparisons will decrease considerably.

2. Performing join in cursor

Join the fact table’s cursor with the filtered dimension table, during which fact table records that cannot associate with the dimension table will not be generated. This further reduces time spent in generating records in the joining result set.

3. Index reuse

Use the existing index on the filtered dimension table. When the filtering result set is still large, which means only a small number of records are filtered away, the technique saves your time for re-creating the index.

4. Alignment sequence

We can use an efficient alignment sequence to filter a numberized dimension table by generating a same-length table sequence, whose member values are whether the corresponding dimension records meet the filtering condition. When joining the dimension table with the fact table, we can directly find whether a fact table record matches or not according to its ordinal number. This saves time for performing the join and comparisons.

Sample code

Suppose T.ctx and Product.btx are already created as above introduces. T.ctx is ordered by Time and Product.btx is ordered by ProductID.

1. Filter dimension table Product and join it with T; delete records from table T that do not have matching records in the dimension table during joining.

A
1>Product=file("Product.btx").import@b().select([1,2,3].pos(ProductTypeID)!=null).keys@i(ProductID)
2>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
3=file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end)
4=A3.switch@i(ProductID,Product:ProductID)
5=A4.groups(ProductID; ProductID.ProductName, sum(Quantity):Quantity)

A1 Import the dimension table and filter it, and set index on the primary key.

A4 Join the fact table and the dimension table; @i option enables deleting records from the fact table that do not match the dimension table.

2. Move the association action in step 1 to the cursor generation statement, during which fact table records that cannot match won’t be generated in the cursor.

A
1>Product=file("Product.btx").import@b().select([1,2,3].pos(ProductTypeID)!=null).keys@i(ProductID)
2>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
3=file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end,ProductID:Product)
4=A3.groups(ProductID; ProductID.ProductName, sum(Quantity):Quantity)

A3 Move the action of association between the fact table and the dimension table to the cursor generation statement; fact table records that are not matched won’t be generated in the result set.

3. Use the existing index on filtered dimension table Product

A
1>Product=file("Product.btx").import@b().keys@i(ProductID)
2=Product.select@i([1,2,3].pos(ProductTypeID)!=null)
3>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
4=file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end,ProductID:A2)
5=A4.groups(ProductID; ProductID.ProductName, sum(Quantity):Quantity)

A1 Import dimension table Product and set index on primary key.

A2 @i option enables reusing the existing index on the filtered dimension table.

Index reuse does not always make the computation faster. As records that are filtered away need to be deleted from the index table, it takes some time to perform the delete action when there are a lot of records are discarded (the number of desired records is small). In this case re-creating the index may be faster. It is important that we choose an appropriate method according to the actual situation.

4. Use alignment sequence as primary key values of dimension table Product are ordinal numbers

A
1>Product=file("Product.btx").import@b()
2Product.([1,2,3].pos(ProductTypeID))=
3>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
4=file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end)
5=A4.select(A2(ProductID))
6=A5.groups(ProductID; Product(ProductID).ProductName, sum(Quantity):Quantity)

A1 Import dimension table Product without setting index on the primary key.

A2 Generate an alignment sequence according to Product table. Each of its values is whether the current record meets the filtering condition or not.

A5 Directly locate the target member in A2 according to table T’s ProductID to find out whether the current record satisfies the filtering condition.

5. Move the alignment-sequence-based filtering action to the cursor generation statement

A
1>Product=file("Product.btx").import@b()
2Product.([1,2,3].pos(ProductTypeID))=
3>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
4=file("T.ctx").open().cursor(ProductID,Quantity;Time>=start && Time<=end && A2(ProductID))
5=A4.groups(ProductID; Product(ProductID).ProductName, sum(Quantity):Quantity)

A4 Move the alignment-sequence-based filtering action to the cursor generation statement. Records that do not meet the filtering condition won’t be generated.

Execution result:

ProductIDProductNameQuantity
1Apple206938
2Tissue463188
3Beef94378
5Pork217504

Leave a Reply