Data mining, modeling and prediction in SPL

With the aid of YModel, SPL can perform fully automated data modeling and prediction. This article walks through the steps.

I. Configure YModel

1. Download and install YModel

Download at: http://www.raqsoft.com/ymodel-download

Install YModel and note the installation directory, for example C:\Program Files\raqsoft\ymodel

2. Configure external library in SPL

(1) Copy files required for external library

Find the YModelCli and lib folders in the installation directory of YModel.

Then open the two folders, find the files required for the YModel external library, and copy them to the esProc installation directory ([root directory]\esProc\extlib\YModelCli), for example C:\Program Files\raqsoft\esProc\extlib\YModelCli.

The files required for the YModel external library come from two places:

1) The YModelCli folder under the YModel directory contains the following jar files and one xml file:

  • ant-1.8.2.jar
  • commons-beanutils-2.10.jar
  • commons-lang-2.6.jar
  • ezmorph-1.0.2.jar
  • json-lib-1.1-jdk13.jar
  • scu-ymodel-cli-2.10.jar
  • userconfig.xml

The core jar file of the external library is scu-ymodel-cli-2.10.jar.

2) The lib folder under the YModel directory contains the following jar files:

  • commons-io-2.4.jar
  • esproc-ext-20221201.jar
  • fastjson-1.2.49.jar
  • gson-2.8.0.jar
  • jackson-annotations-2.9.6.jar
  • jackson-core-2.9.6.jar
  • jackson-databind-2.9.6.jar
  • jackson-databind-2.9.6-sources.jar
  • jackson-dataformat-msgpack-0.8.14.jar
  • mining.jar
  • msgpack-0.6.12.jar
  • msgpack-core-0.8.16.jar

Note: the jars listed above are third-party dependencies. Users can choose which of them to copy according to their actual application environment.
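
The copy step can also be scripted. The following is a minimal Python sketch, not part of YModel or esProc. The function name copy_extlib_files and its three directory arguments are hypothetical, and it copies every .jar file it finds plus userconfig.xml, so trim the result against the lists above if your environment needs fewer dependencies.

```python
import shutil
from pathlib import Path


def copy_extlib_files(cli_dir, lib_dir, target_dir):
    """Copy jars (and userconfig.xml) from two source folders into target_dir.

    All three paths are supplied by the caller; nothing here is tied to a
    particular YModel or esProc version.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Jars from the client folder first, then jars from the lib folder.
    sources = sorted(Path(cli_dir).glob("*.jar"))
    sources += sorted(Path(lib_dir).glob("*.jar"))

    # The configuration file lives next to the client jars.
    config = Path(cli_dir) / "userconfig.xml"
    if config.is_file():
        sources.append(config)

    copied = []
    for src in sources:
        shutil.copy2(src, target / src.name)  # overwrites an existing copy
        copied.append(src.name)
    return copied
```

The function returns the names of the copied files, which makes it easy to compare against the two lists before starting SPL.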

(2) Set the parameters of userconfig.xml

Set the parameters in the userconfig.xml file under esProc’s directory esProc\extlib\YModelCli.

Name                  Description
sAppHome              Installation directory of YModel
sPythonHome           Python path under the YModel directory
                      Windows: raqsoft\ymodel\Python39
                      Linux: raqsoft/ymodel/Python39/bin/python3.9
iPythonServerPort     Python service network port
iPythonProcessNumber  Number of Python processes
bAutoDecideImpute     Whether to impute intelligently
iResampleMultiple     Resampling times

The parameters that must be configured are sAppHome and sPythonHome. The other parameters can keep their default values and be modified only if necessary. For example, you can use the following settings, where the values of sAppHome and sPythonHome must match your own installation path.

<?xml version="1.0" encoding="UTF-8"?>
<Config Version="1">
<Options>
<Option Name="sAppHome" Value="C:\Program Files\raqsoft\ymodel"/>
<Option Name="sPythonHome" Value="C:\Program Files\raqsoft\ymodel\Python39\python.exe"/>
<Option Name="iPythonServerPort" Value="8510"/>
<Option Name="iPythonProcessNumber" Value="2"/>
<Option Name="bAutoDecideImpute" Value="true"/>
<Option Name="iResampleMultiple" Value="150"/>
</Options>
</Config>
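
Before starting SPL, it can help to confirm that the two required options are filled in. The sketch below is plain Python using only the standard library, not a YModel tool. read_options and missing_required are hypothetical helper names, and the embedded sample mirrors the configuration shown above.

```python
import xml.etree.ElementTree as ET

# Options that the article says must always be configured.
REQUIRED = ("sAppHome", "sPythonHome")

# Raw bytes literal so that the Windows backslashes are kept as written.
SAMPLE = rb"""<?xml version="1.0" encoding="UTF-8"?>
<Config Version="1">
<Options>
<Option Name="sAppHome" Value="C:\Program Files\raqsoft\ymodel"/>
<Option Name="sPythonHome" Value="C:\Program Files\raqsoft\ymodel\Python39\python.exe"/>
<Option Name="iPythonServerPort" Value="8510"/>
</Options>
</Config>"""


def read_options(xml_bytes):
    """Return {Name: Value} for every <Option> element in the document."""
    root = ET.fromstring(xml_bytes)
    return {opt.get("Name"): opt.get("Value") for opt in root.iter("Option")}


def missing_required(options):
    """List the required option names that are absent or blank."""
    return [name for name in REQUIRED if not (options.get(name) or "").strip()]


if __name__ == "__main__":
    print(missing_required(read_options(SAMPLE)))
```

To check a real file, read it with pathlib.Path(...).read_bytes() and pass the result to read_options. An empty list from missing_required means both required options are present.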

(3) SPL environment configuration

a. Configure external library

Start SPL and, in the Options menu, check YModelCli under “Select external libraries” for it to take effect. The external library path is the esProc path where YModelCli was installed in step (1).

On a server without a GUI, set the path and name of the external library in the file esProc\config\raqsoftConfig.xml under the esProc installation directory:

<extLibsPath>: external library path

<importLibs>: external library name (multiple names allowed)

b. Set number of threads

If concurrent prediction is involved, set the “parallel limit” in SPL, i.e., the number of threads, according to your needs and the capacity of the machine.

On a server without a GUI, set the parallel limit in the file esProc\config\raqsoftConfig.xml under the esProc installation directory:

<parallelNum>: parallel limit

This completes the environment configuration.

II. Modeling

1. Load data

SPL supports modeling on data from CSV files, Excel files, or databases. Here we take CSV as an example; the modeling methods for other data sources are similar.

Suppose there is a loan default data table, and we want to build a model to predict whether new clients will default. The file name is bank-full.csv.

    A
1   =file("bank-full.csv").import@tc()
2   =ym_env()
3   =ym_model(A2,A1)

A1: import the modeling data and read it as a table sequence.

A2: initialize the environment. After A2 is executed, a store directory and its subdirectories are generated under the installation directory of YModel to save data and result files.

A3: load the modeling data to generate the mdobject.

2. Set the target variable and view variable statistics

Once the data is loaded, the target variable needs to be set.

    A
4   =ym_target(A3,"y")
5   =ym_statistics(A3,"age")
6   =A1.fname().(ym_statistics(A3,~))

A4: set the field “y” as the target variable, which can be a binary variable or a numeric variable.

A5: view the statistics of a single variable, such as “age”. The returned result includes the missing rate, maximum value, minimum value, outlier values, and a data distribution graph.

A6: loop through the variable names to view the statistics of all fields. It returns a two-level sequence containing the statistics of every field.
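
To make the kind of per-field summary concrete, here is a minimal Python sketch of the missing rate, minimum, and maximum for each column. It is an illustration only, not how ym_statistics is implemented, and it covers just three of the values mentioned above. column_stats and all_column_stats are hypothetical names, and None stands for a missing value.

```python
def column_stats(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    total = len(values)
    return {
        "count": total,
        "missing_rate": (total - len(present)) / total if total else 0.0,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


def all_column_stats(table):
    """Apply column_stats to every column of a {name: values} table,
    mirroring the idea of looping over all field names."""
    return {name: column_stats(values) for name, values in table.items()}


if __name__ == "__main__":
    table = {"age": [30, None, 45, 58], "balance": [100, 250, None, None]}
    for name, stats in all_column_stats(table).items():
        print(name, stats)
```

Looping over the field names, as A6 does, gives one such summary per field.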

3. Build a model and view model performance

    A
7   =ym_build_model(A3)
8   =ym_present(A7)
9   =ym_performance(A7)
10  =ym_importance(A7).sort@z(Importance)

A7: build a model with the modeling function. Once A7 is executed, fully automated data preprocessing and modeling run in the backend. This takes some time, depending on the amount of data. The returned result is a pdmodel object.

Once the model is built, you can use the pdmodel object to view the model's information and quality and the importance of each variable.

A8: return the model’s AUC values and parameters.

A9: return multiple model metrics and graphs, such as AUC, ROC, and Lift.

For example, click on the value of the 6th record of A9, then click on the “Graph browse” icon in the upper right corner, and finally select “Lift” in the value field to view the Lift curve.
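
For readers unfamiliar with these metrics, the following Python sketch shows one common way to compute AUC and a single point on a Lift curve from labels and predicted scores. It uses the standard textbook definitions and is not taken from YModel, whose exact calculation may differ. The function names auc and lift_at are hypothetical.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels are 1 (positive) or 0 (negative); a tie counts as half a pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


def lift_at(labels, scores, fraction):
    """Positive rate in the top-scored fraction divided by the overall rate.

    Assumes at least one positive label, otherwise the overall rate is zero.
    """
    ranked = [y for _, y in sorted(zip(scores, labels),
                                   key=lambda t: t[0], reverse=True)]
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(ranked[:k]) / k
    base_rate = sum(ranked) / len(ranked)
    return top_rate / base_rate


if __name__ == "__main__":
    y = [1, 1, 0, 0]
    p = [0.9, 0.4, 0.6, 0.1]
    print(auc(y, p), lift_at(y, p, 0.25), lift_at(y, p, 0.5))
```

A lift above 1 at a given fraction means the top-scored records contain more positives than a random sample of the same size would.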

A10: return the degree of influence of each variable on the target variable, sorted in descending order by importance. The larger the value, the greater the influence on the target variable. Sorting in descending order makes the analysis more intuitive.

4. Save model

    A
11  =ym_save_pcf(A7,"bankfull.pcf")
12  =ym_json(A7)
13  >ym_close(A2)

A11: save the model as “bankfull.pcf”; the default save path is [sAppHome]/store/predict.
A12: return the model information as a JSON string. For more information about the JSON parameters, refer to the online document JSON-style Parameter Guide at http://doc.raqsoft.com/YModel/jsonpara.
A13: close the environment to free resources.

III. Prediction

The pcf model file and prediction dataset should be available before prediction.

1. Data prediction

    A
1   =ym_env()
2   =ym_load_pcf("bankfull.pcf")
3   =file("bank-full2.csv").import@tc()
4   =ym_predict(A2,A3)
5   =ym_result(A4)
6   =file("bank-full_result.csv").export@tc(A5)
7   >ym_close(A1)

A1: initialize the environment.

A2: import the pcf model file and generate a pdmodel object.

A3: import the dataset to be predicted and read it as a table sequence.

A4: perform prediction on the table sequence. Besides table sequences, SPL also supports cursors and csv and mtx files; for example, A4 can be written directly as ym_predict(A2,"bank-full2.csv").

A5: get the prediction result.

A6: export the prediction result. In this example, the prediction result is the probability that a client defaults.

A7: close the environment to free resources.

2. Data prediction in application

1. Initialize the environment and load the model at system startup

    A
1   =ym_env()
2   =ym_load_pcf(pcf_file)
    …(Multiple models can be loaded)
3   =env(YM,A1)
4   =env(YM_Model_xxx,A2)

A1: initialize the environment
A2: load the pcf model file and generate a pdmodel object
A3: set A1 as a global variable
A4: set A2 as a global variable

2. Perform data prediction

In the application, execute the following statement through SPL:

ym_predict(YM_Model_xxx,pre_data)

In this way, the model YM_Model_xxx, which was loaded during initialization, is used for prediction and returns the result. pre_data in this statement is the prediction data passed in from outside.

For how to call SPL from the host application, see:

https://blog.scudata.com/spl-learning-materials/#Invocation-Intergration

A call made this way is executed immediately.

Since YModel is a separate application, SPL has to communicate with it whenever it calls YModel, so every prediction involves one round of communication. Under high concurrency, this frequent communication causes significant delay.

To solve the delay problem, the ym_predict function accepts an additional parameter:

ym_predict(YM_Model_xxx,pre_data,duration)

Here, duration delays the prediction. If duration is set to 100, time is divided into 100-millisecond intervals, and an incoming request waits until the next 100-millisecond boundary. Prediction is then performed once on all requests aggregated within that interval. Communication that would otherwise happen many times is combined into one round, which avoids frequent communication under high concurrency. A delay of at most 100 ms is barely noticeable to front-end users, so the overall user experience improves.
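
The interval idea can be illustrated with a small Python sketch. It only models the timing described above and does not call YModel or reproduce how ym_predict aggregates requests internally. next_boundary and group_into_batches are hypothetical names, and the choice that a request arriving exactly on a boundary waits for the following one is an assumption.

```python
from collections import defaultdict


def next_boundary(arrival_ms, duration_ms):
    """First interval boundary strictly after arrival_ms.

    Assumption: a request landing exactly on a boundary waits for the
    next one, so the wait is always between 1 and duration_ms.
    """
    return (arrival_ms // duration_ms + 1) * duration_ms


def group_into_batches(requests, duration_ms):
    """Map each boundary time to the ids of the requests flushed at it.

    requests is an iterable of (request_id, arrival_ms) pairs.
    """
    batches = defaultdict(list)
    for request_id, arrival_ms in requests:
        batches[next_boundary(arrival_ms, duration_ms)].append(request_id)
    return dict(batches)


if __name__ == "__main__":
    arrivals = [("a", 5), ("b", 40), ("c", 99), ("d", 100), ("e", 230)]
    # Five requests collapse into three calls, at 100, 200 and 300 ms.
    print(group_into_batches(arrivals, 100))
```

In this model, five requests result in three rounds of communication instead of five, and no request waits longer than the interval length.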

3. Release the environment when the system shuts down

    A
1   >ym_close(YM)
