Although AI modeling and prediction in Python is popular, SPL is also a good alternative to get started

There are many tools that can be used for AI modeling and prediction, such as Python, R, SAS and SPSS. Among them, Python is very popular because it is simple, easy to learn, rich in data science libraries, open source and free. However, modeling in Python is still complicated for programmers who are not very familiar with data modeling algorithms. In many cases, they have no idea where to start and do not know which algorithm to choose once they have the data. In fact, SPL is also a good choice for a data analysis and modeling task, since SPL is simpler and easier to use than Python and computes fast. In addition, SPL provides an interactive interface that is very friendly to data analysis, as well as easy-to-use automated data modeling functionality and a set of data processing and statistical functions.

This article walks you through the detailed steps of building a data model and making predictions in SPL, taking a client loan default prediction table as an example.

1. Determine an objective, prepare data

Data modeling and prediction mean mining historical data to find regular patterns, and then using those patterns to predict what might happen in the future. Such a pattern is what is generally referred to as a model.

Historical data usually takes the form of what we commonly call a wide table. For instance, the historical data for client loan default prediction is an Excel table as shown below:

..

To build a data model and make predictions, the prerequisite is that the data (wide table) must include what we want to predict, usually called the prediction target. The prediction target in this example is the default behavior of historical clients, that is, the y column, where yes means a default occurred and no means it did not. The prediction target can also be a value (such as a product's sales or price), or a category, for example, the quality grade that the to-be-predicted target falls into (excellent, good, acceptable or poor). Sometimes the targets are available in the raw data and can be obtained directly, while sometimes they need to be annotated manually.

In addition to the prediction target, a lot of information is needed for prediction, such as the client's age, job, housing and loan in the table. Each such column is called a variable, i.e., information associated with whether the lender will default in the future. In principle, the more variables that can be collected, the higher the prediction accuracy. For example, to predict whether a customer will buy a product, you need to collect the customer's behavior information, shopping preferences, product feature information, promotion strength, etc.; to predict the claim risk of vehicle insurance, you need to collect the policy data, vehicle information, the vehicle owner's traffic habits, historical claim information, etc.; to predict health insurance claims, you need to collect information about the insurant's living habits, physical condition, medical treatment and medical visits, and so on; to predict the sales of a mall or supermarket, you need to collect historical sales orders, customer information and product information; to predict defective products, you need to collect data such as production process parameters, environment and raw material condition. In short, the more relevant information is collected, the better the prediction will be.

When collecting data, it is common to clip out the historical data of a certain period to create the wide table. For example, if you want to predict client defaults in July, you can collect the data from January to June to build the model. The time range is not fixed and can be chosen flexibly; for example, you can choose the last year or the last 3 months.

The ready wide table can be saved in xls or csv format, with the first row being the column titles and each subsequent row being a historical record.
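As a sketch, the first few lines of such a CSV wide table might look as follows. The column names follow the loan default example; the values are made up purely for illustration:

```csv
age,job,housing,loan,y
33,technician,yes,no,no
47,blue-collar,yes,yes,yes
29,management,no,no,no
```

The first row holds the variable names plus the prediction target (y), and each subsequent row is one historical client record.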

If a business has an information system in place, you can ask the IT department for the relevant data. Many businesses can export such data directly from their BI system.

2. Download software, configure YModel external library

With the aid of the YModel external library, SPL provides fully automated data modeling and prediction functionality.

(1) Download and install esProc (SPL) and YModel

Download esProc at: http://c.raqsoft.com/article/1595817756260

Download YModel at: http://www.raqsoft.com/ymodel-download

Install esProc and YModel, and record the installation directory, such as: C:\Program Files\raqsoft\ymodel

..

(2) Configure external library in SPL

(a) Copy files required for external library

Find the YModelCli and lib folders in the installation directory of YModel.

..

Then open the two folders, find the files required by the YModel external library, and copy them to the esProc external library directory ([root directory]\esProc\extlib\YModelCli), such as C:\Program Files\raqsoft\esProc\extlib\YModelCli.

The files required for YModel external library include:

1> The YModelCli folder under the YModel directory contains the following jar files and one xml file:

  • ant-1.8.2.jar
  • commons-beanutils-2.10.jar
  • commons-lang-2.6.jar
  • ezmorph-1.0.2.jar
  • json-lib-1.1-jdk13.jar
  • scu-ymodel-cli-2.10.jar
  • userconfig.xml

2> The lib folder under the YModel directory contains the following jar files:

  • commons-io-2.4.jar
  • esproc-ext-20221201.jar
  • fastjson-1.2.49.jar
  • gson-2.8.0.jar
  • jackson-annotations-2.9.6.jar
  • jackson-core-2.9.6.jar
  • jackson-databind-2.9.6.jar
  • jackson-databind-2.9.6-sources.jar
  • jackson-dataformat-msgpack-0.8.14.jar
  • mining.jar
  • msgpack-0.6.12.jar
  • msgpack-core-0.8.16.jar

(b) Set the parameters of userconfig.xml

Set the parameters in the userconfig.xml file under esProc’s directory esProc\extlib\YModelCli.

  • sAppHome: Installation directory of YModel
  • sPythonHome: Python path under the YModel directory (Windows: raqsoft\ymodel\Python39; Linux: raqsoft/ymodel/Python39/bin/python3.9)
  • iPythonServerPort: Python service network port
  • iPythonProcessNumber: Number of Python processes
  • bAutoDecideImpute: Whether to impute intelligently
  • iResampleMultiple: Resampling times

The parameters that must be configured are sAppHome and sPythonHome. For the other parameters, you can keep the default values and modify them if necessary. For example, you can set them as follows; the sAppHome and sPythonHome values must match your own installation path.

<?xml version="1.0" encoding="UTF-8"?>
<Config Version="1">
<Options>
<Option Name="sAppHome" Value="C:\Program Files\raqsoft\ymodel"/>
<Option Name="sPythonHome" Value="C:\Program Files\raqsoft\ymodel\Python39\python.exe"/>
<Option Name="iPythonServerPort" Value="8510"/>
<Option Name="iPythonProcessNumber" Value="2"/>
<Option Name="bAutoDecideImpute" Value="true"/>
<Option Name="iResampleMultiple" Value="150"/>
</Options>
</Config>

In fact, it can be seen that YModel itself is built on Python: it encapsulates Python's algorithms, so programmers do not have to understand the mathematical principles and implementation details of these algorithms.

(c) SPL environment configuration

1> Configure the external library

Start SPL, and in the Options menu, check YModelCli under Select external libraries to make it take effect. The external library path is the esProc extlib\YModelCli path into which the files were copied above.

..

On a server without a GUI, the path and name of the external library are set in the file esProc\config\raqsoftConfig.xml under the esProc installation directory.

<extLibsPath>: external library path

<importLibs>: external library name (multiple names allowed)

2> Set the number of threads

If concurrent prediction is involved, you need to set the “parallel limit” in SPL, i.e., the number of threads. Set it according to your needs and machine configuration.

..

On a server without a GUI, the parallel limit is set in the same file, esProc\config\raqsoftConfig.xml, under the esProc installation directory.

<parallelNum>: parallel limit
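Putting these settings together, the relevant fragment of raqsoftConfig.xml might look like the sketch below. The paths and the parallel number are illustrative, and the exact element nesting may differ between esProc versions, so treat this as a template rather than a definitive configuration:

```xml
<!-- Illustrative fragment of esProc\config\raqsoftConfig.xml -->
<!-- extLibsPath: directory that contains the YModelCli folder -->
<extLibsPath>C:\Program Files\raqsoft\esProc\extlib</extLibsPath>
<!-- importLibs: names of the external libraries to load -->
<importLibs>
    <lib>YModelCli</lib>
</importLibs>
<!-- parallelNum: maximum number of threads for concurrent prediction -->
<parallelNum>4</parallelNum>
```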

At this point, the environment configuration is complete.

3. Modeling and prediction

(1) Load data

SPL supports modeling on data from a CSV file, an Excel file or a database. Here we take a csv file as an example; the modeling method for other data sources is similar.

Suppose there is a loan default data table as follows, and we want to build a model to predict whether new clients will default.

..

The file name is bank-full.csv.

A
1=file("bank-full.csv").import@tc()
2=ym_env()
3=ym_model(A2,A1)

A1: import the modeling data and read it as a table sequence;

..

A2: initialize the environment. After A2 is executed, it generates a store directory and subdirectories under the YModel installation directory to save data and result files;

A3: load the modeling data to generate the md object.

(2) Set target variable and variable statistics

Once the data is loaded, the target variable needs to be set.

A
4=ym_target(A3,"y")
5=ym_statistics(A3,"age")
6=A1.fname().(ym_statistics(A3,~))

A4: set the field “y” as the target variable, which can be a binary variable or a numeric variable.

A5: view the statistics of a given variable, such as “age”. The returned result includes the missing rate, maximum value, minimum value, exception values and a data distribution graph.

..

A6: loop through the variable names to view the statistics of all fields; this returns a two-level sequence containing the statistics of all fields.

..

(3) Build a model and evaluate its performance

A
7=ym_build_model(A3)
8=ym_present(A7)
9=ym_performance(A7)
10=ym_importance(A7).sort@z(Importance)

A7: build a model with the modeling function. Once A7 is executed, a fully automated data pre-processing and modeling process runs in the background. This takes some time, depending on the amount of data. The returned result is a pdmodel object.

Having built the model, you can invoke the pdmodel object to view the information and quality of the model, as well as the importance of each variable.

A8: return the model's AUC value and parameters

..

A9: return multiple model metrics and graphs, such as AUC, ROC and Lift.

..

For example, click on the value of the 6th record of A9, then click the “Graph browse” icon in the upper right corner, and finally select “Lift” in the value field to view the Lift curve.

..

A10: return the degree of influence of each variable on the target variable, sorted in descending order by importance. The larger the value, the greater the influence on the target variable; sorting in descending order makes the analysis more intuitive.

..

(4) Save model

A
11=ym_save_pcf(A7,"bankfull.pcf")
12=ym_json(A7)
13>ym_close(A2)

A11: save the model as “bankfull.pcf”; the default save path is [sAppHome]/store/predict.

A12: return the model information in the form of a JSON string. For more information about the JSON format, refer to the online document JSON-style Parameter Guide at http://doc.raqsoft.com/YModel/jsonpara.

A13: close the environment to free resources.

(5) Data prediction

The pcf model file and prediction dataset should be available before prediction.

A
1=ym_env()
2=ym_load_pcf("bankfull.pcf")
3=file("bank-full2.csv").import@tc()
4=ym_predict(A2,A3)
5=ym_result(A4)
6=file("bank-full_result.csv").export@tc(A5)
7>ym_close(A1)

A1: initialize environment

A2: import the pcf model file and generate a pdmodel object

A3: import the to-be-predicted dataset and read it as table sequence

A4: perform prediction on the data in the table sequence. In addition to a table sequence, SPL supports cursors as well as csv and mtx files; for example, A4 can be written directly as ym_predict(A2,"bank-full2.csv")

A5: get the prediction result

A6: export the prediction result. In this example, the prediction result is the probability of client default

..

A7: close the environment to free resources.

4. Integrating and calling

SPL can be integrated into and called by an upper-layer application. For example, SPL can be embedded in a Java application.
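A minimal sketch of such an embedding is shown below, using esProc's JDBC driver (com.esproc.jdbc.InternalDriver, shipped with the esProc installation). It assumes the prediction steps above have been saved as an SPL script named predict.splx in esProc's main path; the script name and the printed column are illustrative:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PredictDemo {
    public static void main(String[] args) throws Exception {
        // Load the esProc JDBC driver (requires the esProc jars on the classpath)
        Class.forName("com.esproc.jdbc.InternalDriver");
        // "local://" runs the embedded esProc engine inside this JVM
        try (Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
             Statement st = conn.createStatement()) {
            // Invoke the SPL script predict.splx, which performs the
            // ym_load_pcf / ym_predict steps shown above and returns the result
            ResultSet rs = st.executeQuery("call predict()");
            while (rs.next()) {
                // Print the first column of each prediction record
                System.out.println(rs.getObject(1));
            }
        }
    }
}
```

The same pattern works for any SPL script, so the modeling and prediction logic stays in SPL while the Java application only sees a standard JDBC result set.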

Summary

It is very simple to build a data model and make predictions with SPL and YModel. This eliminates the need for programmers to understand profound mathematical principles, so a data modeling task can be done in just a few simple steps as long as training data is prepared. Moreover, this functionality can be easily embedded in applications. Complex AI is therefore no longer the preserve of a small number of data scientists.

SPL itself has very strong data processing abilities, which makes it convenient to accomplish data preparation before applying AI algorithms. In addition, SPL provides a wealth of math functions, so those who have some math knowledge and want to implement data modeling on their own can study it further.
