Sample Programs of Structuralizing Excel Files

Sample Programs of Structuralizing Excel Files

An Excel file needs to be parsed and structuralized before it is computed or loaded into a relational database. Often we have an Excel file with non-standardized format and there are various file structures. The program for structuralizing an Excel file thus becomes complicated and bloated. It’s also difficult to migrate. You need to do the parsing from scratch for each type of format or structure before programming.

This article illustrates how to parse Excel files of different formats and sizes, including the common row-wise format, complicated table headers, free-style format, crosstab, main & sub tables, and big files, into structured data, and provides sample programs in esProc SPL. esProc is intended as a professional data computing engine that is based on Structured Process Language (SPL). SPL boasts a complete set of Excel file handling functions to structuralize and process Excel files and to write the result set to the database.

1. Common row-wise format

This is the simplest Excel file format. Each row is a record, and in most cases the first row contains column headers.

Example: Based on the students score file scores.xlsx, calculate the average Chinese score for each class. Below is part of the file:

esProc SPL script:

AComment
1=file(“e:/excel/scores.xlsx").xlsimport@t()Read the file; @t option read the 1st row as column headers
2=A1.groups(Class;avg(Chinese):avg_Chinese)Group rows by Class and calculate the average Chinese score of each class
3=file("e:/excel/class_avg_c.xlsx").xlsexport@t(A2)Export result set to a new Excel file

2. Complicated table headers

In many real-world cases, Excel files have complicated table headers that occupy several rows, including table title, project name, person who completes the form, date, page information, etc. To parse such an Excel file, you need to skip the table headers, specify a row where the reading begins and give names to the structuralized columns.

Example: Based on the project price list itemPrices.xlsx, calculate the total project price. Below is part of the file:

esProc SPL script:

AComment
1=file(“e:/excel/itemPrices.xlsx").xlsimport(;1,5)Parameters “1,5” means reading data in sheet 1 through to the end starting from the 5th row
2=A1.rename(#1:No,#2:ItemCode,#3:ItemName,#4:Unit,#5:Quantity,#6:Price,#7:Sum)Rename columns
3=A2.sum(Sum)Get the total price

3. Free style format

In a free style Excel file, each record corresponds to multiple rows; cells containing column values are directly under or on the right of the column name cell; and there may be cross-row or cross-column merged cells. What we know for sure is that the number and structure of rows that form each record are the same. When reading the file by loop, get each record by the number of rows they occupy.

Example: Load the free style Excel file employee.xlsx that stores employee information into database table employee. Below is part of the file:

esProc SPL script:

ABC
1=create(ID,Name,Sex,Position,Birthday,Phone,Address,PostCode)
2=file(“e:/excel/employe.xlsx").xlsopen()
3[C,C,F,C,C,D,C,C][1,2,2,3,4,5,7,8]
4for=A3.(~/B3(#)).(eval($[A2.xlscell(]/~/")"))
5if len(B4(1))==0break
6>A1.record(B4)
7>B3=B3.(~+9)
8=connect(“db”)
9=A8.update(A1,employee)
10=A8.close()

A1 Create an empty table sequence consisting of columns “ID, Name, Sex, Position, Birthday, Phone, Address and PostCode”.

A2 Open the Excel file

A3 Define a sequence of column numbers commanding the cells containing employee information

B3 Define a sequence of row numbers commanding the cells containing employee information

A4 Retrieve information of each employee using for statement

B4 A3.(~/B3(#)) gets the sequence of cells holding information of the current employee and then reads the cell values to form a sequence of information. In the first round of loop, the cells are [C1,C2,F2,C3,C4,D5,C7,C8]. In the second round of the loop, the cells are [C10,C11,F11,C12,C13,D14,C16,C17]; and so on. Add 9 to the current row number for each round of loop. $[A2.xlscell(] is equivalent to "A2.xlscell(". Both expression represent the same string. The former allows automatic adjustment of the cell it contains while the latter enclosed by the quotation marks does not. By inserting a row before the row holding A2, for example, the first expression will automatically become $[A3.xlscell(], while the second one will remain unchanged.

B5 Check whether the current employee ID is null. Exit the loop and stop data retrieval if it is null.

B6 Append an employee record to A1’s table sequence

B7 Add 9 to each of the sequence numbers of the rows holding the current employee information to retrieve information of the next employee

A8-A10 Connect to the database to load the employee information to database table employee and close the connection

Below is the data retrieved by A1:

4. Crosstab

Commonly seen in statistics, a crosstab is a matrix-type table that displays the quantitative relationship between two variables. A column to row/row to column transposition between a column/row variable and the crossed value will be performed according to another column/row variable after the crosstab data is retrieved row by row.

Example: Parse crosstab cross.xlsx,where the orders area and the shipping type are crossed, into structured data. Below is part of the file:

esProc SPL script:

AComment
1=file(“e:/excel/cross.xlsx").xlsimport@t(;1,2)Read in sheet 1 of the Excel file starting from the 2nd row; use 2 as the column name
2=A1.rename(#1:Type)Rename column 1 Type
3=A2.pivot@r(Type;Area,Amount)Perform transposition over A2’s table sequence; @r option enables a column to row transposition; the new column names after transposition are Area and Amount

Below is part of A3’s data:

5. Main & and sub tables

In an Excel file, if one sheet is a record that contains a number of (N) sub-records, that forms the main and sub tables. The number of the main table records is that of the sheets in the Excel file. To retrieve data from such type of Excel files, we need to create two tables to respectively store records of the main table and those of the sub table.

Example: In staff information file staff.xlsx, each sheet stores information of staff and their families. We need to parse the file into two structured data tables that respectively store the staff information and family information. Below is one of the Excel sheets:

esProc SPL script:

ABC
1=create(IDCard,Name,Sex,Birthday,Nation,Phone,Depart,Home,Marital,Entry)
2=create(IDCard,Name,Relation,Workplace,Phone)
3[B4,B3,D3,F3,H3,F4,H4,B5,F5,H5]
4=file(“e:/excel/staff.xlsx").xlsopen()
5for A4=A3.(eval($[A4.xlscell(]/~/",\""/A5.stname/"\")"))>A1.record(B5)
6=A4.xlsimport@t(Family,Name,Relation,Workplace,Phone;A5.stname,6)
7=B6.rename(Family:IDCard)>B7.run(IDCard=B5(1))
8>A2.insert@r(0:B7)

A1 Create an empty table consisting of fields IDCard, Name, Sex, Birthday, Nation, Phone, Depart, Home, Marital and Entry to store the staff information of the main table

A2 Create an empty table made up of fields IDCard, Name, Relation, Workplace and Phone to store the family information in the sub table

A3 Define a sequence of cells holding the staff information in the main table

A4 Open the Excel file

A5 Read each sheet of the Excel file by loop

B5 Read staff information in each sheet as a sequence

C5 Store B5’s staff information in A1’s table sequence

B6 Read family information starting from the 6th row; only 5 columns (Family, Name, Relation, Workplace and Phone) are read

B7 Change the name of column Family in B6’s table sequence into IDCard

C7 Assign IDCard values in the main table to column IDCard in B7’s table sequence

B8 Append family information in B7 to A2’s table sequence

Below is part of A1’s data:

Below is part of A2’s data:

6. Big Excel files

There are detailed explanations and examples of the parsing, structuralization and computation of big text files in Big Structured Text File Processing & Sample Programs. Here we illustrate it again with an Excel file:

Example: Based on orders information table orders.xlsx – which is big, calculate the total orders amount for each area. Below is part of the file:

esProc SPL script:

AComment
1=file(“e:/excel/orders.xlsx").xlsimport@tc()Read the Excel file; @t option enables reading row 1 as the column headers; @c option enables returning a cursor
2=A1.groups(Area;sum(Amount))Group rows by Area and sum orders amounts in each area

Find more examples in SPL CookBook.

Leave a Reply