Looking for the Best Class Library for Computing XML Data

XML's strength is its flexibility in representing data; its weakness is that it is inconvenient to compute on. A class library is needed to make such computation easier. Let's examine several common XML class libraries and compare their syntactic expressiveness, deployment and configuration, and data source support.


dom4j

XML is a veteran data format, and every major language provides class libraries for computing it. Java alone has a dozen or so, including dom4j, JDOM, Woodstox, XOM, Xerces-J, and Crimson, among which dom4j is the most mature. The example below shows its syntactic expressiveness.

The file Employees_Orders.xml stores employees and their orders. Below is part of the source data:

<?xml version="1.0"  encoding="UTF-8"?>
 <State>"New  York"</State>
 <State>"New  Mexico"</State>

Use dom4j to find orders whose amounts are between 1000 and 3000 and whose clients contain “bro”. Below is Java code for doing this:

package org.example;

import org.dom4j.Document;
import org.dom4j.Node;
import org.dom4j.io.SAXReader;
import java.io.File;
import java.util.List;

public class App {
  public static void main(String[] args) throws Exception {
    SAXReader saxReader = SAXReader.createDefault();
    Document doc = saxReader.read(new File("D:\\xml\\Employees_Orders.xml"));
    // XPath: orders with 1000 < Amount <= 3000 whose Client contains "bro"
    List<Node> list = doc.selectNodes("/xml/row/Orders[Amount>1000 and Amount<=3000 and contains(Client,'bro')]");
    System.out.println("--------------count of the current resultSet=" + list.size());
    for (Node n : list) {
      String OrderID = n.selectSingleNode("./OrderID").getText();
      String Client = n.selectSingleNode("./Client").getText();
      String SellerId = n.selectSingleNode("./SellerId").getText();
      String Amount = n.selectSingleNode("./Amount").getText();
      String OrderDate = n.selectSingleNode("./OrderDate").getText();
      // Print one matching order per line
      System.out.println(OrderID + "\t" + Client + "\t" + SellerId + "\t" + Amount + "\t" + OrderDate);
    }
  }
}

In the above code, /xml/row/Orders defines the query range, and the predicate Amount>1000 and Amount<=3000 and contains(Client,'bro') defines the query condition. The query syntax is XPath (XQuery is a superset of it), which has a history of over two decades. XPath is concise and easy to learn, and it offers a wealth of functions for conditional queries, such as the mathematical functions abs and floor, the string functions compare and substring, and the date functions year-from-date and timezone-from-time. Note that abs, compare and the date functions were introduced in XPath 2.0, whereas Jaxen, the XPath engine dom4j relies on, implements XPath 1.0, so it is worth checking which functions are actually available.
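To see the range and the predicate at work without any third-party dependency, here is a minimal sketch that runs the same XPath expression through the JDK's built-in javax.xml.xpath API rather than dom4j. The class name and the inline sample data are made up for illustration; only the element names come from the query above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathFilterSketch {
    // Hypothetical sample data; element names follow the query in the article.
    static final String XML =
        "<xml><row><Name>Rebecca</Name>"
      + "<Orders><OrderID>1</OrderID><Client>Ambrose</Client><Amount>1500</Amount></Orders>"
      + "<Orders><OrderID>2</OrderID><Client>Ambrose</Client><Amount>5000</Amount></Orders>"
      + "<Orders><OrderID>3</OrderID><Client>Delta</Client><Amount>2000</Amount></Orders>"
      + "</row></xml>";

    // Returns the OrderID of every order that satisfies the XPath predicate.
    static List<String> matchingOrderIds(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(
            "/xml/row/Orders[Amount>1000 and Amount<=3000 and contains(Client,'bro')]/OrderID",
            new InputSource(new StringReader(xml)),
            XPathConstants.NODESET);
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            ids.add(hits.item(i).getTextContent());
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // Only order 1 is in range and has a client name containing "bro"
        System.out.println(matchingOrderIds(XML));
    }
}
```

Order 2 fails the amount range and order 3 fails the contains test, so only order 1 is returned. Note that contains is case-sensitive.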

dom4j (XPath) is expressive enough for conditional queries, but it does not support sorting, distinct, grouping, aggregation, set-oriented operations or joins. A complete data computation usually involves more than one type of operation, so a class library that handles only conditional queries well is not an all-round tool.
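What this limitation means in practice can be sketched as follows: XPath selects the order nodes, and the grouping and summing have to be written by hand in the host language. The sketch uses the JDK's own XPath API and hypothetical sample data with unquoted dates; it is one possible way to do it, not a dom4j feature.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class GroupAfterXPathSketch {
    // Hypothetical sample data; element names follow the article's file.
    static final String XML = "<xml><row>"
      + "<Orders><Amount>100.5</Amount><OrderDate>2009-12-02</OrderDate></Orders>"
      + "<Orders><Amount>200</Amount><OrderDate>2010-09-03</OrderDate></Orders>"
      + "<Orders><Amount>300</Amount><OrderDate>2010-12-18</OrderDate></Orders>"
      + "</row></xml>";

    // XPath only selects the nodes; grouping by year and summing happen in Java.
    static Map<Integer, Double> totalByYear(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList orders = (NodeList) xpath.evaluate(
            "/xml/row/Orders",
            new InputSource(new StringReader(xml)),
            XPathConstants.NODESET);
        Map<Integer, Double> totals = new TreeMap<>();
        for (int i = 0; i < orders.getLength(); i++) {
            Element order = (Element) orders.item(i);
            String date = order.getElementsByTagName("OrderDate").item(0).getTextContent();
            String amount = order.getElementsByTagName("Amount").item(0).getTextContent();
            int year = Integer.parseInt(date.substring(0, 4));
            totals.merge(year, Double.parseDouble(amount), Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(totalByYear(XML));
    }
}
```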

The class library's data source support is less than satisfactory. It supports retrieving data from files only, yet the most common XML data sources are WebService and HTTP.

Deploying dom4j (XPath) is convenient. Users just need to add the dom4j and jaxen dependencies to the Maven project.


MySQL

Most long-established relational databases, including DB2, Oracle, MSSQL and MySQL, support XML data computation. Among them, MySQL is the most widely used in real-world business scenarios.

We can handle the previous conditional query using the following SQL + Java code:

package org.example;

import java.io.File;
import java.io.FileInputStream;
import java.sql.*;

public class App {
  public static void main(String[] args) throws Exception {
    // Supply host, port and database in the connection URL
    Connection conn = DriverManager.getConnection("jdbc:mysql://", "root", "runqian");
    Statement statement = conn.createStatement();
    statement.execute("drop table if exists testtable");
    statement.execute("CREATE TABLE testtable (testxml MEDIUMTEXT) ENGINE=InnoDB DEFAULT CHARSET=UTF8");
    // The whole XML file is stored as a single record
    statement.execute("insert into testtable values('" + readFile("D:\\xml\\Employees_Orders.xml") + "')");
    String conditionSQL = "" +
      "with recursive old as (" +
      "select extractvalue(testxml,'/xml/row/Orders[Amount>1000 and Amount<=3000 and contains(Client,\"bro\")]/OrderID') oneLine1, " +
      "  extractvalue(testxml,'/xml/row/Orders[Amount>1000 and Amount<=3000 and contains(Client,\"bro\")]/Client') oneLine2, " +
      "  extractvalue(testxml,'/xml/row/Orders[Amount>1000 and Amount<=3000 and contains(Client,\"bro\")]/SellerId') oneLine3," +
      "  extractvalue(testxml,'/xml/row/Orders[Amount>1000 and Amount<=3000 and contains(Client,\"bro\")]/Amount') oneLine4," +
      "  extractvalue(testxml,'/xml/row/Orders[Amount>1000 and Amount<=3000 and contains(Client,\"bro\")]/OrderDate') oneLine5 " +
      " from testtable" +
      ")," +
      "N as (" +
      " select 1 as n" +
      " union select n + 1 from N, old" +
      " where n <= length(oneLine1) - length(replace(oneLine1,' ',''))" +
      ")" +
      "select substring_index(substring_index(oneLine1,' ', n),' ', -1) OrderID," +
      "  substring_index(substring_index(oneLine2,' ', n),' ', -1) Client, " +
      "  substring_index(substring_index(oneLine3,' ', n),' ', -1) SellerId, " +
      "  substring_index(substring_index(oneLine4,' ', n),' ', -1) Amount, " +
      "  substring_index(substring_index(oneLine5,' ', n),' ', -1) OrderDate " +
      "from N, old";
    ResultSet results = statement.executeQuery(conditionSQL);
    printResult(results);
    if (conn != null)
      conn.close();
  }

  // Prints the column names, then one tab-separated line per record
  public static void printResult(ResultSet rs) throws Exception {
    int colCount = rs.getMetaData().getColumnCount();
    for (int i = 1; i < colCount + 1; i++) {
      System.out.print(rs.getMetaData().getColumnName(i) + "\t");
    }
    System.out.println();
    while (rs.next()) {
      for (int i = 1; i < colCount + 1; i++) {
        System.out.print(rs.getString(i) + "\t");
      }
      System.out.println();
    }
  }

  // Reads the whole file into a UTF-8 string
  public static String readFile(String fileName) throws Exception {
    File file = new File(fileName);
    Long fileLength = file.length();
    byte[] fileContent = new byte[fileLength.intValue()];
    FileInputStream in = new FileInputStream(file);
    in.read(fileContent);
    in.close();
    return new String(fileContent, "UTF-8");
  }
}

The logic of the code is straightforward: it creates the table testtable in MySQL, reads the XML string from Employees_Orders.xml, inserts it into testtable as one record, and queries the table using SQL. Below is part of the result set:

OrderID Client SellerId Amount OrderDate
49 "SPLI" 5 1050.6 "2010-09-03"
122 "SPL" 8 2527.2 "2009-12-02"
140 "OFS" 8 1058.4 "2010-12-18"

The SQL query is the most difficult part of the above Java code. The extractvalue function parses the XML data. It supports XPath query syntax and joins the query results (all order dates, for example) into one long space-separated string. Splitting that long string back into separate values (one order date per record, for instance) requires the complicated with statement.
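The splitting trick is easier to follow outside SQL. The sketch below mimics, in plain Java, what the nested substring_index calls do to one space-separated string: take everything before the n-th space, then keep what follows the last space. The class and method names are hypothetical, and the helpers only cover the cases used here (positive n and -1).

```java
import java.util.ArrayList;
import java.util.List;

class SubstringIndexSketch {
    // Mimics MySQL substring_index(s, delim, n) for n > 0:
    // everything before the n-th delimiter, or the whole string if there are fewer.
    static String headUntil(String s, String delim, int n) {
        int from = 0;
        int pos = -1;
        for (int i = 0; i < n; i++) {
            pos = s.indexOf(delim, from);
            if (pos < 0) {
                return s;
            }
            from = pos + delim.length();
        }
        return s.substring(0, pos);
    }

    // Mimics substring_index(s, delim, -1): everything after the last delimiter.
    static String lastToken(String s, String delim) {
        int pos = s.lastIndexOf(delim);
        return pos < 0 ? s : s.substring(pos + delim.length());
    }

    // The n-th space-separated value, as the nested call in the query computes it.
    static String nthToken(String s, int n) {
        return lastToken(headUntil(s, " ", n), " ");
    }

    // Plays the role of the recursive CTE N: n runs from 1 to (number of spaces + 1).
    static List<String> splitAll(String s) {
        int count = s.length() - s.replace(" ", "").length() + 1;
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= count; n++) {
            out.add(nthToken(s, n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(splitAll("49 122 140"));
        // A value that itself contains a space is cut into two pieces
        System.out.println(splitAll("\"New York\" \"Texas\""));
    }
}
```

The second call shows a side effect of splitting on spaces: a value that itself contains a space is cut into two pieces, so the approach is only safe for space-free values.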

The code above runs the conditional query without splitting the XML data. An alternative is to split the data first: separate the XML file into an employee part and an order part, split each part into records and import them into the database, and then run the conditional query on the orders table. The SQL becomes much simpler, but XML's flexibility in representing data is no longer put to use.

The code above also relies on SQL alone for the conditional query. We could bring Java into the computation instead: parse the XML with SQL, then use Java to convert the single record into N records. The SQL again becomes simpler, but the hardest part does not go away; it merely moves to Java. Java, moreover, is not good at conditional queries, so the split records must be written back to the database before they can be queried, which adds processing work.

Although the code is complicated, MySQL is expressive enough to deal with most common computations. To group orders by year and sum the order amounts in each group, for instance, the MySQL code is as follows:

with recursive old as (
  select extractvalue(testxml,'/xml/row/Orders/OrderID') oneLine1,
         extractvalue(testxml,'/xml/row/Orders/Client') oneLine2,
         extractvalue(testxml,'/xml/row/Orders/SellerId') oneLine3,
         extractvalue(testxml,'/xml/row/Orders/Amount') oneLine4,
         extractvalue(testxml,'/xml/row/Orders/OrderDate') oneLine5
  from testtable
),
N as (
  select 1 as n
  union select n + 1 from N, old
  where n <= length(oneLine1) - length(replace(oneLine1,' ',''))
),
query as (
  select substring_index(substring_index(oneLine1,' ', n),' ', -1) OrderID,
         substring_index(substring_index(oneLine2,' ', n),' ', -1) Client,
         substring_index(substring_index(oneLine3,' ', n),' ', -1) SellerId,
         substring_index(substring_index(oneLine4,' ', n),' ', -1) Amount,
         STR_TO_DATE(substring_index(substring_index(oneLine5,' ', n),' ', -1),'"%Y-%m-%d"') OrderDate
  from N, old)
select year(OrderDate),sum(Amount) from query group by year(OrderDate)

Here is another example. To join the employee table and the orders table and retrieve certain fields, MySQL needs even more complicated code (shown below), and the recursive queries it involves reduce efficiency.

with recursive oldOrders as (
  select extractvalue(testxml,'/xml/row/Orders/OrderID') oneLine1,
         extractvalue(testxml,'/xml/row/Orders/Client') oneLine2,
         extractvalue(testxml,'/xml/row/Orders/SellerId') oneLine3,
         extractvalue(testxml,'/xml/row/Orders/Amount') oneLine4,
         extractvalue(testxml,'/xml/row/Orders/OrderDate') oneLine5
  from testtable
),
N as (
  select 1 as n
  union select n + 1 from N, oldOrders
  where n <= length(oneLine1) - length(replace(oneLine1,' ',''))
),
Orders as (
  select substring_index(substring_index(oneLine1,' ', n),' ', -1) OrderID,
         substring_index(substring_index(oneLine2,' ', n),' ', -1) Client,
         substring_index(substring_index(oneLine3,' ', n),' ', -1) SellerId,
         substring_index(substring_index(oneLine4,' ', n),' ', -1) Amount,
         STR_TO_DATE(substring_index(substring_index(oneLine5,' ', n),' ', -1),'"%Y-%m-%d"') OrderDate
  from N, oldOrders),
oldEmp as (
  select extractvalue(testxml,'/xml/row/EId') oneLine1,
         extractvalue(testxml,'/xml/row/Dept') oneLine2,
         extractvalue(testxml,'/xml/row/Name') oneLine3,
         extractvalue(testxml,'/xml/row/Gender') oneLine4
  from testtable),
N1 as (
  select 1 as n
  union select n + 1 from N1, oldEmp
  where n <= length(oneLine1) - length(replace(oneLine1,' ',''))
),
Emp as (
  select substring_index(substring_index(oneLine1,' ', n),' ', -1) EId,
         substring_index(substring_index(oneLine2,' ', n),' ', -1) Dept,
         substring_index(substring_index(oneLine3,' ', n),' ', -1) Name,
         substring_index(substring_index(oneLine4,' ', n),' ', -1) Gender
  from N1, oldEmp)
select Orders.OrderID,Emp.Name from Orders,Emp where Orders.SellerId=Emp.EId

MySQL's data source support is weak. It cannot retrieve data from WebService or HTTP sources, and even basic file retrieval has to be hardcoded. The file data must also be written into the database before it can be computed.

Deploying and configuring MySQL is convenient. Users only need to add the JDBC driver jar.


Scala

Scala is a popular, widely used language for structured data computation, and it has plenty of third-party libraries. For computing XML data, the libraries used are Spark and the Databricks spark-xml package.

To achieve the previous conditional query, Scala has the following code:

package test

import com.databricks.spark.xml.XmlDataFrameReader
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object xmlTest {
  def main(args: Array[String]): Unit = {
    // Local session; no Spark service needs to be deployed
    val spark = SparkSession.builder()
      .master("local")
      .getOrCreate()
    // Each <row> element becomes one DataFrame row
    val df = spark.read
      .option("rowTag", "row")
      .xml("D:\\xml\\Employees_Orders.xml")
    val Orders = df.select(explode(df("Orders"))).select("col.OrderID", "col.Client", "col.SellerId", "col.Amount", "col.OrderDate")
    val condition = Orders.where("Amount>1000 and Amount<=3000 and Client like '%bro%'")
    condition.show()
  }
}

The code reads the XML file as a multilevel DataFrame object, gets all orders using the explode function, and performs the conditional query using the where function.

You can also use Scala to group and summarize XML data. The code is as follows:

// Remove extra quotation marks from both sides of each OrderDate value
val ordersWithDateType = Orders.withColumn("OrderDate", regexp_replace(col("OrderDate"), "\"", ""))
val groupBy = ordersWithDateType.groupBy(year(ordersWithDateType("OrderDate"))).agg(sum("Amount"))

To join the employee table and the orders table, Scala uses the following code:

val df1 = df.select(df("Name"), df("Gender"), df("Dept"), explode(df("Orders")))
val relation = df1.select("Name", "Gender", "Dept", "col.OrderID", "col.Client", "col.SellerId", "col.Amount", "col.OrderDate")

Scala is quite expressive for common computations. Its code is short and easy to understand, and it is easier to learn than the MySQL approach. When implementing a join, Scala gets the target values directly from the multilevel XML data without first creating two two-dimensional tables. The logic is considerably simpler, the code is much shorter, and execution is efficient.

The code is short and readable because DataFrame supports multilevel data and can therefore represent the XML structure naturally, which makes DataFrame-based functions convenient for computing multilevel data.
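The idea of reading values straight from the nested structure, instead of building two flat tables and joining them, can be sketched in plain Java with the JDK's DOM and XPath APIs. This is only an analogy to what explode does on a nested DataFrame, not Spark code; the class name and sample data are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FlattenJoinSketch {
    // Hypothetical sample data: each employee row nests its own orders.
    static final String XML = "<xml>"
      + "<row><Name>Rebecca</Name><Dept>Sales</Dept>"
      +   "<Orders><OrderID>70</OrderID><Amount>288</Amount></Orders>"
      +   "<Orders><OrderID>131</OrderID><Amount>1920</Amount></Orders></row>"
      + "<row><Name>Ashley</Name><Dept>Finance</Dept>"
      +   "<Orders><OrderID>49</OrderID><Amount>1050.6</Amount></Orders></row>"
      + "</xml>";

    // One "Name,OrderID" line per order; the employee is simply the order's parent row.
    static List<String> flatten(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList orders = (NodeList) xpath.evaluate("/xml/row/Orders", doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < orders.getLength(); i++) {
            Element order = (Element) orders.item(i);
            Element row = (Element) order.getParentNode();
            out.add(xpath.evaluate("Name", row) + "," + xpath.evaluate("OrderID", order));
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        flatten(XML).forEach(System.out::println);
    }
}
```

No join key is compared here: the parent-child relationship in the document already carries the association.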

Scala offers excellent data source support. It has a dedicated function for retrieving XML data from many data sources, including files, WebService and HTTP.

To deploy and configure Scala for XML data computation, users just need to add the Databricks spark-xml and Spark libraries; there is no need to deploy a Spark service.

esProc SPL

esProc SPL is a professional open-source structured data computation language. Like Scala, it computes data from any data source using uniform syntax and data structures, yet it is lighter and its syntax is simpler.

esProc SPL handles the previous conditional query in the following way:

A3  =A2.select(Amount>1000 && Amount<=3000 && like@c(Client,"*bro*"))

SPL reads the XML data as a multilevel table sequence object (similar to a Scala DataFrame), concatenates all orders via the conj function, and performs the conditional query through the select function.

This block of code can be debugged or executed in esProc IDE, or stored as a script file (like condition.dfx) for invocation from a Java program through the JDBC interface. Below is the code for invocation:

package Test;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class test1 {
  public static void main(String[] args) throws Exception {
    Connection connection = DriverManager.getConnection("jdbc:esproc:local://");
    Statement statement = connection.createStatement();
    // Invoke the SPL script condition.dfx like a stored procedure
    ResultSet result = statement.executeQuery("call condition()");
    if (connection != null) connection.close();
  }
}

SPL achieves grouping and aggregation using the following code:


SPL implements join operations using the code below:


As the above code shows, SPL has the most powerful syntactic expressiveness of the four: it handles the common operations, produces concise and readable code, and integrates with a Java program in a loosely coupled way. Its table sequence object supports multilevel data, and its operators can retrieve values directly from multilevel data during a join, which shortens the code further.

With the powerful syntactic expressiveness, SPL can simplify multilevel XML data computation in most cases. Here is one example:

The file book1.xml stores book information, where the author node has author name and country attributes and where certain books have more than one author. Below is part of the XML file:

<?xml version="1.0"?>
  <book category="COOKING">
    <title>Everyday Italian</title>
    <author name="Giada De Laurentiis" country="it"></author>
    <info>Hello Italian!</info>
  <book category="CHILDREN">
    <title>Harry Potter</title>
    <author name="J K. Rowling" country="uk"></author>
    <info>Hello Potter!</info>
  <book category="WEB">
    <title>XQuery Kick Start</title>
    <author name="James McGovern" country="us"></author>
    <author name="Per Bothner" country="us"></author>
    <info>Hello XQuery</info>
  <book category="WEB">
    <title>Learning XML</title>
    <author name="Erik T. Ray" country="us"></author>
    <info>Hello XML!</info>

The task is to rearrange the XML data into a structured two-dimensional table, where each author field value is displayed in the format “author name[country]” and multiple authors are separated by commas, and then to query the table for books published in 2005. Below is the result set:

Everyday Italian | COOKING | 2005 | Giada De Laurentiis[it] | Hello Italian!
Harry Potter | CHILDREN | 2005 | J K. Rowling[uk] | Hello Potter!
XQuery Kick Start | WEB | 2005 | James McGovern[us],Per Bothner[us] | Hello XQuery
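For a sense of the work involved, here is a rough plain-Java sketch of the same rearrangement using the JDK's DOM and XPath APIs. The bookstore root element, the year element and the 2003 value for the last book are assumptions made for the sample, since the excerpt above does not show them; the class and method names are hypothetical too.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BookTableSketch {
    // Assumed sample: root name, <year> elements and the 2003 value are not in the excerpt.
    static final String XML = "<bookstore>"
      + "<book category=\"COOKING\"><title>Everyday Italian</title>"
      +   "<author name=\"Giada De Laurentiis\" country=\"it\"/>"
      +   "<year>2005</year><info>Hello Italian!</info></book>"
      + "<book category=\"WEB\"><title>XQuery Kick Start</title>"
      +   "<author name=\"James McGovern\" country=\"us\"/>"
      +   "<author name=\"Per Bothner\" country=\"us\"/>"
      +   "<year>2005</year><info>Hello XQuery</info></book>"
      + "<book category=\"WEB\"><title>Learning XML</title>"
      +   "<author name=\"Erik T. Ray\" country=\"us\"/>"
      +   "<year>2003</year><info>Hello XML!</info></book>"
      + "</bookstore>";

    // One tab-separated line per book of the given year: title, category, year, authors, info.
    static List<String> booksOf(String xml, int year) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList books = (NodeList) xpath.evaluate(
            "/bookstore/book[year=" + year + "]", doc, XPathConstants.NODESET);
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < books.getLength(); i++) {
            Element book = (Element) books.item(i);
            // Collapse all authors of the book into "name[country],name[country]"
            NodeList authors = book.getElementsByTagName("author");
            StringJoiner joined = new StringJoiner(",");
            for (int j = 0; j < authors.getLength(); j++) {
                Element a = (Element) authors.item(j);
                joined.add(a.getAttribute("name") + "[" + a.getAttribute("country") + "]");
            }
            rows.add(String.join("\t",
                xpath.evaluate("title", book),
                book.getAttribute("category"),
                xpath.evaluate("year", book),
                joined.toString(),
                xpath.evaluate("info", book)));
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        booksOf(XML, 2005).forEach(System.out::println);
    }
}
```

Even for this small task, the attribute handling, the author concatenation and the filtering each need explicit code.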

The computing task is not easy, but SPL makes it easier using the following code:


SPL provides great data source support. It has a dedicated function for retrieving XML data from a variety of data sources, including files, WebService and HTTP.

Reading and writing XML data is one of SPL's basic features, so no special deployment is needed. To integrate an SPL script into a Java program, you just need to add the relevant jars, which is convenient.

In a nutshell, esProc SPL has the most powerful syntactic expressiveness and can simplify multilevel XML data computations. Scala has good syntactic expressiveness and handles the common operations. MySQL is expressive enough, but its code is too complicated, except when the XML file has a relatively simple structure and can be split and written into the database. dom4j is too weak to handle common computations and is suitable only for pure conditional queries. As for data source support, both esProc SPL and Scala provide a wealth of choices. As for deployment, dom4j and MySQL are simple, and the other two are easy as well.
