Home Partners Company

Performance and Optimization Tips for XQuery

Preparing Queries

Download the PDF

The first time a query is executed, a significant percent of the total query time is spent preparing the query, which includes query optimization and generating the query plan. Because most programs invoke queries more than once, we spend the time needed to find an efficient way to process the query, even though that makes the first query slower. Fortunately, this effort does not need to be repeated. One way to ensure that a query is prepared only once in your program is to use a prepared query. (Another way is to use Query Pooling, described in the following section).

To parameterize a query, use external variables to change the values used for the query each time it is invoked. For instance, the following query creates a portfolio for a given user:

Example 1. A Query with an External Variable

declare variable $user as xs:string external;
collection('holdings')/holdings[userid=$user]

In this query, $user is an external variable which must be bound before executing the query. The following XQJ code shows how to prepare the query and bind a value to $user.

Example 2. Preparing a Query

// Get a connection, prepare the query
XQDataSource dataSource = new DDXQDataSource();
dataSource.setJdbcUrl("jdbc:xquery:sqlserver://server1:1433;databaseName=stocks");
XQConnection connection = dataSource.getConnection();
XQPreparedExpression preparedExpression = connection.prepareExpression(xqueryText);
// Bind variable $user to 'Jonathan' and execute
preparedExpression.bindString(new QName("user"), "Jonathan");
XQSequence xqSequence = preparedExpression.executeQuery();

The following code shows how to bind $user to a different value and execute prepared query again.

Example 3. Executing a Prepared Query with New Values

// Bind variable $user to 'Minollo' and execute
preparedExpression.bindString(new QName("user"), "Minollo");
xqSequence = preparedExpression.executeQuery();

Benchmarks that measure the speed of a query should mirror the usage you expect in your program. Since most programs execute a query many times, when you design a benchmark you should prepare your query, then execute it many times.

Query Pooling

Download the PDF

Prepared expressions let the programmer decide when a query should be prepared and when a prepared query is no longer needed. In programs that use hundreds or thousands of queries, it is generally better to let DataDirect XQuery® keep track of which queries are actually being used, preparing them the first time they are executed, and discarding the least frequently used prepared queries if there are too many queries. This is called query pooling.

In DataDirect XQuery®, you enable query pooling by specifying the number of prepared queries to hold in the pool. If you configure your connection by setting properties in the Java API, you can set the MaxPooledQueries property to do this; for instance, the following code sets the size of the query pool to 20:

Example 4. Enabling Query Pooling with the Java API

XQDataSource dataSource = new DDXQDataSource();
dataSource.setJdbcUrl("jdbc:xquery:sqlserver://server1:1433;
databaseName=stocks");
dataSource.setMaxPooledQueries(20);
XQConnection connection = dataSource.getConnection();

If you configure your connection using a Source Configuration File, you can set the size of the query pool using the maxPooledQueries element:

Example 5. Enabling Query Pooling in a Source Configuration File
<?xml version="1.0" encoding="UTF-8"?>
<XQJConnection xmlns="http://www.datadirect.com/xquery">
    <maxPooledQueries>20</maxPooledQueries>
    <JDBCConnection name="stocks">
<url>jdbc:xquery:sqlserver://localhost:1433</url>
<sqlxmlMapping>
<forest>true</forest>
<identifierEscaping>none</identifierEscaping>
</sqlxmlMapping>
</JDBCConnection>
</XQJConnection>

Querying Large XML Files


DataDirect XQuery® provides several ways to query an XML file. In DataDirect XQuery®, you will get better performance, especially for large XML files, if you use fn:doc() to access XML in a query. You will get much worse performance if you parse the XML to create a DOM tree, bind the DOM tree to an external variable, and query the external variable.

In most XQuery processors, a large percent of the time needed to query a large XML file is spent parsing the file and creating an in-memory representation that can be queried. The in-memory representation may be many times the size of the original XML file, and Java VM out of memory errors may occur at run time. When your query addresses an XML file using fn:doc(), DataDirect XQuery® uses a technique known as Document Projection, which allows it to create only the part of a document needed by the query. This results in dramatic improvements in memory usage and scalability, and significant improvements in performance. Using document projection, you can generally query documents many times the size of available memory. Incidentally, if your application uses XML Deployment Adapters, it addresses these adapters using fn:doc(), and DataDirect XQuery® uses Document Projection appropriately for these adapters just as it does for XML documents.

Document projection uses the path expressions in a query to determine what parts of a document need to be built. Path expressions that use wildcards ('*') provide less information for document projection, so they should be avoided when querying large documents. For instance, in the examples below, the second path expression will perform much better for large documents.

Example 6. Avoid path expressions with wildcards
doc("scenario.xml")//*[role='teacher']

Example 7. Specific path expressions aid document projection
doc("scenario.xml")/scenario/people/person[role='teacher']

If you parse a document to create a DOM tree, then bind the DOM tree to an external variable, you bypass DataDirect's Document Projection. This forces your program to create the entire in-memory representation of your XML document — not just the part that DataDirect XQuery® needs to process your queries. In addition to the savings due to Document Projection, DataDirect's internal representation of an XML document is much more efficient than most DOM implementations, so using fn:doc() gives you two substantial optimizations that you lose if you parse the document yourself.

DataDirect XQuery® also supports a second technique known as Streaming. Streaming processes a document sequentially, discarding portions of the document that are no longer needed to produce further query results. This reduces memory usage because only the portion of a document needed at a given stage of query processing is instantiated in memory. Streaming can reduce memory consumption, but generally involves a (relatively minor) performance penalty. Unlike document projection, streaming is not generally a performance win. Also, some queries require much of the document to be instantiated at a given time, diminishing the benefits of streaming. For some queries, however, streaming can be a substantial win, especially if the query results exceed available memory; therefore, we allow streaming to be enabled, but we do not enable it by default.

To enable streaming, set the ddtek:xml-streaming option to 'yes' in the XQuery Prolog:

Example 8. Enabling Streaming in the Query Prolog
declare option ddtek:xml-streaming 'yes';

You can also enable streaming at the DataSource level, which means streaming will be used for all queries that use the DataSource. One way to do this is to set the Pragmas property of the DDXQDataSource class.

Example 9. Enabling Streaming for a DataSource in the Java API
XQDataSource dataSource = new DDXQDataSource();
dataSource.setJdbcUrl("jdbc:xquery:sqlserver://server1:1433;
databaseName=stocks");
dataSource.setPragmas("xml-streaming=yes");

Another way to enable streaming at the DataSource level is to specify the declaration option using a pragma element in a Source Configuration File:

Example 10. Enabling Streaming for a DataSource in a Source Configuration File
<?xml version="1.0" encoding="UTF-8"?>
<XQJConnection xmlns="http://www.datadirect.com/xquery">
  <maxPooledQueries>20</maxPooledQueries>
  <pragma name="xml-streaming">yes</pragma>
  <JDBCConnection name="stocks">
  <url>jdbc:xquery:sqlserver://localhost:1433</url>
  <sqlxmlMapping>
  <forest>true</forest>
  <identifierEscaping>none</identifierEscaping>
  </sqlxmlMapping>
  </JDBCConnection>
</XQJConnection>

Retrieving Query Results

Download the PDF

DataDirect XQuery® is carefully designed to allow very large query results to be retrieved efficiently, and to allow incremental instantiation of query results, so that memory can also be used efficiently. If your query's results can be large, stream the results to an OutputStream or use SAX or StAX, so that retrieving the results does not introduce issues with memory or performance. Retrieving large query results as DOM can use large amounts of memory and significantly increase the time needed for processing queries.

If you want to write your query results to an output stream or a file, the easiest way is to use the writeSequence() method of the result set to write directly to the stream, as shown in the following code.

Example 11. Writing a Result Sequence to a Stream

XQSequence sequence = expression.executeQuery(query);
OutputStream os = new FileOutputStream("C:/temp/results.xml");
sequence.writeSequence(os);

Many XML programs process XML using SAX, and DataDirect XQuery® can write query results directly to a SAX event handler, as shown in the following code.

Example 12. Writing to a SAX Event Handler

CustomSAXEventHandler eventHandler = new CustomSAXEventHandler();
sequence.writeSequenceToSAX(eventHandler);

For many kinds of programming, pull processing using StAX is both convenient and efficient. You can get a StAX stream from a result sequence using the getSequenceAsStream() method:

Example 13. Retrieving Results as a StAX Stream

XMLStreamReader reader = sequence.getSequenceAsStream();

Avoiding Compensation in WHERE Clauses and Predicates

Download the PDF

When a database is queried, DataDirect XQuery® usually pushes queries to the database, where they are executed in SQL. However, some expressions, functions and casts are not supported by SQL, and are therefore performed in program memory, a strategy known as compensation. Unfortunately, when these operations are performed in main memory, they can not take advantage of indexes and other optimizations available in the database server. This can dramatically hurt performance.

The DataDirect XQuery® User's Guide and Reference clearly states which expressions, functions, and operators are compensated. Most of the commonly used functions that are compensated fall into the following categories:

  • String Functions that Use Pattern Matching
  • Date and Time Functions for XML Schema Date/Time Datatypes
  • Casts to XML Schema Datatypes not supported by the underlying database.
  • XML whitespace functions
  • FLWOR Expressions using the AT Clause
  • Numeric Predicates

When querying relational data, if a compensated function or operator is used in the WHERE clause of a FLWOR expression or in a predicate of a path expression, performance can suffer dramatically. For instance, in the following query, normalize-space() is compensated, and forces evaluation in memory; since this function must be called once for every row, the performance hit is very significant.

Example14. Avoid using compensated functions in a WHERE clause

for $h in collection('stocks.dbo.historical')/historical
where normalize-space($h/ticker) = 'AMZN'
return $h

One way to avoid compensation is to use a SQL function instead of a compensated XQuery function, as in the following example.

Example 15. Using a SQL function instead of a compensated function

declare function ddtek-sql:rtrim($in as xs:string) as xs:string external;
for $h in collection('historical')/historical
where ddtek-sql:rtrim($h/ticker) = 'AMZN'
return $h

When querying relational data, using numeric predicates or an AT clause in a FLWOR expression forces compensation, which can dramatically impact performance, especially when it forces a join to be done in memory. Consider the following query.

Example 16. Avoid AT clauses when querying relational data

for $user at $i in collection("users")/users,
$hold at $j in collection("holdings")/holdings
return
<holding>
<userid>{ $i }</userid>
<user>{ string($user/userid) }</user>
<stockid>{ $j }</stockid>
<stock>{ string($hold/stockticker) }</stock>
<share>{ string($hold/shares }</share>
</holding>

This query uses AT clauses to identify the individual users and stocks, and these AT clauses force DataDirect XQuery® to do the joins in memory instead of letting the database engine do them. Not only is this a dramatic performance hit, it's also a bad way to identify the users and holdings — because rows in a relational table do not generally have a guaranteed stable order, these identifiers might change each time the query is run. It is generally much better to use the primary key from a table to identify the rows of that table.

Example 17. For identifiers, rely on the primary keys from relational tables

(:-------------------------------------------------------------------
: userid is the primary key for collection("users")
: stockticker, userid is the primary key for collection("holdings")
--------------------------------------------------------------------:)
for $user in collection("users")/users,
$hold in collection("holdings")/holdings
return
<holding>
<user>{ string($user/userid) }</user>
<stock>{ string($hold/stockticker) }</stock>
<share>{ string($hold/shares }</share>
</holding>

Specifying SQL Generation Algorithms

Download the PDF

The algorithms DataDirect XQuery® uses to create hierarchies are optimal for creating large, complex documents, but still good for smaller documents with shallower hierarchies. If query results contain four or fewer levels of hierarchy, and your performance seems inadequate, you will sometimes get better performance if you tell DataDirect XQuery® to use either outer join or sorted outer union (see the DataDirect 2.0 User's Guide and Reference for details on these algorithms). There are several ways to specify SQL generation algorithms in DataDirect XQuery®:

Example 18. Specifying SQL Generation Algorithms

The algorithm can be specified via the XQJ API:

XQDataSource ds = new DDXQDataSource();
ds.setJdbcPragmas("sql-rewrite-algorithm=nested-loop");

The algorithm can be specified with an entry in the source configuration file:

<pragma name="sql-rewrite-algorithm">outer-join</pragma>

The algorithm can be specified in the XQuery prolog with a declaration option:

declare option ddtek:sql-rewrite-algorithm "outer-join";

How to Split XML Files

A common challenge in XML development is splitting a large XML document in several smaller ones. In some scenarios such huge XML documents are simply unmanageable, requiring the need to split XML.

Split XML Files: An Example

Consider one of the Shakespeare plays marked up in XML, for example Hamlet. We want to split XML for this play, each speech in a separate XML document.

for $speech at $i in
/PLAY//SPEECH
let $url := concat("C:/SHAKESPEAR/",
string-join($speech/SPEAKER,"_"),
$i, ".xml")
return ddtek:serialize-to-url($speech, $url,
"omit-xml-declaration=no,indent=yes")

It outputs 1138 files in the C:\SHAKESPEAR directory.

Or you might need to generate an HTML file for each speech, and have a document referencing them all. In the next example, each of the speech HTML files are created through ddtek:serialize-to-url, the query result is the master HTML referencing the others.

<html>{
for $speech at $i in
/PLAY//SPEECH
let $url := concat("C:/SHAKESPEAR/",
string-join($speech/SPEAKER,"_"),
$i, ".html")
let $htmlspeech := <html>
<b>{$speech/SPEAKER/text()}</b>:<br/>{
for $line in $speech/LINE
return
($line/text(),<br/>)
}</html>
return
(
ddtek:serialize-to-url($htmlspeech, $url, "method=html"),
<a href="{$url}">{
$speech/SPEAKER/text()
}</a>,
<br/>
)
}</html>

Using the ddtek:serialize-to-url function it becomes very natural to split XML documents in multiple ones. And of course, all this taking advantage of DataDirect XQuery's XML Streaming capabilities, enabling support for huge XML documents.

How to Split XML Files

In this tutorial you learned how to split XML into multiple files using XQuery. It's easy and you can split an XML file yourself, download a free trial of DataDirect XQuery today.

Summary

Download the PDF

DataDirect XQuery® was designed for performance and scalability. It does sophisticated query optimization, supports prepared queries and query pooling, and efficiently queries large XML documents using document projection and streaming. If you follow these few simple guidelines, your DataDirect XQuery® applications will usually have good performance:

  • Prepare your queries, or use query pooling.
  • Use fn:doc() when possible for XML input; avoid parsing XML input yourself.
  • For large XML files, avoid wildcards ('*'), and use specific paths.
  • For large XML files, consider streaming if memory usage is large and performance is unsatisfactory.
  • Use SAX, StAX, or output streams for output; avoid using the DOM for output.
  • Avoid compensation in WHERE clauses and predicates.
  • Avoid AT clauses in FLWOR expressions that query relational tables.
  • Avoid numeric predicates in expressions that operate on relational data.
  • For shallow hierarchies, consider specifying a SQL generation algorithm.
TUTORIAL

Performance and Optimization Tips for XQuery

jdbc

DataDirect
XQuery Connectors

XQuery Integration to Various Data Sources

community

DataDirect
XQuery Tutorials

Discussions, tips and tricks for
DataDirect Connect drivers