XQJ Tutorial Part XI: Processing Large Inputs



This chapter explains how to handle and query large XML documents through the XQJ API.

The Roots of DOM

Since XML became a standard in the late 90's, we have been taught that XML is a tree; and the most intuitive (and popular) representation of such a tree has been (still is!) the Document Object Model (DOM).

When you think about querying XML documents, using XQuery, XSLT or XPath, you usually think about a processor that navigates the DOM tree, extracts values, compares the values it needs, and creates another DOM as a result of those operations. This is indeed what happens using typical XML processing implementations. Although today's processors use a more optimal representation than DOM, one problem remains the same — scalability.

What happens if the XML you need to process cannot be represented within the physical constraints of the memory available to your application? That's usually the limit that typical "in-memory" XQuery, XSLT, XPath implementations hit when processing DOM, and short of writing your own SAX- or StAX-based processing, you really have no alternatives. But what if you are able to forget about DOMs, forget about materializing the whole XML tree in memory, and do XML processing in a purely streaming fashion?

The Benefits of XML Streaming

Using an XQuery streaming processor, like DataDirect XQuery, is a good start. But a chain is only as strong as the weakest link. Beside the streaming capabilities of your XQuery implementation, the API must have the provision to handle those large XML fragments.

From an XQuery API perspective, it is crucial that the input to your query can be handled in a streaming fashion. In XQJ Tutorial Part VIII: Binding External Variables, we learned how to bind values to external variables declared in an XQuery. By default, binding a value to an XQExpression or XQPreparedExpression using bindXXX(), it is consumed during the binding process, and it stays active and valid for all subsequent execution cycles. We say that XQJ operates in ‘immediate binding mode.’

Let's look closely at one of the pipeline examples from XQJ Tutorial Part X: XML Pipelines in this series.

...
XQExpression xqe1;
XQSequence xqs1;

xqe1 = xqc.createExpression();
xqs1 = xqe1.executeQuery("doc('orders.xml')//order");

XQExpression xqe2;
xqe2 = xqc.createExpression();
xqe2.bindSequence(xqs1);
xqe1.close();

XQSequence xqs2;
xqs2 = xqe2.executeQuery(
"declare variable $orders as element(*,xs:untyped) external; " +
"for $order in $orders " +
"where $order/@status = 'closed' " +
"return " +
" <closed_order id = '{$order/@id}'>{ " +
" $order/* " +
" </closed_order>";
xqs2.writeSequence(System.out, null);
xqe2.close();
...

Binding and Buffering

During the bindSequence() call, the complete xqs1 sequence is consumed. Subsequently, we can safely close the xqe1 expression, freeing up any runtime resources it holds. On the other hand, consuming the complete sequence during bindSequence() implies that the XQJ implementation has to buffer the data one way or the other for subsequent query evaluations. All this works perfectly fine as long as we're handling relatively small XML instances. But as the data is buffered, it deprives the underlying XQuery processor of opportunities to take advantage of its streaming capabilities.

The default binding mode in XQJ is 'immediate,' which means the value bound to an external variable is consumed during the bindXXX() method. An application also has the ability to set the binding mode to 'deferred.' With deferred binding mode, the application gives a hint to the XQJ implementation and underlying XQuery processor to take advantage of its streaming capabilities. In deferred binding mode, bindings are only active for a single execution cycle. The application is required to explicitly re-bind values to every external variable before each execution.

But what if you know that the data bound to the external variable will be used for only a single XQuery execution. Is there then a way to inform the XQJ/XQuery implementation of possible optimization opportunities, and use its streaming capabilities?

You can change the binding mode through the XQStaticContext interface, as shown in the following example (refer to XQJ Tutorial Part VI: Manipulating the Static Context for more information on how to manipulate the static context):

...
XQStaticContext xqsc = xqc.getStaticContext();
// change the binding mode
xqsc.setBindingMode(XQConstants.BINDING_MODE_DEFERRED);
// make the changes effective
xqc.setStaticContext(xqsc);
...

In deferred mode, the application cannot assume that the bound value will be consumed during the invocation of the bindXXX() method. The XQJ implementation is free to read the bound value either at bind time or during the subsequent evaluation and processing of the query results. This has some consequences for resource clean-up: the first example will not work properly in deferred binding mode, for example. In that example, nNote that xqe1 was closed right after calling bindSequence(). We can address this by modifying the example as follows:

...
XQExpression xqe1;
XQSequence xqs1;

xqe1 = xqc.createExpression();
xqs1 = xqe1.executeQuery("doc('orders.xml')//orders");
XQExpression xqe2 = xqc.createExpression();
xqe2.bindSequence(xqs1);

XQSequence xqs2 = xqe2.executeQuery(
"declare variable $orders as element(*,xs:untyped) external; " +
"for $order in $orders " +
"where $order/@status = 'closed' " +
"return " +
" <closed_order id = '{$order/@id}'>{ " +
" $order/* " +
" </closed_order>";
xqs2.writeSequence(System.out, null);
xqe2.close();
xqe1.close();
...

Binding Streams for Efficient Processing

The previous example shows how to build a pipeline of XQueries. But the deferred binding mode also applies to the other bindXXX() methods. In the following example, we show how to bind a StreamSource to the context item. As the binding mode is deferred, the implementation can handle the query in streaming mode, and as such process huge XML documents that don't otherwise fit in available memory.

...
XQStaticContext xqsc = xqc.getStaticContext();
// change the binding mode
xqsc.setBindingMode(XQConstants.BINDING_MODE_DEFERRED);
// make the changes effective
xqc.setStaticContext(xqsc);

XQExpression xqe;
XQSequence xqs;

xqe = xqc.createExpression();
xqe.bindDocument(
XQConstants.CONTEXT_ITEM,
new StreamSource("large_orders_document.xml"));
xqs = xqe.executeExpression("/orders/order")
...

To conclude, using deferred binding mode requires a little more care than is immediately apparent. But the potential improvements when querying large XML documents are enormous. Of course, the API needs to provide the necessary functionality, but the heavy lifting is performed in the underlying XQuery processor — this is especially true with DataDirect XQuery, whose deferred binding mode allows you to take advantage of both XML document projection and its XML streaming capabilities. This allows efficient querying of XML documents in the hundreds of megabytes, even in the gigabytes!