XQJ Part XI - Processing large inputs

Posted on November 25, 2007

Today's post in the XQJ series explains how to handle and query large XML documents through the XQJ API.

 

Since XML became a standard in the late 1990s, we have been taught that XML is a tree, and the most intuitive (and popular) representation of such a tree has been, and still is, the Document Object Model (DOM).

 

When you think about querying XML documents with XQuery, XSLT or XPath, you usually picture a processor that navigates the DOM tree, extracts and compares the values it needs, and creates another DOM as the result of those operations. That is indeed what happens in typical XML processing implementations. Although today's processors use more efficient representations than DOM, the problem remains the same: scalability.
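To make the in-memory model concrete, here is a minimal sketch that uses only the JDK's DOM and XPath classes rather than XQJ. The class name InMemoryCount, the method countClosedOrders and the sample document are made up for illustration. Note that parse() builds the complete tree before the query looks at a single node.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not part of XQJ or DataDirect XQuery.
final class InMemoryCount {

    // Counts /orders/order elements whose status attribute is 'closed'.
    static int countClosedOrders(String xml) {
        try {
            // parse() materializes the whole document as a DOM tree in memory.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // The XPath engine then navigates that tree.
            XPath xpath = XPathFactory.newInstance().newXPath();
            Double count = (Double) xpath.evaluate(
                    "count(/orders/order[@status='closed'])",
                    doc, XPathConstants.NUMBER);
            return count.intValue();
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("could not parse or query the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<orders>"
                + "<order id='1' status='closed'><item>a</item></order>"
                + "<order id='2' status='open'><item>b</item></order>"
                + "<order id='3' status='closed'><item>c</item></order>"
                + "</orders>";
        System.out.println(countClosedOrders(xml)); // prints 2
    }
}
```

For a document this small the approach is perfectly fine. Memory use grows with the size of the document, though, which is exactly the scalability problem described above.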

 

What happens if the XML you are dealing with cannot be represented in the physical constraints of the memory available to your application? That's usually the limit that typical "in-memory" XQuery, XSLT, XPath implementations hit. But what if you were able to forget about DOMs, forget about materializing in memory the whole XML tree and do XML processing in a purely streaming fashion?
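As a rough illustration of what "purely streaming" means, the following sketch counts closed orders with the JDK's StAX pull parser. It is not how an XQuery processor works internally, and the class name StreamingCount, its methods and the sample document are assumptions made for this example. The point is only that the code sees one parser event at a time and never holds the whole tree.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of XQJ or DataDirect XQuery.
final class StreamingCount {

    // Counts <order status="closed"> elements, one parser event at a time.
    static int countClosedOrders(Reader xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(xml);
        try {
            int closed = 0;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "order".equals(reader.getLocalName())
                        && "closed".equals(reader.getAttributeValue(null, "status"))) {
                    closed++;
                }
            }
            return closed;
        } finally {
            reader.close();
        }
    }

    // Convenience overload for small in-memory samples.
    static int countClosedOrders(String xml) {
        try {
            return countClosedOrders(new StringReader(xml));
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<orders>"
                + "<order id='1' status='closed'><item>a</item></order>"
                + "<order id='2' status='open'><item>b</item></order>"
                + "<order id='3' status='closed'><item>c</item></order>"
                + "</orders>";
        System.out.println(countClosedOrders(xml)); // prints 2
    }
}
```

Passing a FileReader or any other Reader instead of a string keeps memory use roughly constant regardless of document size, at the price of writing the navigation logic by hand. A streaming XQuery processor aims to give you the same behavior while you keep writing declarative queries.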

 

Using a streaming XQuery processor, like DataDirect XQuery, is a good start. But a chain is only as strong as its weakest link: besides the streaming capabilities of your XQuery implementation, the API must also be able to handle those large XML fragments.

 

From an XQuery API perspective, it is crucial that the input to your query can be handled in a streaming fashion. In XQJ Part VIII - Binding external variables we learned how to bind values to external variables declared in an XQuery. By default, a value bound to an XQExpression or XQPreparedExpression using bindXXX() is consumed during the binding process, and the binding stays active and valid for all subsequent execution cycles. We say that XQJ operates in 'immediate binding mode'.
Let's look closely at one of the pipeline examples from the previous post in this series.

[cc lang="java"]
...
XQExpression xqe1;
XQSequence xqs1;

xqe1 = xqc.createExpression();
xqs1 = xqe1.executeQuery("doc('orders.xml')//order");

XQExpression xqe2;
xqe2 = xqc.createExpression();
// bind by variable name (QName is javax.xml.namespace.QName);
// in immediate mode xqs1 is fully consumed during this call
xqe2.bindSequence(new QName("orders"), xqs1);
xqe1.close();

XQSequence xqs2;
xqs2 = xqe2.executeQuery(
    "declare variable $orders as element(*,xs:untyped) external; " +
    "for $order in $orders " +
    "where $order/@status = 'closed' " +
    "return " +
    "  <closed_order>{ " +
    "    $order/* " +
    "  }</closed_order>");
xqs2.writeSequence(System.out, null);
xqe2.close();
...
[/cc]

 

During the bindSequence() call, the complete xqs1 sequence is consumed. We can then safely close the xqe1 expression, freeing up any runtime resources it held. On the other hand, consuming the complete sequence during bindSequence() implies that the XQJ implementation has to buffer the data one way or another for subsequent query evaluations. All this works perfectly fine as long as we're handling relatively small XML instances. But because the data is buffered, the underlying XQuery processor loses every opportunity to take advantage of its streaming capabilities.

 

If you know that the data bound to the external variable will be used for only a single XQuery execution, is there a way to tell the XQJ/XQuery implementation about this optimization opportunity, so that it can use its streaming capabilities?

 

The default binding mode in XQJ is 'immediate', which means the value bound to an external variable is consumed during the bindXXX() method. In addition, an application has the ability to set the binding mode to 'deferred'. With deferred binding mode, the application gives a hint to the XQJ-implementation and underlying XQuery processor, to take advantage of its streaming capabilities. In deferred binding mode, bindings are only active for a single execution cycle. The application is required to explicitly re-bind values to every external variable before each execution.
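The difference between the two modes can be modeled in a few lines of plain Java. This is a toy model, not XQJ: the classes OneShotSource, ImmediateBinding and DeferredBinding are invented for this sketch, and the exceptions they throw are this sketch's choice rather than behavior defined by the XQJ specification. It only illustrates who reads the input, and when.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy model only: none of these classes are part of XQJ.
final class BindingModes {

    // One-shot input, standing in for an XQSequence or an input stream.
    static final class OneShotSource {
        private final Iterator<String> items;
        private boolean closed;

        OneShotSource(List<String> items) { this.items = items.iterator(); }

        private void ensureOpen() {
            if (closed) throw new IllegalStateException("source is closed");
        }
        boolean hasNext() { ensureOpen(); return items.hasNext(); }
        String next() { ensureOpen(); return items.next(); }
        void close() { closed = true; }
    }

    // Immediate mode: drain the source at bind time and keep a buffer.
    static final class ImmediateBinding {
        private final List<String> buffer = new ArrayList<>();

        ImmediateBinding(OneShotSource source) {
            while (source.hasNext()) buffer.add(source.next());
        }
        // The buffered value can be reused for any number of executions.
        int execute() { return buffer.size(); }
    }

    // Deferred mode: keep a reference and read it during execution.
    static final class DeferredBinding {
        private OneShotSource source;

        DeferredBinding(OneShotSource source) { this.source = source; }

        int execute() {
            if (source == null) {
                throw new IllegalStateException("re-bind before executing again");
            }
            int count = 0;
            while (source.hasNext()) { source.next(); count++; }
            source = null; // the binding is valid for one execution only
            return count;
        }
    }

    public static void main(String[] args) {
        List<String> orders = Arrays.asList("o1", "o2", "o3");

        OneShotSource s1 = new OneShotSource(orders);
        ImmediateBinding immediate = new ImmediateBinding(s1);
        s1.close(); // safe: everything was buffered at bind time
        System.out.println(immediate.execute()); // prints 3
        System.out.println(immediate.execute()); // prints 3

        OneShotSource s2 = new OneShotSource(orders);
        DeferredBinding deferred = new DeferredBinding(s2);
        System.out.println(deferred.execute()); // prints 3
        try {
            deferred.execute(); // no re-bind, so this fails
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        s2.close();
    }
}
```

The immediate variant pays for its convenience with a buffer as large as the input. The deferred variant holds nothing, but the caller has to keep the source open until execution has finished and has to bind again before every execution.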

 

You can change the binding mode through the XQStaticContext interface, as shown in the next example. Refer to Part VI in this series for more information on how to manipulate the static context.

 

[cc lang="java"]
...
XQStaticContext xqsc = xqc.getStaticContext();
// change the binding mode
xqsc.setBindingMode(XQConstants.BINDING_MODE_DEFERRED);
// make the changes effective
xqc.setStaticContext(xqsc);
...
[/cc]

 

In deferred mode the application cannot assume that the bound value will be consumed during the invocation of the bindXXX() method. The XQJ implementation is free to read the bound value either at bind time or during the subsequent evaluation and processing of the query results. This has consequences for when resources can be cleaned up. If we consider the first example again, it will not work properly in deferred binding mode, because xqe1 was closed right after calling bindSequence(), possibly before xqs1 had been read. The example needs to be modified as follows:

 

[cc lang="java"]
...
XQExpression xqe1;
XQSequence xqs1;

xqe1 = xqc.createExpression();
xqs1 = xqe1.executeQuery("doc('orders.xml')//order");

XQExpression xqe2 = xqc.createExpression();
// bind by variable name (QName is javax.xml.namespace.QName);
// in deferred mode xqs1 may not have been read yet when this returns
xqe2.bindSequence(new QName("orders"), xqs1);

XQSequence xqs2 = xqe2.executeQuery(
    "declare variable $orders as element(*,xs:untyped) external; " +
    "for $order in $orders " +
    "where $order/@status = 'closed' " +
    "return " +
    "  <closed_order>{ " +
    "    $order/* " +
    "  }</closed_order>");
xqs2.writeSequence(System.out, null);
xqe2.close();
// close xqe1 only after the results of xqe2 have been processed
xqe1.close();
...
[/cc]

 

This example shows how to build a pipeline of XQueries. But deferred binding mode also applies to the other bindXXX() methods. In the next example we show how to bind a StreamSource to the context item. Because the binding mode is deferred, the implementation can handle the query in streaming mode and as such process huge XML documents that don't fit in available memory.

 

[cc lang="java"]
...
XQStaticContext xqsc = xqc.getStaticContext();
// change the binding mode
xqsc.setBindingMode(XQConstants.BINDING_MODE_DEFERRED);
// make the changes effective
xqc.setStaticContext(xqsc);

XQExpression xqe;
XQSequence xqs;

xqe = xqc.createExpression();
// bind the document to the context item; null means no explicit item type
xqe.bindDocument(
    XQConstants.CONTEXT_ITEM,
    new StreamSource("large_orders_document.xml"),
    null);
xqs = xqe.executeQuery("/orders/order");
...
[/cc]

 

To conclude, deferred binding mode requires a little more care than immediate mode, but the potential improvements when querying large XML documents are enormous. Of course, the API needs to provide the necessary functionality, but the heavy lifting is performed in the underlying XQuery processor. This is especially true for DataDirect XQuery, where deferred binding mode lets you take advantage of both XML document projection and XML streaming. This makes it possible to query XML documents in the hundreds of megabytes, even gigabytes!

 

Marc Van Cappellen