The first time a query is executed, a significant percent of the total query time is spent preparing the query, which includes query optimization and generating the query plan. Because most programs invoke queries more than once, we spend the time needed to find an efficient way to process the query, even though that makes the first query slower. Fortunately, this effort does not need to be repeated. One way to ensure that a query is prepared only once in your program is to use a prepared query. (Another way is to use Query Pooling, described in the following section).
To parameterize a query, use external variables to change the values used for the query each time it is invoked. For instance, the following query creates a portfolio for a given user:
declare variable $user as xs:string external;
collection('holdings')/holdings[userid=$user]
In this query, $user is an external variable which must be bound before executing the query. The following XQJ code shows how to prepare the query and bind a value to $user.
// Get a connection, prepare the query
XQDataSource dataSource = new DDXQDataSource();
dataSource.setJdbcUrl("jdbc:xquery:sqlserver://server1:1433;databaseName=stocks");
XQConnection connection = dataSource.getConnection();
XQPreparedExpression preparedExpression = connection.prepareExpression(xqueryText);
// Bind variable $user to 'Jonathan' and execute
preparedExpression.bindString(new QName("user"), "Jonathan");
XQSequence xqSequence = preparedExpression.executeQuery();
The following code shows how to bind $user to a different value and execute prepared query again.
// Bind variable $user to 'Minollo' and execute
preparedExpression.bindString(new QName("user"), "Minollo");
xqSequence = preparedExpression.executeQuery();
Benchmarks that measure the speed of a query should mirror the usage you expect in your program. Since most programs execute a query many times, when you design a benchmark you should prepare your query, then execute it many times.
Prepared expressions let the programmer decide when a query should be prepared and when a prepared query is no longer needed. In programs that use hundreds or thousands of queries, it is generally better to let DataDirect XQuery® keep track of which queries are actually being used, preparing them the first time they are executed, and discarding the least frequently used prepared queries if there are too many queries. This is called query pooling.
In DataDirect XQuery®, you enable query pooling by specifying the number of prepared queries to hold in the pool. If you configure your connection by setting properties in the Java API, you can set the MaxPooledQueries property to do this; for instance, the following code sets the size of the query pool to 20:
XQDataSource dataSource =
new
DDXQDataSource();
dataSource.setJdbcUrl("jdbc:xquery:sqlserver:
//server1:1433;
databaseName=stocks");
dataSource.setMaxPooledQueries(20);
XQConnection connection = dataSource.getConnection();
If you configure your connection using a Source Configuration File, you can set the size of the query pool using the maxPooledQueries element:
<?
xml
version
=
"1.0"
encoding
=
"UTF-8"
?>
<
XQJConnection
xmlns
=
"http://www.datadirect.com/xquery"
>
<
maxPooledQueries
>20</
maxPooledQueries
>
<
JDBCConnection
name
=
"stocks"
>
<
url
>jdbc:xquery:sqlserver://localhost:1433</
url
>
<
sqlxmlMapping
>
<
forest
>true</
forest
>
<
identifierEscaping
>none</
identifierEscaping
>
</
sqlxmlMapping
>
</
JDBCConnection
>
</
XQJConnection
>
DataDirect XQuery® provides several ways to query an XML file. In DataDirect XQuery®, you will get better performance, especially for large XML files, if you use fn:doc() to access XML in a query. You will get much worse performance if you parse the XML to create a DOM tree, bind the DOM tree to an external variable, and query the external variable.
In most XQuery processors, a large percent of the time needed to query a large XML file is spent parsing the file and creating an in-memory representation that can be queried. The in-memory representation may be many times the size of the original XML file, and Java VM out of memory errors may occur at run time. When your query addresses an XML file using fn:doc(), DataDirect XQuery® uses a technique known as Document Projection, which allows it to create only the part of a document needed by the query. This results in dramatic improvements in memory usage and scalability, and significant improvements in performance. Using document projection, you can generally query documents many times the size of available memory. Incidentally, if your application uses XML Deployment Adapters, it addresses these adapters using fn:doc(), and DataDirect XQuery® uses Document Projection appropriately for these adapters just as it does for XML documents.
Document projection uses the path expressions in a query to determine what parts of a document need to be built. Path expressions that use wildcards ('*') provide less information for document projection, so they should be avoided when querying large documents. For instance, in the examples below, the second path expression will perform much better for large documents.
doc(
"scenario.xml"
)
//*[role='teacher']
doc(
"scenario.xml"
)/scenario/people/person[role=
'teacher'
]
If you parse a document to create a DOM tree, then bind the DOM tree to an external variable, you bypass DataDirect's Document Projection. This forces your program to create the entire in-memory representation of your XML document — not just the part that DataDirect XQuery® needs to process your queries. In addition to the savings due to Document Projection, DataDirect's internal representation of an XML document is much more efficient than most DOM implementations, so using fn:doc() gives you two substantial optimizations that you lose if you parse the document yourself.
DataDirect XQuery® also supports a second technique known as Streaming. Streaming processes a document sequentially, discarding portions of the document that are no longer needed to produce further query results. This reduces memory usage because only the portion of a document needed at a given stage of query processing is instantiated in memory. Streaming can reduce memory consumption, but generally involves a (relatively minor) performance penalty. Unlike document projection, streaming is not generally a performance win. Also, some queries require much of the document to be instantiated at a given time, diminishing the benefits of streaming. For some queries, however, streaming can be a substantial win, especially if the query results exceed available memory; therefore, we allow streaming to be enabled, but we do not enable it by default.
To enable streaming, set the ddtek:xml-streaming option to 'yes' in the XQuery Prolog:
declare option ddtek:xml-streaming
'yes'
;
You can also enable streaming at the DataSource level, which means streaming will be used for all queries that use the DataSource. One way to do this is to set the Pragmas property of the DDXQDataSource class.
XQDataSource dataSource =
new
DDXQDataSource();
dataSource.setJdbcUrl("jdbc:xquery:sqlserver:
//server1:1433;
databaseName=stocks");
dataSource.setPragmas(
"xml-streaming=yes"
);
Another way to enable streaming at the DataSource level is to specify the declaration option using a pragma element in a Source Configuration File:
<?
xml
version
=
"1.0"
encoding
=
"UTF-8"
?>
<
XQJConnection
xmlns
=
"http://www.datadirect.com/xquery"
>
<
maxPooledQueries
>20</
maxPooledQueries
>
<
pragma
name
=
"xml-streaming"
>yes</
pragma
>
<
JDBCConnection
name
=
"stocks"
>
<
url
>jdbc:xquery:sqlserver://localhost:1433</
url
>
<
sqlxmlMapping
>
<
forest
>true</
forest
>
<
identifierEscaping
>none</
identifierEscaping
>
</
sqlxmlMapping
>
</
JDBCConnection
>
</
XQJConnection
>
DataDirect XQuery® is carefully designed to allow very large query results to be retrieved efficiently, and to allow incremental instantiation of query results, so that memory can also be used efficiently. If your query's results can be large, stream the results to an OutputStream or use SAX or StAX, so that retrieving the results does not introduce issues with memory or performance. Retrieving large query results as DOM can use large amounts of memory and significantly increase the time needed for processing queries.
If you want to write your query results to an output stream or a file, the easiest way is to use the writeSequence() method of the result set to write directly to the stream, as shown in the following code.
XQSequence sequence = expression.executeQuery(query);
OutputStream os = new FileOutputStream("C:/temp/results.xml");
sequence.writeSequence(os);
Many XML programs process XML using SAX, and DataDirect XQuery® can write query results directly to a SAX event handler, as shown in the following code.
CustomSAXEventHandler eventHandler = new CustomSAXEventHandler();
sequence.writeSequenceToSAX(eventHandler);
For many kinds of programming, pull processing using StAX is both convenient and efficient. You can get a StAX stream from a result sequence using the getSequenceAsStream() method:
XMLStreamReader reader = sequence.getSequenceAsStream();
When a database is queried, DataDirect XQuery® usually pushes queries to the database, where they are executed in SQL. However, some expressions, functions and casts are not supported by SQL, and are therefore performed in program memory, a strategy known as compensation. Unfortunately, when these operations are performed in main memory, they can not take advantage of indexes and other optimizations available in the database server. This can dramatically hurt performance.
The DataDirect XQuery® User's Guide and Reference clearly states which expressions, functions, and operators are compensated. Most of the commonly used functions that are compensated fall into the following categories:
When querying relational data, if a compensated function or operator is used in the WHERE clause of a FLWOR expression or in a predicate of a path expression, performance can suffer dramatically. For instance, in the following query, normalize-space() is compensated, and forces evaluation in memory; since this function must be called once for every row, the performance hit is very significant.
for $h in collection('stocks.dbo.historical')/historical
where normalize-space($h/ticker) = 'AMZN'
return $h
One way to avoid compensation is to use a SQL function instead of a compensated XQuery function, as in the following example.
declare
function
ddtek-sql:rtrim($
in
as
xs:string)
as
xs:string external;
for
$h
in
collection(
'historical'
)/historical
where
ddtek-sql:rtrim($h/ticker) =
'AMZN'
return
$h
When querying relational data, using numeric predicates or an AT clause in a FLWOR expression forces compensation, which can dramatically impact performance, especially when it forces a join to be done in memory. Consider the following query.
for
$
user
at
$i
in
collection(
"users"
)/users,
$hold
at
$j
in
collection(
"holdings"
)/holdings
return
<holding>
<userid>{ $i }</userid>
<
user
>{ string($
user
/userid) }</
user
>
<stockid>{ $j }</stockid>
<stock>{ string($hold/stockticker) }</stock>
<share>{ string($hold/shares }</share>
</holding>
This query uses AT clauses to identify the individual users and stocks, and these AT clauses force DataDirect XQuery® to do the joins in memory instead of letting the database engine do them. Not only is this a dramatic performance hit, it's also a bad way to identify the users and holdings — because rows in a relational table do not generally have a guaranteed stable order, these identifiers might change each time the query is run. It is generally much better to use the primary key from a table to identify the rows of that table.
(:
-------------------------------------------------------------------
: userid
is
the
primary
key
for
collection(
"users"
)
: stockticker, userid
is
the
primary
key
for
collection(
"holdings"
)
--------------------------------------------------------------------:)
for
$
user
in
collection(
"users"
)/users,
$hold
in
collection(
"holdings"
)/holdings
return
<holding>
<
user
>{ string($
user
/userid) }</
user
>
<stock>{ string($hold/stockticker) }</stock>
<share>{ string($hold/shares }</share>
</holding>
The algorithms DataDirect XQuery® uses to create hierarchies are optimal for creating large, complex documents, but still good for smaller documents with shallower hierarchies. If query results contain four or fewer levels of hierarchy, and your performance seems inadequate, you will sometimes get better performance if you tell DataDirect XQuery® to use either outer join or sorted outer union (see the DataDirect 2.0 User's Guide and Reference for details on these algorithms). There are several ways to specify SQL generation algorithms in DataDirect XQuery®:
The algorithm can be specified via the XQJ API:
XQDataSource ds = new DDXQDataSource();
ds.setJdbcPragmas(
"sql-rewrite-algorithm=nested-loop"
);
The algorithm can be specified with an entry in the source configuration file:
<pragma
name
=
"sql-rewrite-algorithm"
>
outer
-
join
</pragma>
The algorithm can be specified in the XQuery prolog with a declaration option:
declare
option
ddtek:sql-rewrite-algorithm
"outer-join"
;
A common challenge in XML development is splitting a large XML document in several smaller ones. In some scenarios such huge XML documents are simply unmanageable, requiring the need to split XML.
Consider one of the Shakespeare plays marked up in XML, for example Hamlet. We want to split XML for this play, each speech in a separate XML document.
for $speech at $i in
/PLAY//SPEECH
let $url := concat("C:/SHAKESPEAR/",
string-join($speech/SPEAKER,"_"),
$i, ".xml")
return ddtek:serialize-to-url($speech, $url,
"omit-xml-declaration=no,indent=yes")
It outputs 1138 files in the C:\SHAKESPEAR
directory.
Or you might need to generate an HTML file for each speech, and have a document referencing them all. In the next example, each of the speech HTML files are created through ddtek:serialize-to-url
, the query result is the master HTML referencing the others.
<
html
>{
for $speech at $i in
/PLAY//SPEECH
let $url := concat("C:/SHAKESPEAR/",
string-join($speech/SPEAKER,"_"),
$i, ".html")
let $htmlspeech := <
html
>
<
b
>{$speech/SPEAKER/text()}</
b
>:<
br
/>{
for $line in $speech/LINE
return
($line/text(),<
br
/>)
}</
html
>
return
(
ddtek:serialize-to-url($htmlspeech, $url, "method=html"),
<
a
href
=
"{$url}"
>{
$speech/SPEAKER/text()
}</
a
>,
<
br
/>
)
}</
html
>
Using the ddtek:serialize-to-url
function it becomes very natural to split XML documents in multiple ones. And of course, all this taking advantage of DataDirect XQuery's XML Streaming capabilities, enabling support for huge XML documents.
In this tutorial you learned how to split XML into multiple files using XQuery. It's easy and you can split an XML file yourself, download a free trial of DataDirect XQuery today.
DataDirect XQuery® was designed for performance and scalability. It does sophisticated query optimization, supports prepared queries and query pooling, and efficiently queries large XML documents using document projection and streaming. If you follow these few simple guidelines, your DataDirect XQuery® applications will usually have good performance: