Home Services Partners Company
XQuery your office documents

XQuery your office documents

September 18, 2007 0 Comments

With the introduction of the Office Open XML in Microsoft Office 2007 and OpenDocument Format used by OpenOffice applications, your office documents like spreadsheets and word processing documents, become consumable with XQuery.

Technically these formats are basically a bunch of XML files, packaged in a ZIP file. Here is for example an Office Open XML .docx file opened with winzip,

Although the format is different, the OpenDocument Format is conceptually similar, a bunch of XML files packaged in a ZIP.

How can you query the XML files inside the .docx file? After all it’s fairly simple with with DataDirect XQuery using the standard fn:doc XQuery function. fn:doc has a single argument, the URL identifying the XML document to query. Amongst the standard URL schemes like file: and http:, your Java virtual machine also support the jar: URL scheme.

A jar archive is considered "a zip archive with logical extensions". The "logical extensions" being special files like manifest.mf or the META-INF directory located in the archives. But physically these are just zip archives and as such we can use the jar: URL scheme to query Office Open XML documents. For example, to query the main document from our example above, use the following fn:doc call,

[cc lang="xquery"]doc('jar:file:///C:/example.docx!/word/document.xml')[/cc]

Let’s take a concrete example, and extract all the comments from John, ordered by date. The Office Open XML format stores the comments in the comments.xml document(what else would you have expected?).

[cc lang="xquery"]declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"; { for $comment in doc('jar:file:///C:/example.docx!/word/comments.xml')/*/w:comment where $comment/@w:author = 'John' order by xs:dateTime($comment/@w:date) return {$comment//text()} }[/cc]

Well, I guess you get the idea...

It’s not only about simply querying your office documents. Using the out-of–the-box DataDirect XQuery and XML Converters features, a whole new range of capabilities and use cases become available. For example, extract data out of 'document forms' and save it in your database. Or extract and aggregate data out of a spreadsheet to generate EDI messages.

We have shown how to query your the OpenDocument Format and Office Open XML documents. In a future post, we’ll show how the "older" office formats can be queried using a Custom URI Resolver.
Imagine you can query your existing Excel files using DataDirect XQuery. Stay tuned!

digg_skin = 'compact'; digg_url = 'http://www.xml-connection.com/2007/09/xquery-your-office-documents.html';

Marc Van Cappellen

View all posts from Marc Van Cappellen on the Progress blog. Connect with us about all things application development and deployment, data integration and digital business.

Comments are disabled in preview mode.