A few weeks ago I blogged about XQuery your office documents. We can query our Office Open XML and OpenDocument Format documents, because they are XML based. But what about older formats? For example, there are a zillion of Excel 2003 spreadsheets, and they will be around for another few years. Wouldn't it be great if we can query those, just as we can query OOXML and ODF documents through DataDirect XQuery?
JAXP defines the URIResolver interface. A URIResolver turns a URI into a 'virtual' XML document. The concept of URIResolver is supported by most Java-based XPath, XSLT and XQuery implementations and thus also by DataDirect XQuery.
URIResolvers allow you to the query any proprietary format through XQuery, as long as you go through the effort of some Java coding to transform the legacy format to XML.
There are also products using the URIResolver interface to make non-XML data standards available to the XML eco-system. The XML Converters, for example, allow to query many non-XML data formats like the EDI standards X12, EDIFACT, EANCOM, HL7, etc, as well as dBASE, CSV, JSON and many others.
We're going to use the concept of URIResolvers to query our Excel 2003 spreadsheets. But of course, we need a Java implementation to read these XLS files. Apache POI is such Java API to access Microsoft file formats, including Excel 2003 documents.
The advantages of Apache POI for Excel are,
So we have written a URIResolver to read Excel 2003 files. The xqexcel.jar file is available here, and needs to be added to your CLASSPATH. You also need two Apache POI 3.0.1 jar files, poi-3.0.1-FINAL-20070705.jar and poi-scratchpad-3.0.1-FINAL-20070705.jar. The Apache POI distribution can be downloaded here.
Enabling the ExcelURIResolver in DataDirect XQuery is trivial. If you are using the XQJ API, you can simply register the ExcelURIResolver through your DDXQDataSource.
DDXQDataSource ddds = new DDXQDataSource();
Using the DataDirect XQuery command-line utility, all you need to do is adding the -r option specifying the class name of the Excel URIResolver, com.ddtek.xquery.excel.ExcelURIResolver.
And you're all set to query Excel 2003 documents through the fn:doc() function. Use the excel: URI scheme, specifying the file name of the .XLS
[cc lang="java"]fn:doc('excel:C:/my office documents/sales2007.xls')[/cc]
An Excel 2003 document is called a workbook and can contain several sheets, each sheet is a grid of cells. Our ExcelURIResolver makes the following information available through the virtual XML document.
As an example, consider a sample Excel file, ciscoexpo.xls, from Microsoft's web site. The file consists of one sheet called Sheet1 and it looks as follows.
The columns A and B (Year and Sales) contain plain data (numbers) and the columns C and D (Predication and Ratio) contain formula's. Cell C5 is for example =58.552664*EXP(0.569367*A5) and D5 is =C5/C4.
When we query the complete document,
[cc lang="java"]fn:doc('excel:C:/my office documents/ciscoexpo.xls')[/cc]
We get the following virtual XML document.
Accessing cell B5 to B7 in XQuery world,
[cc lang="xquery"]let $xls := fn:doc('excel:C:/my office documents/ciscoexpo.xls')
let $sheet := $xls/workbook/sheet[@name="Sheet1"]
ConclusionWe have shown how to open data locked up for years in your Excel spreadsheets. We can now query this virtual XML document like we can with any other XML document, opening a wide range of use cases.
But also, the concept of URI Resolver is powerful, and allows you basically to query any proprietary data through XQuery.
View all posts from Marc Van Cappellen on the Progress blog. Connect with us about all things application development and deployment, data integration and digital business.
Copyright © 2018 Progress Software Corporation and/or its subsidiaries or affiliates.
All Rights Reserved.
Progress, Telerik, and certain product names used herein are trademarks or registered trademarks of Progress Software Corporation and/or one of its subsidiaries or affiliates in the U.S. and/or other countries. See Trademarks for appropriate markings.