Fetching cells, ranges and tables from MS Excel .xlsx spreadsheets

Fetching cells, ranges and tables from MS Excel .xlsx spreadsheets

by Minollo
January 09, 2009 0 Comments

Now and then I need to write applications that use data stored in Excel spreadsheets. As Marc has already discussed in other posts, the new Open Office XML format used by recent releases of Excel, Word and Powerpoint opens the door to the possibility of using XML tools to manipulate Office documents without going through the pain of looking for and using conversion utilities.
Recently I had to solve a problem involving fetching a reference table from a spreadsheet; to better understand how you can fetch data from spreadsheets serialized as OOXML, I searched for some documentation describing the structure of OOXML files, which is not entirely trivial. I ended up finding this useful entry in MSDN which describes how to directly access OOXML documents to accomplish a handful of operations using VB or C#, and in particular how to retrieve cell values.
Given a OOXML spreadsheet, a worksheet name and a cell ID, retrieving the cell value involves basically the following steps:

  • Given the workbook, find the worksheet id corresponding to the specified name
  • Given the worksheet id, find the corresponding worksheet document
  • Give the worksheet document, find the cell corresponding to the requested col/row pair
  • Given the cell, either retrieve the cell value directly if available or lookup the cell value in a document containing a list of shared/pooled strings

As you can see, what you would consider a relatively simple task is not that simple. The MSDN article provides solutions in VB and C#; you can see how the code looks like here.
Of course you can solve the same problem using XQuery; and, as you might (should?) expect, doing that in XQuery is way simpler - considering we are manipulating and navigating XML documents: [cc lang="xquery"]declare namespace ooxl = ""; declare namespace r = ""; declare namespace rp = "";

declare function local:get-cell($spreadsheetURI as xs:string, $sheetName as xs:string, $cell as xs:string) { let $stringsDoc := doc(concat("zip:", $spreadsheetURI, "!/xl/sharedStrings.xml")) let $workbookDoc := doc(concat("zip:", $spreadsheetURI, "!/xl/workbook.xml")) let $relsDoc := doc(concat("zip:", $spreadsheetURI, "!/xl/_rels/workbook.xml.rels")) let $sheetFileName := $relsDoc/rp:Relationships/rp:Relationship[@Id = $workbookDoc/ooxl:workbook/ooxl:sheets/ooxl:sheet[@name=$sheetName]/@r:id]/@Target let $sheetDoc := doc(concat("zip:", $spreadsheetURI, "!/xl/", $sheetFileName)) let $cellNode := $sheetDoc/ooxl:worksheet/ooxl:sheetData/ooxl:row/ooxl:c[@r = $cell] return if ($cellNode/@t="s") then $stringsDoc/ooxl:sst/ooxl:si[$cellNode/ooxl:v+1]/ooxl:t/string() else if ($cellNode/@t="b") then if ($cellNode/ooxl:v = 1) then "TRUE" else "FALSE" else $cellNode/ooxl:v/string()}; [/cc]
You can test the function with an expression like: [cc lang="xquery"]local:get-cell("file:///c:/mySpreadsheet.xlsx", "2008Data", "B2")[/cc]
Encouraged by this simple test, I spent some more time trying to achieve what my original goal was: retrieving tables and ranges in my spreadsheet. The result is the XQuery attached to this post, that I developed and tested against DataDirect XQuery; in addition to the get-cell() function, there are two top-level functions that I found particularly useful:[cc lang="xquery"] declare function local:get-table($spreadsheetURI as xs:string, $sheetName as xs:string, $tableName as xs:string) as element(); declare function local:get-range($spreadsheetURI as xs:string, $sheetName as xs:string, $range as xs:string) as element(); [/cc]
Both functions return data in a simple XML format controlled by the get-range-impl() function; you can change it the way you like, of course.
If you have the patience to go through the XQuery, you'll notice that I had to create a few interesting functions to convert numeric col/row values to/from the typical letters/numbers notation used in spreadsheets ("AB21" -> 28,21; 512,99 -> "SR99"); that's probably the main difference between the functions manipulating ranges/tables and the one manipulating single cells is. I'm sure there are more intelligent and performant ways to obtain a similar result (maybe playing with regular expressions), but this was good enough starting point for me.
Hopefully you'll find these XQuery functions as useful as I did; and hopefully you'll find some bugs I missed ;)



View all posts from Minollo on the Progress blog. Connect with us about all things application development and deployment, data integration and digital business.

Comments are disabled in preview mode.