Spreadsheet::XLSX and humongous spreadsheets

May 29, 2011

I've been using Spreadsheet::Read backed by Spreadsheet::ParseExcel and Spreadsheet::XLSX to process Excel spreadsheets, and the combination works reasonably well (despite various frustrating problems in retrieving dates).

Today, I wrote code with lots of diagnostics to parse a large spreadsheet I'd never seen before. The code ran properly for a while, and then complained of various inconsistencies in the data. After much digging, I realised that Spreadsheet::XLSX wasn't parsing columns beyond ZZ. LibreOffice showed me data that my Perl code wasn't seeing; but it, too, has an arbitrary 1024 column limit. I had to unzip the XLSX file and read the XML source to find the real number of columns: 1059.

Until today, I'd never imagined a spreadsheet that big either.

The fault lies with the following code in Spreadsheet/XLSX.pm, which assumes that the column name is two letters followed by a number:

foreach ($member_sheet -> contents =~ /(\<.*?\/?\>|.*?(?=\<))/g) {
    if (/^\<c r=\"([A-Z])([A-Z]?)(\d+)\"/) {
        $col = ord ($1) - 65;
                
        if ($2) {
            $col++;
            $col *= 26;
            $col += (ord ($2) - 65);
        }

Fortunately, it's not hard to extend the code to handle longer names:

foreach ($member_sheet -> contents =~ /(\<.*?\/?\>|.*?(?=\<))/g) {
    if (/^\<c r=\"([A-Z])([A-Z]*)(\d+)\"/) {
    $col = ord ($1) - 65;

    my $rest = $2;
    while (my $l = substr ($rest, 0, 1, '')) {
        $col++;
        $col *= 26;
        $col += (ord ($l) - 65);
    }

I sent the patch to Dmitry Ovsyenko, author of Spreadsheet::XLSX, but received no response.