I'm having trouble with a crawler I'm building using rubyXL. It traverses my file system correctly, but it fails with an Errno::ENOENT
error. I've read through the rubyXL code and nothing looks wrong to me. The output and my code are below. Any suggestions?
/Users/.../testdata.xlsx
/Users/.../moretestdata.xlsx
/Users/.../Lab 1 Data.xlsx
/Users/Dylan/.rvm/gems/ruby-1.9.3-p327/gems/rubyXL-1.2.10/lib/rubyXL/parser.rb:404:in `initialize': No such file or directory - /Users/Dylan/.../sheet6.xml (Errno::ENOENT)
from /Users/Dylan/.rvm/gems/ruby-1.9.3-p327/gems/rubyXL-1.2.10/lib/rubyXL/parser.rb:404:in `open'
from /Users/Dylan/.rvm/gems/ruby-1.9.3-p327/gems/rubyXL-1.2.10/lib/rubyXL/parser.rb:404:in `block in decompress'
from /Users/Dylan/.rvm/gems/ruby-1.9.3-p327/gems/rubyXL-1.2.10/lib/rubyXL/parser.rb:402:in `upto'
from /Users/Dylan/.rvm/gems/ruby-1.9.3-p327/gems/rubyXL-1.2.10/lib/rubyXL/parser.rb:402:in `decompress'
from /Users/Dylan/.rvm/gems/ruby-1.9.3-p327/gems/rubyXL-1.2.10/lib/rubyXL/parser.rb:47:in `parse'
from xlcrawler.rb:9:in `block in xlcrawler'
from /Users/Dylan/.rvm/rubies/ruby-1.9.3-p327/lib/ruby/1.9.1/find.rb:41:in `block in find'
from /Users/Dylan/.rvm/rubies/ruby-1.9.3-p327/lib/ruby/1.9.1/find.rb:40:in `catch'
from /Users/Dylan/.rvm/rubies/ruby-1.9.3-p327/lib/ruby/1.9.1/find.rb:40:in `find'
from xlcrawler.rb:6:in `xlcrawler'
from xlcrawler.rb:22:in `<main>'
require 'find'
require 'rubyXL'

def xlcrawler(path)
  count = 0
  Find.find(path) do |file| # iterate over every file under the given directory
    if file =~ /\.xlsx\z/ # check for an .xlsx extension (dot escaped, anchored at the end of the path)
      puts file # confirm the crawler is traversing the file system
      workbook = RubyXL::Parser.parse(file).worksheets # array of all worksheets in the workbook
      workbook.each do |worksheet| # iterate over each worksheet
        data = worksheet.extract_data.to_s # extract the worksheet data; converted to a string so it can be matched against a regex
        if data =~ /regex/
          puts file
          count += 1
        end
      end
    end
  end
  puts "#{count} files were found"
end

xlcrawler('/Users/')
I did some digging through the rubyXL code on GitHub, and it looks like there is a bug in the decompress method:
files['styles'] = Nokogiri::XML.parse(File.open(File.join(dir_path,'xl','styles.xml'),'r'))
@num_sheets = files['workbook'].css('sheets').children.size
@num_sheets = Integer(@num_sheets)
#adds all worksheet xml files to files hash
i=1
1.upto(@num_sheets) do
  filename = 'sheet'+i.to_s # <----- BUG IS HERE
  files[i] = Nokogiri::XML.parse(File.open(File.join(dir_path,'xl','worksheets',filename+'.xml'),'r'))
  i=i+1
end
This block of code makes an assumption about sheet numbering in Excel that is not true. It simply counts the sheets and assumes their files are named sheet1.xml through sheetN.xml with no gaps. However, if you delete a sheet and then create a new one, the numerical sequence is broken.
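One way to avoid that assumption is to list the worksheet files that actually exist instead of counting up from 1. The following is only a minimal sketch, not rubyXL's code: `worksheet_files` is a hypothetical helper, and `dir_path` is assumed to be the directory the workbook was unzipped into, as in the snippet above.

```ruby
# Hypothetical helper (not part of rubyXL): return the worksheet XML files
# that are really present in an unzipped workbook, ordered by the number in
# their file name. Gaps such as a missing sheet6.xml are simply skipped.
def worksheet_files(dir_path)
  pattern = File.join(dir_path, 'xl', 'worksheets', 'sheet*.xml')
  Dir.glob(pattern).sort_by { |path| File.basename(path)[/\d+/].to_i }
end
```

Note that file-name order is not guaranteed to match the tab order in Excel. The mapping between sheets and their files is recorded in xl/_rels/workbook.xml.rels, so a complete fix would probably read that file rather than rely on names.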
If you check your Lab 1 Data.xlsx file, you will see that there is no sheet6. If you pull up the VBA developer window (by pressing Alt + F11), you should see a sheet list whose numbering skips 6.
An arrangement like this defeats the upto loop and causes an exception when i = 6.
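Until the parser is fixed, the crawler can at least keep going past workbooks like this one. Here is a rough sketch of that workaround; `each_xlsx_skipping_failures` is a made-up name, and the block you pass is assumed to do the parsing and matching.

```ruby
require 'find'

# Hypothetical helper: yields every .xlsx path under root to the given block.
# If the block raises Errno::ENOENT (as the parser does for the missing
# sheet file), the path is recorded and the crawl continues.
def each_xlsx_skipping_failures(root)
  failed = []
  Find.find(root) do |file|
    next unless file =~ /\.xlsx\z/
    begin
      yield file
    rescue Errno::ENOENT => e
      warn "skipping #{file}: #{e.message}"
      failed << file
    end
  end
  failed # paths that could not be processed
end
```

In the crawler you would call it as `each_xlsx_skipping_failures('/Users/') { |file| ... }` and move the `RubyXL::Parser.parse(file)` call and the worksheet loop into the block. This only hides the symptom; the affected workbooks are skipped, not searched.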