I'm parsing a PDF that contains some dates by splitting its text into lines and then searching them. The following are example lines:
Posted Date: 02/11/2015
Effective Date: 02/05/2015
When I find Posted Date, I split on the : and pull out 02/11/2015. But when I do the same for Effective Date, it only returns /05/2015. When I write out all lines, that date is displayed as /05/2015 while the PDF has the 02. Would 02 be converted to nil for some reason? Am I missing something?
lines = reader.pages[0].text.split(/\r?\n/)
lines.each_with_index do |line, index|
  values_to_insert = []
  if line.include? "Legal Name:"
    name_line = line.split(":")
    values_to_insert.push(name_line[1])
  end
  if line.include? "Active/Pending Insurance"
    topLine = lines[index+2].split(" ")
    middleLine = lines[index+5].split(" ")
    insuranceLine = lines[index + 7]
    insurance_line_split = insuranceLine.split(" ")
    insurance_line_split.each_with_index do |word, i|
      if word.include? "Insurance"
        values_to_insert.push(insuranceLine.split(":")[1])
      end
    end
    topLine.each_with_index do |word, i|
      if word.include? "Posted"
        values_to_insert.push(topLine[i + 2])
      end
    end
    middleLine.each_with_index do |word, i|
      if word.include? "Effective" or word.include? "Cancellation"
        #puts middleLine[0]
        puts middleLine[1]
        #puts middleLine[i + 1].split(":")[1]
      end
    end
  end
end
Here is what happens when I print all lines:
Active/Pending Insurance:
Form: 91X Type: BIPD/Primary Posted Date: 02/11
/2015
Policy/Surety Number:A 3491819 Coverage From: $0
To: $1,000,000
Effective Date:/05/2015 Cancellation Date:
Insurance Carrier: PROGRESSIVE EXPRESS INSURANCE COMPANY
Attn: CUSTOMER SERVICE
Address: P. O. BOX 94739
CLEVELAND, OH 44101 US
Telephone: (800) 444 - 4487 Fax: (440) 603 - 4555
Edited to show the code and even add a picture. I'm splitting by lines and then splitting again on colons and sometimes spaces. It's not amazingly clean but I don't think there's a much better way.
The problem occurs at positions where multiple pieces of text are on the same line but do not share exactly the same baseline. In the PDF at hand, (at least) the policy number and the effective date are positioned slightly higher than their respective labels.
The cause for this is the way the pdf-reader library used by the OP brings together the text pieces drawn on the page:
As fonts used in PDFs usually are not monospaced, this procedure can result in overlapping strings, i.e. one string erasing part of the other. The step combining strings on the same baseline prevents erasure in that case, but for strings on slightly different baselines, this overlapping effect can still occur.
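The erasure can be illustrated with a toy model. This is only a sketch and not pdf-reader's actual code: the `place` and `render` helpers, the x positions, and the page width are all made up for the illustration. It assumes each text piece is written into a fixed-width row of characters at a column derived from its x position, with later pieces overwriting earlier ones.

```ruby
# Toy model only: NOT pdf-reader's implementation. It shows how writing
# strings into a fixed-width character row can erase part of another string.

# Write text into row starting at column col, overwriting what is there.
def place(row, text, col)
  row[col, text.length] = text
  row
end

# pieces: array of [x_position, text]; all numbers are integers, so
# x * col_count / page_width is floor division.
def render(pieces, col_count, page_width)
  row = " " * col_count
  pieces.each do |x, text|
    col = x * col_count / page_width
    place(row, text, col)
  end
  row.strip
end

# Hypothetical positions: the date is drawn first, its label afterwards.
pieces = [[175, "02/05/2015"], [40, "Effective Date:"]]

puts render(pieces, 60, 600)  # prints "Effective Date:/05/2015"
puts render(pieces, 66, 600)  # prints "Effective Date:02/05/2015"
```

With 60 columns the 15-character label ends on top of the first two characters of the date; with 66 columns the date starts two columns further right and survives intact.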
What one can do is increase the number of columns used here.
The library in page_layout.rb defines
def col_count
  @col_count ||= ((@page_width / @mean_glyph_width) * 1.05).floor
end
As you can see, a magic number 1.05 is already in use to slightly increase the number of columns. Increasing this number further should prevent erasures like the one observed by the OP. One should not increase the factor too much, though, because that can introduce unwanted space characters where none belong.
The OP reported that increasing the magic number to 1.10 sufficed in his case.
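To get a feeling for what the factor does, here is a minimal sketch that mirrors the col_count formula quoted above in a stand-alone class. `LayoutSketch`, its constructor, and the page and glyph widths are hypothetical and chosen only for illustration; this is not the library's class.

```ruby
# Hypothetical stand-in mirroring the quoted col_count formula, with the
# magic number turned into a parameter. Not part of pdf-reader.
class LayoutSketch
  def initialize(page_width, mean_glyph_width, factor = 1.05)
    @page_width = page_width
    @mean_glyph_width = mean_glyph_width
    @factor = factor
  end

  def col_count
    @col_count ||= ((@page_width / @mean_glyph_width) * @factor).floor
  end
end

# Made-up numbers: a 612 pt wide page and a mean glyph width of 6 pt.
puts LayoutSketch.new(612.0, 6.0).col_count        # prints 107
puts LayoutSketch.new(612.0, 6.0, 1.10).col_count  # prints 112
```

Under these assumed numbers the larger factor adds only five columns across the whole page, which fits the observation that a small increase can be enough to separate two strings that overlapped by a couple of characters. How the change is applied in a real project (editing a local copy of page_layout.rb or overriding the method) depends on the setup.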