Tags: ruby, pdf, pdf-reader

Parsing PDF removing month


I'm parsing a PDF that contains some dates by splitting its text into lines and then searching them. The following are example lines:

Posted Date: 02/11/2015
Effective Date: 02/05/2015

When I find Posted Date, I split on the : and pull out 02/11/2015. But when I do the same for Effective Date, it only returns /05/2015. When I print all the lines, that date shows up as /05/2015, even though the PDF has the 02. Would 02 be converted to nil for some reason? Am I missing something?

require 'pdf-reader'

# `reader` is assumed to be a PDF::Reader instance opened on the document.
lines = reader.pages[0].text.split(/\r?\n/)
lines.each_with_index do |line, index|
  values_to_insert = []
  if line.include?("Legal Name:")
    name_line = line.split(":")
    values_to_insert.push(name_line[1])
  end
  if line.include?("Active/Pending Insurance")
    # The fields of interest sit at fixed offsets below the section heading.
    top_line = lines[index + 2].split(" ")
    middle_line = lines[index + 5].split(" ")
    insurance_line = lines[index + 7]
    insurance_line.split(" ").each do |word|
      if word.include?("Insurance")
        values_to_insert.push(insurance_line.split(":")[1])
      end
    end
    top_line.each_with_index do |word, i|
      if word.include?("Posted")
        # Tokens are "Posted", "Date:", then the date itself.
        values_to_insert.push(top_line[i + 2])
      end
    end
    middle_line.each_with_index do |word, i|
      if word.include?("Effective") || word.include?("Cancellation")
        #puts middle_line[0]
        puts middle_line[1]
        #puts middle_line[i + 1].split(":")[1]
      end
    end
  end
end

Here is what happens when I print all lines:

Active/Pending Insurance:

   Form:  91X               Type: BIPD/Primary                Posted Date: 02/11
/2015

   Policy/Surety Number:A 3491819            Coverage From:                $0
To:       $1,000,000
   Effective Date:/05/2015                 Cancellation Date:

  Insurance Carrier: PROGRESSIVE EXPRESS INSURANCE COMPANY

         Attn: CUSTOMER SERVICE
     Address:  P. O. BOX 94739
               CLEVELAND, OH 44101 US

    Telephone: (800) 444 - 4487   Fax: (440) 603 - 4555

Edited to show the code and add a picture. I split the text into lines and then split again on colons and sometimes on spaces. It's not amazingly clean, but I don't think there's a much better way.


Solution

  • The problem occurs at positions where multiple pieces of text are on the same line but do not share exactly the same baseline. In the PDF at hand, (at least) the policy number and the effective date are positioned slightly higher than their respective labels.

    The cause is the way the pdf-reader library used by the OP assembles the text pieces drawn on the page:

    • It determines the number of columns and rows in which to arrange the letters and
    • creates an array with one string per row, each filled with as many spaces as there are columns.
    • It then combines consecutive text pieces from the PDF that lie on exactly the same baseline and
    • finally writes these combined text pieces into the string array, starting at the position that best matches their starting position in the PDF.

    As fonts used in PDFs are usually not monospaced, this procedure can result in overlapping strings, i.e. in one of the two being (partially) erased. The step that combines strings on the same baseline prevents erasure in that case, but for strings on slightly different baselines the overlap can still occur.
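    The erasure can be reproduced with a much simplified model of such a grid layout. This is only a sketch, not pdf-reader's actual code: the Run struct, the render_row helper and the column numbers are made up for illustration.

```ruby
# Simplified model of a grid-based text layout (illustrative only, not
# pdf-reader's source). Run, render_row and all column numbers are hypothetical.
Run = Struct.new(:col, :text)

# Writes every run into a fixed-width row of spaces, in the order given.
# A later run overwrites whatever an earlier run left in the same cells.
def render_row(runs, col_count)
  row = " " * col_count
  runs.each do |run|
    row[run.col, run.text.length] = run.text
  end
  row.rstrip
end

# The value is written first. The label is written afterwards and needs
# 15 cells (columns 3 to 17), so it covers the value's first two characters.
runs = [
  Run.new(16, "02/05/2015"),
  Run.new(3, "Effective Date:")
]

puts render_row(runs, 60)
```

    With these made-up positions the row comes out as "   Effective Date:/05/2015", the same shape as the OP's output. If the value started at column 18 instead, nothing would be overwritten.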

    What one can do is increase the number of columns used here.

    The library in page_layout.rb defines

    def col_count
      @col_count ||= ((@page_width  / @mean_glyph_width) * 1.05).floor
    end
    

    As you can see, a magic number, 1.05, is already used to slightly increase the number of columns. Increasing this number further should prevent erasures like the one the OP observed. The factor should not be increased too much, though, because that can introduce unwanted space characters where none belong.

    The OP reported that increasing the magic number to 1.10 sufficed in his case.
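
    To get a feel for how small the change is, the formula quoted above can be evaluated for both factors. This is a rough sketch: the page width, the mean glyph width and the x-coordinate are invented values, and start_col is a hypothetical helper that assumes equally wide columns across the page.

```ruby
# Same formula as the col_count method quoted above, with the instance
# variables turned into parameters. All numeric inputs below are made up.
def col_count(page_width, mean_glyph_width, factor)
  ((page_width / mean_glyph_width) * factor).floor
end

# Hypothetical helper: the grid column in which a text piece starting at
# x-coordinate x would begin, assuming equally wide columns (a simplification).
def start_col(x, page_width, cols)
  (x / (page_width / cols)).round
end

page_width = 612.0
mean_glyph_width = 6.0
x = 100.0

[1.05, 1.10].each do |factor|
  cols = col_count(page_width, mean_glyph_width, factor)
  puts "factor #{factor}: #{cols} columns, x=#{x} starts at column #{start_col(x, page_width, cols)}"
end
```

    With these numbers the grid grows from 107 to 112 columns and the text piece moves from column 17 to column 18. A one-column shift like this can be enough to clear a label that ends at column 17, and it also shows why a much larger factor would start to pull neighbouring pieces apart.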