r regex pdf export-to-excel

Reading a specification document (PDF) with paragraphs and tables into a spreadsheet


My engineering team is gearing up to bid on a public project whose specifications document is huge (~500 pages). I would like to break it down clause by clause in a spreadsheet and then assign each team its relevant "portion". I checked, and a PDF document is the only format in which these specs are provided.

The idea is to record it so that we can compare it with the specifications of previous projects that were recorded in a similar manner. I am still a trainee, so I am not aware of how this process works at other companies, but in my team the last project was documented manually in this way.

The pages are arranged in indexed paragraphs (as 1, 1.1, 1.1.1 etc) with some tables and figures thrown around.

I hope to get a table like this:

Clause No.  Clause Para
1.1.        Lorem Ipsum
1.2.        Lorem Ipsum

I asked on PM Stack Exchange whether anyone knew of a software suite for this, but there do not seem to be many.

So I turned to R, hoping I could solve this by parsing the PDF with pdftools and a regex. When I test the regex on regex101.com it works to some extent (it picks up a few paragraphs, but fails when it encounters a table), yet it does not return the same result when used in R.

I have no commitment to R; it was simply easily available on my work laptop. I am willing to try Python or any other toolkit as well.

So far, I have been stuck on getting R to extract even a single paragraph.

library(pdftools)
library(dplyr)
library(stringr)
library(purrr)

setwd("The work Dir/")
specDoc <- pdf_text("Spec Doc.pdf") %>% strsplit(split = "\n")
specDocChar <- as.character(specDoc)

get_clause <- str_trim(str_extract(specDocChar, "(?:^\n*(?:\\d\\.(?:\\d\\.)*)+)(.+?)$"))

get_clause

I also tried a lookbehind, but it does not seem to work with prefixes of variable length.

At this point I mainly want to know two things:

  1. What am I doing incorrectly that leaves me with a blank output?
  2. Is there a more efficient way to tackle this particular problem? After the paragraphs, I am not sure how to handle the tables within a paragraph, and the paragraphs alone are already taking too much time.

A sample of how the page looks

Expected Output


Solution

  • As I've mentioned in our discussion/chat, this will be difficult and certainly imperfect.

    I've tried running your sample PDF through two automatic extractors, and they both produced the same text, which completely loses the original structure:

     1. 
     1.1. 
    
     1.1.1. 
    
     1.1.2. 
    
     Lorem ipsum dolor sit amet consectetur adipiscing elit. 
     Pellentesque   a   sodales   arcu,   sed  feugiat  nibh.  Pellentesque  at  fermentum  odio,  a  molestie 
     lorem. Ut eleifend sagittis porta. 
     Integer   sit   amet   consectetur   erat.   Duis   sit   amet   urna   quam.   Pellentesque   turpis   tortor, 
     porttitor   eget  egestas  in,  tristique  in  urna.  Class  aptent  taciti  sociosqu  ad  litora  torquent  per 
     conubia  nostra,  per  inceptos  himenaeos.  Etiam  eleifend  tincidunt  volutpat.  Curabitur  eu  enim 
     viverra,  condimentum  ex  in,  elementum  est.  Integer  blandit  arcu  ex,  at  interdum  orci  viverra 
     in.
    

    Now, the real PDF may be composed differently and the extractors may do better (🤞). But trying to move on...

    The best I could do with that LoremIpsumSpecs.pdf sample was to open it in Acrobat Reader, choose Edit → More → Select All, then copy-paste into a text editor to get something like the following:

    Specification for Project
    1. Lorem ipsum dolor sit amet consectetur adipiscing elit.
    1.1. Pellentesque a sodales arcu, sed feugiat nibh. Pellentesque at fermentum odio, a molestie
    lorem. Ut eleifend sagittis porta.
    1.1.1. Integer sit amet consectetur erat. Duis sit amet urna quam. Pellentesque turpis tortor,
    ...
    quis purus. Cras vitae dui fringilla libero posuere varius at et velit.
    Specs That Those
    Spec 1 High 2.1m
    Spec 2 Low 0
    Nunc magna urna, sagittis sit amet interdum quis, finibus non dui. In pharetra risus tincidunt
    ...
    3. Nunc eget maximus dolor. Integer orci purus, ultrices quis fringilla sit amet, blandit non erat.
    ...
    

    which preserves the structure of the section numbers and paragraphs, as well as the table.

    Does that resemble the text you're getting in your R script?

    If so, I would avoid trying to write one RegEx to capture a "paragraph". Instead, try to iterate the text line-by-line and use a little state machine to collect lines for every section number that's seen.

    Here's what I came up with, in Python:

    import re
    
    # Expect that section numbers delimit requirements.  Look for a section number to be:
    #  line-start, followed by one or more groups of digits and a period, followed by an optional space
    #  e.g.: '1. ', '1.1.2. ', '1.10. ', '1.9.9.9.9.9. '
    Sect_no = re.compile(r"^(\d+\.)+ ?")
    
    sections = []
    with open("copy-pasted.txt") as txt_file:
        section_lines = []  # initialize empty list
    
        for line in txt_file:
            line = line.strip()
    
            if line == "":
                continue
    
            if Sect_no.match(line):
                if section_lines:  # ignore initial empty section_lines
                    sections.append(section_lines)  # append last set of section lines
                section_lines = []  # reset for this new section
    
            section_lines.append(line)
    
    # capture last section
    if section_lines:
        sections.append(section_lines)
    

    Running that against the copy-pasted text gives me this two-dimensional array of lines, split up by section:

    [['Specification for Project'],
     ['1. Lorem ipsum dolor sit amet consectetur adipiscing elit.'],
     ['1.1. Pellentesque a sodales arcu, sed feugiat nibh. Pellentesque at fermentum odio, a molestie',
      'lorem. Ut eleifend sagittis porta.'],
     ['1.1.1. Integer sit amet consectetur erat. Duis sit amet urna quam. Pellentesque turpis tortor,',
     ...
      'quis purus. Cras vitae dui fringilla libero posuere varius at et velit.',
      'Specs That Those',
      'Spec 1 High 2.1m',
      'Spec 2 Low 0',
      'Nunc magna urna, sagittis sit amet interdum quis, finibus non dui. In pharetra risus tincidunt',
     ...
     ['3. Nunc eget maximus dolor. Integer orci purus, ultrices quis fringilla sit amet, blandit non erat.',
     ...
    

    The machine can use some work, like filtering out 'Specification for Project'; it will also pick up any other lines like headers, footers, or page counts.
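
    The filtering of headers, footers, and page counts mentioned above could be sketched roughly as follows. The names NOISE_PATTERNS and is_noise are hypothetical, and the patterns are guesses based on the sample's title line plus common "Page N of M" footers; the real document will need its own patterns.

```python
import re

# Hypothetical patterns for page furniture (assumptions, not taken from the real PDF).
NOISE_PATTERNS = [
    # Running header seen in the sample text
    re.compile(r"^Specification for Project$"),
    # Bare page numbers and footers like "12", "Page 12", "Page 12 of 500"
    re.compile(r"^(Page\s+)?\d+(\s+of\s+\d+)?$", re.IGNORECASE),
]


def is_noise(line):
    """Return True if the stripped line looks like a header, footer, or page number."""
    return any(pattern.match(line) for pattern in NOISE_PATTERNS)


lines = [
    "Specification for Project",
    "1. Lorem ipsum dolor sit amet consectetur adipiscing elit.",
    "Page 2 of 500",
    "1.1. Pellentesque a sodales arcu, sed feugiat nibh.",
    "3",
]
kept = [line for line in lines if not is_noise(line)]
```

    In the line loop, the check would sit next to the blank-line test, for example `if line == "" or is_noise(line): continue`. One caveat: a table cell that sits alone on a line and contains only a number would also be dropped by the page-number pattern.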

    From here I'll extract the section numbers, "reconstitute" the lines into paragraphs, and save it all to a CSV:

    import csv
    
    Row = {"Section No.": None, "Section paragraphs": None}
    
    rows = []
    for section_lines in sections:
    
        line0 = section_lines[0]
        match = Sect_no.match(line0)
    
        if not match:  # ignore initial header, or other first line that isn't a section
            continue
    
        sect_no = match.group(0).strip()
    
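        # initialize paragraphs (likely multiple paras) with first line, minus the leading section number
        # (slice at the end of the match; str.replace would also remove later occurrences such as "1.")
        paragraphs = line0[match.end():].strip()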
    
        # build up section's paragraphs
        # (still don't know what an actual sentence is, or where one para ends and another (or a table) begins)
        for line in section_lines[1:]:
            paragraphs += "\n" + line
    
        # copy Row template and save to list of rows
        row = dict(Row)
        row["Section No."] = sect_no
        row["Section paragraphs"] = paragraphs
        rows.append(row)
    
    with open("requirements.csv", "w", newline="") as csv_out:
        writer = csv.DictWriter(csv_out, fieldnames=Row)
        writer.writeheader()
        writer.writerows(rows)
    

    When I run that, my requirements.csv looks something like the following:

    +-------------+----------------------------------------------------+
    | Section No. | Section paragraphs                                 |
    +-------------+----------------------------------------------------+
    | 1.          | Lorem ipsum dolor sit amet consectetur adipisci... |
    +-------------+----------------------------------------------------+
    | 1.1.        | Pellentesque a sodales arcu, sed feugiat nibh. ... |
    |             | lorem. Ut eleifend sagittis porta.                 |
    +-------------+----------------------------------------------------+
    | 1.1.1.      | Integer sit amet consectetur erat. Duis sit ame... |
    |             | porttitor eget egestas in, tristique in urna. C... |
    |             | conubia nostra, per inceptos himenaeos. Etiam e... |
    |             | viverra, condimentum ex in, elementum est. Inte... |
    |             | in.                                                |
    +-------------+----------------------------------------------------+
    | 1.1.2.      | Interdum et malesuada fames ac ante ipsum primi... |
    |             | ante consequat scelerisque. Donec non leo lorem... |
    |             | condimentum. Aenean a tellus augue. Nullam veli... |
    |             | quis purus. Cras vitae dui fringilla libero pos... |
    |             | Specs That Those                                   |
    |             | Spec 1 High 2.1m                                   |
    |             | Spec 2 Low 0                                       |
    ...
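
    On the open question in the code comment about where a paragraph ends and a table begins, one rough heuristic is to treat short lines that do not end in sentence punctuation as table rows. This is only a sketch under that assumption; looks_like_table_row, split_prose_and_tables, and the 40-character threshold are all hypothetical and would need tuning against the real document.

```python
def looks_like_table_row(line, max_len=40):
    """Guess whether a line is a table row rather than wrapped prose.

    Assumption: wrapped prose lines are either long or end with sentence
    punctuation, while table rows are short and end with a cell value.
    """
    text = line.strip()
    return bool(text) and len(text) <= max_len and text[-1] not in ".,;:"


def split_prose_and_tables(section_lines):
    """Split one section's lines into (prose_lines, table_lines), keeping order."""
    prose, table = [], []
    for line in section_lines:
        if looks_like_table_row(line):
            table.append(line)
        else:
            prose.append(line)
    return prose, table


sample = [
    "quis purus. Cras vitae dui fringilla libero posuere varius at et velit.",
    "Specs That Those",
    "Spec 1 High 2.1m",
    "Spec 2 Low 0",
    "Nunc magna urna, sagittis sit amet interdum quis, finibus non dui. In pharetra risus tincidunt",
]
prose, table = split_prose_and_tables(sample)
```

    With the sample lines, the three "Spec..." rows land in table and the two prose lines in prose. A short heading, or a short final prose line without a full stop, would be misclassified as a table row, so the result still needs a manual check.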