My engineering team is gearing up to bid on a public project whose specifications document is huge (~500 pages). I would like to break it down clause by clause in a spreadsheet and then assign each team its relevant "portion". I checked, and a PDF is the only format in which these specs are provided.
The idea is to record it so that we can compare it with the specifications of previous projects, which are recorded in a similar manner. I am still a trainee, so I am not aware of how this process works at other companies, but here on my team the last project was documented manually in the same way.
The pages are arranged in indexed paragraphs (1, 1.1, 1.1.1, etc.) with some tables and figures thrown in.
I hope to get a table like this:
| Clause No. | Clause Para |
|---|---|
| 1.1. | Lorem Ipsum |
| 1.2. | Lorem Ipsum |
I asked around on PM Stack Exchange in case someone knew of a software suite for this, but it seems there aren't many options.
So I turned to R, hoping I could solve this by parsing the PDF with pdftools and a regex. The regex runs on regex101.com to some extent (it picks up a few paragraphs seemingly at random, but fails when it encounters a table), yet it does not return the same matches when used from R.
I have no commitment to R; it just happened to be readily available on my work laptop. I am willing to try Python or any other toolkit as well.
So far, I have been stuck on getting R to extract even a single paragraph.
library(pdftools)
library(dplyr)
library(stringr)
library(purrr)
setwd("The work Dir/")
specDoc <- pdf_text("Spec Doc.pdf") %>% strsplit(split = "\n")
specDocChar <- as.character(specDoc)
get_clause <- str_trim(str_extract(specDocChar, "(?:^\n*(?:\\d\\.(?:\\d\\.)*)+)(.+?)$"))
get_clause
I also tried a lookbehind, but it does not seem to work with variable-length starting strings.
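For what it's worth, Python's re module (which I might switch to) has the same fixed-width restriction on lookbehinds, and a capture group avoids the need for a lookbehind entirely; the sample string here is made up:

```python
import re

# Many regex engines reject variable-length lookbehinds such as (?<=(\d\.)+).
try:
    re.compile(r"(?<=(\d\.)+)\w+")
except re.error as exc:
    print("variable-length lookbehind rejected:", exc)

# A capture group sidesteps the limitation: match the section number
# and the paragraph text as separate groups instead of looking behind.
m = re.match(r"^((?:\d+\.)+)\s*(.*)$", "1.1.2. Lorem ipsum dolor")
print(m.group(1))  # '1.1.2.'
print(m.group(2))  # 'Lorem ipsum dolor'
```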
At this point, I mainly wish to know two things.
As I've mentioned in our discussion/chat, this will be difficult and certainly imperfect.
I've tried running your sample PDF through the following automatic extractors:
- pdfminer.six's pdf2txt.py CLI tool
- Poppler's pdftotext CLI tool, which I believe the R library you're using, pdftools, is based on
and they both produced the same text, which completely loses the original structure:
1.
1.1.
1.1.1.
1.1.2.
Lorem ipsum dolor sit amet consectetur adipiscing elit.
Pellentesque a sodales arcu, sed feugiat nibh. Pellentesque at fermentum odio, a molestie
lorem. Ut eleifend sagittis porta.
Integer sit amet consectetur erat. Duis sit amet urna quam. Pellentesque turpis tortor,
porttitor eget egestas in, tristique in urna. Class aptent taciti sociosqu ad litora torquent per
conubia nostra, per inceptos himenaeos. Etiam eleifend tincidunt volutpat. Curabitur eu enim
viverra, condimentum ex in, elementum est. Integer blandit arcu ex, at interdum orci viverra
in.
Now, the real PDF may be composed differently and the extractors may do better (🤞). But let's try to move on...
The best I could do with that LoremIpsumSpecs.pdf sample was to open it in Acrobat Reader, Edit → More → Select All, then copy-paste into a text editor, which gives something like the following:
Specification for Project
1. Lorem ipsum dolor sit amet consectetur adipiscing elit.
1.1. Pellentesque a sodales arcu, sed feugiat nibh. Pellentesque at fermentum odio, a molestie
lorem. Ut eleifend sagittis porta.
1.1.1. Integer sit amet consectetur erat. Duis sit amet urna quam. Pellentesque turpis tortor,
...
quis purus. Cras vitae dui fringilla libero posuere varius at et velit.
Specs That Those
Spec 1 High 2.1m
Spec 2 Low 0
Nunc magna urna, sagittis sit amet interdum quis, finibus non dui. In pharetra risus tincidunt
...
3. Nunc eget maximus dolor. Integer orci purus, ultrices quis fringilla sit amet, blandit non erat.
...
which preserves the structure of the section numbers and paragraphs, as well as the table.
Does that resemble the text you're getting in your R script?
If so, I would avoid trying to write one regex to capture a "paragraph". Instead, iterate over the text line by line and use a little state machine to collect lines for every section number that is seen.
Here's what I came up with, in Python:
import re

# Expect that section numbers delimit requirements. Look for a section number to be:
# line-start, followed by one or more runs of digits each ending in a period,
# followed by an optional space
# e.g.: '1. ', '1.1.2. ', '1.9.9.9.9.9. ', '10.1. '
Sect_no = re.compile(r"^(\d+\.)+ ?")

sections = []
with open("copy-pasted.txt") as txt_file:
    section_lines = []  # initialize empty list
    for line in txt_file:
        line = line.strip()
        if line == "":
            continue
        if Sect_no.match(line):
            if section_lines:  # ignore initial empty section_lines
                sections.append(section_lines)  # append last set of section lines
            section_lines = []  # reset for this new section
        section_lines.append(line)

# capture last section
if section_lines:
    sections.append(section_lines)
Running that against the copy-pasted text gives me this two-dimensional array of lines, split up by section:
[['Specification for Project'],
['1. Lorem ipsum dolor sit amet consectetur adipiscing elit.'],
['1.1. Pellentesque a sodales arcu, sed feugiat nibh. Pellentesque at fermentum odio, a molestie',
'lorem. Ut eleifend sagittis porta.'],
['1.1.1. Integer sit amet consectetur erat. Duis sit amet urna quam. Pellentesque turpis tortor,',
...
'quis purus. Cras vitae dui fringilla libero posuere varius at et velit.',
'Specs That Those',
'Spec 1 High 2.1m',
'Spec 2 Low 0',
'Nunc magna urna, sagittis sit amet interdum quis, finibus non dui. In pharetra risus tincidunt',
...
['3. Nunc eget maximus dolor. Integer orci purus, ultrices quis fringilla sit amet, blandit non erat.',
...
The state machine could use some work, like filtering out 'Specification for Project'; it will also pick up any other stray lines such as headers, footers, or page numbers.
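One rough heuristic for those stray lines, sketched here under the assumption that headers and footers repeat verbatim on every page (the function name and threshold are mine, not part of the script above):

```python
from collections import Counter

def drop_repeated_lines(lines, min_repeats=3):
    # Lines that recur verbatim at least `min_repeats` times are likely
    # per-page boilerplate (headers, footers, page numbers); drop them.
    counts = Counter(lines)
    return [ln for ln in lines if counts[ln] < min_repeats]

# Hypothetical example: "Project Spec" appears at the top of every page.
lines = ["Project Spec", "1. Lorem ipsum", "Project Spec",
         "1.1. Dolor sit amet", "Project Spec"]
print(drop_repeated_lines(lines))  # ['1. Lorem ipsum', '1.1. Dolor sit amet']
```

Exact page numbers won't repeat verbatim, so those would still need a separate pattern check.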
From here I'll extract the section numbers, "reconstitute" the lines into paragraphs, and save it all to a CSV:
import csv

Row = {"Section No.": None, "Section paragraphs": None}
rows = []
for section_lines in sections:
    line0 = section_lines[0]
    match = Sect_no.match(line0)
    if not match:  # ignore initial header, or any other first line that isn't a section
        continue
    sect_no = match.group(0).strip()
    # initialize paragraphs (likely multiple paras) with first line, minus section number
    # (slice from the match end so a repeated occurrence of the number isn't touched)
    paragraphs = line0[match.end():].strip()
    # build up section's paragraphs
    # (still don't know what an actual sentence is, or where one para ends and another (or a table) begins)
    for line in section_lines[1:]:
        paragraphs += "\n" + line
    # copy Row template and save to list of rows
    row = dict(Row)
    row["Section No."] = sect_no
    row["Section paragraphs"] = paragraphs
    rows.append(row)

with open("requirements.csv", "w", newline="") as csv_out:
    writer = csv.DictWriter(csv_out, fieldnames=Row)
    writer.writeheader()
    writer.writerows(rows)
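Since your end goal is comparing against previous projects recorded the same way, here is a minimal sketch of that comparison, assuming both CSVs use the same two column names; the file names and helper functions are hypothetical:

```python
import csv

def load_sections(path):
    # Map each "Section No." to its paragraphs from a requirements CSV.
    with open(path, newline="") as f:
        return {row["Section No."]: row["Section paragraphs"]
                for row in csv.DictReader(f)}

def compare(old, new):
    # Report section numbers added, removed, or changed between two projects.
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, removed, changed

# e.g.: compare(load_sections("previous_project.csv"),
#               load_sections("requirements.csv"))
```

Matching on clause number alone is naive if the numbering scheme shifts between revisions, but it is a start.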
When I run that, my requirements.csv looks something like the following:
+-------------+----------------------------------------------------+
| Section No. | Section paragraphs |
+-------------+----------------------------------------------------+
| 1. | Lorem ipsum dolor sit amet consectetur adipisci... |
+-------------+----------------------------------------------------+
| 1.1. | Pellentesque a sodales arcu, sed feugiat nibh. ... |
| | lorem. Ut eleifend sagittis porta. |
+-------------+----------------------------------------------------+
| 1.1.1. | Integer sit amet consectetur erat. Duis sit ame... |
| | porttitor eget egestas in, tristique in urna. C... |
| | conubia nostra, per inceptos himenaeos. Etiam e... |
| | viverra, condimentum ex in, elementum est. Inte... |
| | in. |
+-------------+----------------------------------------------------+
| 1.1.2. | Interdum et malesuada fames ac ante ipsum primi... |
| | ante consequat scelerisque. Donec non leo lorem... |
| | condimentum. Aenean a tellus augue. Nullam veli... |
| | quis purus. Cras vitae dui fringilla libero pos... |
| | Specs That Those |
| | Spec 1 High 2.1m |
| | Spec 2 Low 0 |
...