Search code examples
javaregexalgorithmparsingplaintext

Parsing data from structured document with ambiguous labeling


I am attempting to move legal documents from ancient SGML files into a database. Using regular expressions in java, I have had pretty good luck. However, I have run into a slight problem. It appears that the labeling for each part of the document is not standard between documents. For instance, the most common labeling is:

(<numeric>)
    (<alpah>)
        (<ROMAN>)
            (<ALPHA>)

Ex. (1)(a)(I)(A)

However, there are other documents that have variations, with the possibility of () being thrown in there. My current algorithm has hard coded RegExs that match each element at each level. But I need a way to dynamically set the label type for each level as I work my way through the document.

Has anyone come across an issue like this? Does anyone have any suggestions?

Thanks in advance.

EDIT:

Here are the RegExs I use to parse out the different items:

Section: ^<tab>(<b>)?\d{1,4}(\.\d+)?-((\d{1,4}(\.\d+)?)(-|\.)?){3}
SubSection: \.?\s*(<\/b>|<tab>|^)\s*\(\d+(\.\d+)?\)\s+($|<b>|[A-Z"]|\([a-z](.\d+)?\)\s*(\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s*(\([A-Z](.\d+)?\))?)?\s*.)
Paragraph: (^|<tab>|\s+|\(\d+(\.\d+)?\)\s+)\([a-z](.\d+)?\)(\s+$|\s+<b>|\s+[A-Z"]|\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)(\([A-Z](.\d+)?\))?\s*[A-Z"]?)
SubParagraph: (\)|<tab>|<\/b>)\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s+($|[A-Z"<]|\([A-Z](.\d+)?\)\s*[A-Z"])
SubSubParagraph: (<tab>|\)\s*)\([A-Z](.\d+)?\)\s+([A-Z"]|$)

And here's some example text. I misspoke earlier. Although the ultimate source of the data is in SGML, I am parsing something slightly different. Besides having style tags, it is more or less plain text.

<tab><b>SECTION 5.</b>  In Colorado Revised Statutes, 13-5-142, <b>amend</b> (1)
introductory portion, (1)(b), and (3)(b)(II) as follows:

<tab><b>13-5-142.  National instant criminal background check system - reporting.</b>
(1)  On and after March 20, 2013, the state court administrator shall send electronically
the following information to the Colorado bureau of investigation created pursuant to
section 24-33.5-401, referred to in this section as the "bureau":

<tab>(b)  The name of each person who has been committed by order of the court to the
custody of the office of behavioral health in the department of human services pursuant
to section 27-81-112 or 27-82-108; and

<tab>(3)  The state court administrator shall take all necessary steps to cancel a record
made by the state court administrator in the national instant criminal background check
system if:

<tab>(b)  No less than three years before the date of the written request:

<tab>(II)  The period of commitment of the most recent order of commitment or
recommitment expired, or a court entered an order terminating the person's incapacity or
discharging the person from commitment in the nature of habeas corpus, if the record in
the national instant criminal background check system is based on an order of
commitment to the custody of the office of behavioral health in the department of human
services; except that the state court administrator shall not cancel any record pertaining to
a person with respect to whom two recommitment orders have been entered pursuant to
section 27-81-112 (7) and (8), or who was discharged from treatment pursuant to section
27-81-112 (11) on the grounds that further treatment is not likely to bring about
significant improvement in the person's condition; or

Solution

  • Your statement of the problem is vague, so the only possible answers are general approaches. I've worked on imprecisely formatted document conversions like this.

    A tool from CS that can help is a state machine. It's appropriate if you can detect (e.g. with regex) that the format is changing to new convention. This changes the state, which in this case is equivalent to the translator to use on the current and following chunks of text. It remains in effect until the next state change. Overall the algorithm looks like this:

    translator = DEFAULT 
    while (chunks of input remain) {
      chunk = GetNextChunkOfInput // a line, paragraph, etc.
      new_translator = ScanChunkForStateChange(chunk, translator)
      if (new_translator != null) translator = new_translator // found a state change!
      print(translator.Translate(chunk))  // use the translator on the chunk
    }
    

    Within this framework, it's a fiddly process to design translators and the state change predicate. All you can hope to do is try, examine output, and fix problems, repeating until you can't get any better. At that point you've probably discovered a maximal structure in the input, so algorithms that work with pattern matching alone (without trying to model semantics e.g. with an AI) won't get you too much farther.