python, nlp, spacy

Regular expression and rule-based Matcher to extract legal citation title and volume


I am trying to extract the case title, volume and pages from inconsistent legal documents. I am using two approaches: regex, and spaCy rule-based matching with entity and POS tags (still learning this...). I get over half of the citations with regex (thanks to the answer code below) but zero with spaCy. My code is:

import re
import spacy

# Load the small English pipeline once
nlp = spacy.load('en_core_web_sm')

from spacy.matcher import Matcher
m_tool = Matcher(nlp.vocab)

# Read the raw text; the regex further down runs on this string
with open('text1.txt', mode='r', encoding='utf-8') as f:
    contents = f.read()
#print(contents)

doc = nlp(contents)
#print([(ent.text, ent.label_) for ent in doc.ents])


# IS_TITLE is a boolean flag (True = title-cased token), not a POS tag
p1 = [{'IS_TITLE': True}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': True}]
p2 = [{'IS_TITLE': True}, {'IS_TITLE': True}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': True}]
p3 = [{'IS_TITLE': True}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': True}, {'IS_TITLE': True}]
p4 = [{'IS_TITLE': True}, {'IS_TITLE': True}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': True}, {'IS_TITLE': True}]
p5 = [{'IS_TITLE': True}, {'IS_TITLE': True}, {'IS_TITLE': True}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': True}, {'IS_TITLE': True}, {'IS_TITLE': True}]
p6 = [{'IS_TITLE': True}, {'IS_TITLE': True}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': True}, {'IS_TITLE': True}, {'IS_TITLE': True}]
p7 = [{'IS_TITLE': True}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': True}, {'IS_TITLE': True}, {'IS_TITLE': True}]
p8 = [{'IS_TITLE': True}, {'IS_TITLE': True}, {'IS_TITLE': True}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': True}]
p9 = [{'IS_TITLE': True}, {'IS_TITLE': True}, {'IS_TITLE': True}, {'LOWER': 'v'}, {'IS_PUNCT': True}, {'IS_TITLE': True}, {'IS_TITLE': True}]
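# ENT_TYPE is the Matcher attribute for a token's entity label
p10 = [{'ENT_TYPE': 'PERSON'}]
p11 = [{'ENT_TYPE': 'ORG'}, {'ENT_TYPE': 'PERSON'}]
p12 = [{'ENT_TYPE': 'PERSON'}, {'ENT_TYPE': 'ORG'}]
p13 = [{'ENT_TYPE': 'ORG'}, {'ENT_TYPE': 'ORG'}, {'ENT_TYPE': 'ORG'}, {'ENT_TYPE': 'ORG'}]

# add(key, patterns) takes all patterns as one list (spaCy v3 signature)
m_tool.add('QBF', [p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13])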

phrase_matches = m_tool(doc)
print(phrase_matches)

matches = re.findall(r'(?:[A-Z]\w*\.? )+v\. .*?\d{4}\)', contents)
for match in matches:
    print(match)

My text1 looks like

text1 = "material fact challenged. Brill v. Guardian Life Ins. Co. of America, 142 N.J. 520, 529 (1995)
(emphasis original).
When a movant establishes certain facts, those who would oppose the motion are under See Della v. Guard Lifal Ins. Co. of SA, 142 N.J. 420, 549 (2011)
an obligation to come forward with controverting facts. Heljon Mgmt. Corp. v. DiLeo, 55 N.J.
Super. 306, 312-13 (No Citations. This was extracted from NJ Sup..). Mere assertions and allegations in the pleadings are
insufficient to defeat motions for summary judgment. Ocean Cape Hotel Corp. v. Masefield
Corp., 63 N.J. Super. 369, 383 (App. Div. 1960). Where the party opposing summary
 "

I expect both approaches to return all of these matches:

"Brill v. Guardian Life Ins. Co. of America, 142 N.J. 520, 529 (1995)"
"Della v. Guard Lifal Ins. Co. of SA, 142 N.J. 420, 549 (2011)"
"Heljon Mgmt. Corp. v. DiLeo, 55 N.J. Super. 306, 312-13 (No Citations. This was extracted from NJ Sup..)"
"Ocean Cape Hotel Corp. v. Masefield Corp., 63 N.J. Super. 369, 383 (App. Div. 1960)"

Solution

  • I am not sure if it will work in all cases, but you can try this:

    matches = re.findall(r"(?:[A-Z]\w*\.? )+v\. .*?\d{4}\)", contents)
    

    It gives:

    ['Brill v. Guardian Life Ins. Co. of America, 142 N.J. 520, 529 (1995)',
     'Heljon Mgmt. Corp. v. DiLeo, 55 N.J. Super. 306, 312-13 (App. Div. 1959)',
     'Ocean Cape Hotel Corp. v. Masefield Corp., 63 N.J. Super. 369, 383 (App. Div. 1960)']
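
One possible refinement, offered as a sketch rather than a tested solution: in the sample text some citations wrap across lines, and `.` in the pattern does not match a newline, so a wrapped citation can be missed. The code below collapses whitespace before matching and then splits each match into fields. The names `find_citations`, `parse_citation` and `FIELDS_RE`, and the field names, are illustrative assumptions and not part of the original answer. The field pattern assumes the shape "Title, volume reporter page, pin (court/year)" and returns `None` for anything else.

```python
import re

# Same pattern as in the answer above.
CITATION_RE = re.compile(r"(?:[A-Z]\w*\.? )+v\. .*?\d{4}\)")

# Hypothetical field pattern: "Title, volume reporter page[, pin] (court/year)".
FIELDS_RE = re.compile(
    r"(?P<title>.+?), (?P<volume>\d+) (?P<reporter>.+?) (?P<page>\d+)"
    r"(?:, (?P<pin>[\d-]+))? \((?P<court_year>[^)]*)\)"
)


def find_citations(text):
    """Collapse all whitespace runs (newlines included) to one space, then match."""
    flat = " ".join(text.split())
    return CITATION_RE.findall(flat)


def parse_citation(citation):
    """Return a dict of named fields, or None if the citation has another shape."""
    m = FIELDS_RE.fullmatch(citation)
    return m.groupdict() if m else None


# Two citations taken from the question's text1, one of them wrapped across lines.
sample = (
    "material fact challenged. Brill v. Guardian Life Ins. Co. of America, 142 N.J. 520, 529 (1995)\n"
    "(emphasis original).\n"
    "insufficient to defeat motions for summary judgment. Ocean Cape Hotel Corp. v. Masefield\n"
    "Corp., 63 N.J. Super. 369, 383 (App. Div. 1960). Where the party opposing summary\n"
)

for citation in find_citations(sample):
    print(citation)
    print(parse_citation(citation))
```

Two caveats follow from the pattern itself. A capitalized signal word directly before the case name, such as "See", is captured as part of the title. And once the text is flattened, a citation whose parenthetical does not end in a four-digit year lets the lazy `.*?` run on to the next year, which can swallow the following citation.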