Pyparsing: extract variable length, variable content, variable whitespace substring

I need to extract Gleason scores from a flat file of prostatectomy final diagnostic write-ups. These scores always have the word Gleason and two numbers that add up to another number. Humans typed these in over two decades. Various conventions of whitespace and modifiers are included. Below is my Backus-Naur form so far, and two example records. Just for prostatectomies, we're looking at upwards of a thousand cases.

I am using pyparsing because I'm learning Python, and I have no fond memories of my very limited exposure to writing regular expressions.

My question: how can I pluck out these Gleason grades without parsing every single other optional piece of data that may or may not be in these final diagnoses?

num ::= Word(nums)
record ::= accessionDate + accessionNumber + patMedicalRecordNum + finalDxText
accessionDate ::= num + "/" + num + "/" + num
accessionNumber ::= "S" + num + "-" + num
patMedicalRecordNum ::= num + "/" + num + "-" + num + "-" + num
finalDxText ::= listOfParts + optionalComment + optionalpTNMStage
listOfParts ::= OneOrMore(part)
part ::= <multiline idiosyncratic freetext which may contain a Gleason score I want> + optionalpTNMStage
optionalComment ::= <multiline idiosyncratic freetext which may contain a Gleason score I don't want>
optionalpTNMStage ::= <multiline idiosyncratic freetext which may contain a Gleason score I don't want>

01/01/11  S11-55555 20/444-55-6666 A.  PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:                           
                                   -  ADENOCARCINOMA.                                                      

                                   TOTAL GLEASON SCORE:  GLEASON 5+4=9                                     
                                   TUMOR LOCATION:  BILATERAL                                              
                                   TUMOR QUANTITATION:  15% OF PROSTATE INVOLVED BY TUMOR                  
                                   EXTRAPROSTATIC EXTENSION:  PRESENT AT RIGHT POSTERIOR                   
                                   SEMINAL VESICLE INVASION:  PRESENT                                      
                                   MARGINS:  UNINVOLVED                                                    
                                   LYMPHOVASCULAR INVASION:  PRESENT                                       
                                   PERINEURAL INVASION:  PRESENT                                           
                                   LYMPH NODES (SPECIMENS B AND C):                                        
                                      NUMBER EXAMINED:  25                                                 
                                      NUMBER INVOLVED:  1                                                  
                                      DIAMETER OF LARGEST METASTASIS:  1.7 mm                              

                                   PATHOLOGIC STAGE:  pT3b N1 MX                                           

                               B.  LYMPH NODES, RIGHT PELVIC, EXCISION:                                    
                                   -  ONE OF SEVENTEEN LYMPH NODES POSITIVE FOR METASTASIS (1/17).         

                               C.  LYMPH NODES, LEFT PELVIC, EXCISION:                                     
                                   -  EIGHT LYMPH NODES NEGATIVE FOR METASTASIS (0/8).                     
01/02/11  S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:                               
                                  - ADENOCARCINOMA.                                                        
                                    GLEASON SCORE:  3 + 3 = 6 WITH TERTIARY PATTERN OF 5.                                             
                                    TUMOR QUANTITATION:  APPROXIMATELY 10% BY VOLUME.                      
                                    TUMOR LOCATION:  BILATERAL.                                            
                                    EXTRAPROSTATIC EXTENSION:  NOT IDENTIFIED.                             
                                    MARGINS:  NEGATIVE.                                                    
                                    PERINEURAL INVASION:  IDENTIFIED.                                      
                                    LYMPH-VASCULAR INVASION:  NOT IDENTIFIED.                              
                                    SEMINAL VESICLE/VASA DEFERENTIA INVASION: NOT IDENTIFIED.              
                                    LYMPH NODES:  NONE SUBMITTED.                                          
                                    OTHER:  HIGH GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA.                
                               PATHOLOGIC STAGE (pTNM):  pT2c NX.                                       

Full disclosure: I'm a physician doing research; this is my first real work with Python. I have read Lutz's Learning Python, Shaw's Learn Python the Hard Way, and worked through various problem sets. I have reviewed numerous pyparsing-related questions on this forum and the pyparsing wiki, and I bought and read Mr McGuire's Getting Started with Pyparsing. Perhaps I am asking a question when I should really be told I am standing at "The death spiral of frustration that is so common when you have to write parsers" (McGuire, 17)? I don't know. So far I'm just happy to be working on what may actually be a real project.


  • Here is a sample to pull out the patient data and any matching Gleason data.

    from pyparsing import Word, nums, Combine, Group, Optional

    # `data` is assumed to hold the full text of the flat file as one string.
    num = Word(nums)
    accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
    accessionNumber = Combine("S" + num + "-" + num)("accNum")
    patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
    gleason = Group("GLEASON" + Optional("SCORE:") + num("left") + "+" + num("right") + "=" + num("total"))
    # Comparing a string to a pyparsing expression tests whether the whole string matches.
    assert 'GLEASON 5+4=9' == gleason
    assert 'GLEASON SCORE:  3 + 3 = 6' == gleason
    patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
    assert '01/02/11  S11-4444 20/111-22-3333' == patientData
    # Match either a record header or a Gleason score; searchString skips everything else.
    partMatch = patientData("patientData") | gleason("gleason")
    lastPatientData = None
    for match in partMatch.searchString(data):
        if match.patientData:
            lastPatientData = match
        elif match.gleason:
            if lastPatientData is None:
                # A Gleason score appeared before any record header.
                print("bad!")
                continue
            print("{0.accDate}: {0.accNum} {0.patientNum} Gleason({1.left}+{1.right}={1.total})".format(
                            lastPatientData.patientData, match.gleason))

    Prints:

    01/01/11: S11-55555 20/444-55-6666 Gleason(5+4=9)
    01/02/11: S11-4444 20/111-22-3333 Gleason(3+3=6)
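
If the scores need to go into a table rather than be printed, the same scan can be wrapped in a function. What follows is only a sketch under a few assumptions: the helper name `collect_gleason`, the dictionary keys, and the abbreviated `sample` text are made up for illustration and are not part of the original data or of pyparsing. It also checks that the two grades add up to the stated total, since the question says they always should, so typing errors get flagged.

```python
from pyparsing import Combine, Group, Optional, Word, nums

# Same building blocks as in the answer above.
num = Word(nums)
accession_date = Combine(num + "/" + num + "/" + num)("accDate")
accession_number = Combine("S" + num + "-" + num)("accNum")
patient_number = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
patient_data = Group(accession_date + accession_number + patient_number)
gleason = Group("GLEASON" + Optional("SCORE:")
                + num("left") + "+" + num("right") + "=" + num("total"))
part_match = patient_data("patientData") | gleason("gleason")


def collect_gleason(text):
    """Return one dict per Gleason score, tagged with the preceding record header.

    Hypothetical helper: scores that appear before any header are skipped.
    """
    rows = []
    header = None
    for match in part_match.searchString(text):
        if match.patientData:
            header = match.patientData
        elif match.gleason and header is not None:
            g = match.gleason
            left, right, total = int(g.left), int(g.right), int(g.total)
            rows.append({
                "accNum": header.accNum,
                "score": (left, right, total),
                # Flag entries where the two grades do not add up to the total.
                "consistent": left + right == total,
            })
    return rows


# Hypothetical abbreviated records, modeled on the two samples in the question.
sample = """\
01/01/11  S11-55555 20/444-55-6666 A.  PROSTATE, PROSTATECTOMY:
                                   TOTAL GLEASON SCORE:  GLEASON 5+4=9
01/02/11  S11-4444 20/111-22-3333 PROSTATE, PROSTATECTOMY:
                                    GLEASON SCORE:  3 + 3 = 6 WITH TERTIARY PATTERN OF 5.
"""

for row in collect_gleason(sample):
    print(row["accNum"], row["score"], row["consistent"])
```

Note that this sketch does not tell a wanted score in a part apart from an unwanted one in a comment or staging section; it reports every match it finds after a record header, so that filtering would still have to be added.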