Search code examples
pythonregextextregex-lookaroundsdata-processing

Extracting substring from PDF raw text using regex


I was trying to extract the subsection that has the roman indexing from a pdf document.

For instance this is one section of the document,

\n1.1\n \nSCOPE\n \nThis PTS specifies the\n \nrequirements \nand recommendations for Classification, Verification \n\nFunct\nions.\n \nThe scope includes the following:\n \ni.\n \nSemi\n-\nquantitative SIL classification\n \nii.\n \nSpurious trip analysis\n \niii.\n \nProbabilistic and architectural SIL verification\n \niv.\n \nRecommendations\n \nfor SIL gap closure'

what I want is only below:

This PTS specifies the\n \nrequirements \nand recommendations for Classification, Verification \n\nFunct\nions.\n \nThe scope includes the following:\n \ni.\n \nSemi\n-\nquantitative SIL classification\n \nii.\n \nSpurious trip analysis\n \niii.\n \nProbabilistic and architectural SIL verification\n \niv.\n \nRecommendations\n \nfor SIL gap closure

I need the sentence before the roman indexing as well as the content inside roman indexing.

However, there are also cases like below

3.1.3\n \nDo\nc\numentation\n \nrequired\n \nT\nh\ne\n \nl\nat\ne\ns\nt\n \nissue\n \nof\n \nt\nh\ne\n \nf\no\nllo\nw\ni\nng\n \ndocume\nn\nts\n \nshall\n \nbe\n \nav\na\nilab\nl\ne\n \nto\n \nthe\n \nte\na\nm\n \np\ne\nrf\no\nrm\ni\nng\n \nt\nh\ne \nc\nl\nass\ni\nf\ni\ncati\no\nn:\n \ni.\n \nMandatory reference document\n \na)\n \nCause and effect matrices (CEM)\n \nb)\n \nPiping and Instrument Diagram (P&ID) or Process and utility engineering \nflow schemes (PEFS)\n \nc)\n \nHAZOP report\n \nd)\n \nIPF reliability data\n \nii.\n \nOther reference document\n \na)\n \nProcess Flow Diagram (PFD) or Process Fl\now Scheme (PFS)\n \nb)\n \nPlant layout drawing\n \nc)\n \nProcess safeguarding flow schemes (PSFS)\n \nd)\n \nControl narratives\n \ne)\n \nInterlocks/ ESD logic diagram\n \nf)\n \nEquipment layout diagram\n \ng)\n \nMaintenance and Inspection Data\n \nh)\n \nPlant historian data\n \n \nT\nh\ne\n \nl\ni\ns\nt\n \na\nb\no\nve\n \nis\n \nn\no\nt\n \ne\nx\nh\na\nu\nsti\nv\ne. Any\n \not\nh\ne\nr\n \ndo\nc\nu\nm\ne\nn\nt\ns\n/ \nd\nr\na\nw\nin\ng\ns\n \nreq\nu\nir\ne\nd\n \nf\no\nr\n \nt\nhe \nc\nom\np\nletion\n \no\nf the\n \nIPF\n \ns\nt\nu\nd\ny\n \ns\nh\na\nll\n \nbe\n \nf\nu\nr\nn\nished\n \nas\n \na\nn\nd\n \nw\nhen\n \nre\nq\nui\nr\ne\nd\n.\n \n

I've converted the pdf into raw text and I've managed to extract section of the document. The

regx = re.compile( '\.\n \n.+?:\n \n',re.DOTALL)
find = str(txt)
indexhead.append((regx.findall(find)))

The above code can only extract the headline but not the roman indexing together

.\n \nThe scope includes the following:\n \n

I'm trying to extract based on the pattern but I'm thinking maybe some conditional rules might help.


Solution

  • If I understand the problem right, we would want to just take out the roman indices, and get the entire paragraph, which we would start with a simple expression such as:

    .+[0-9]\.?.+?([A-Z][a-z].*)
    

    then as new cases come up, we would just use logical ORs and add additional rules.

    Demo

    Test

    # coding=utf8
    # the above tag defines encoding for this document and is for Python 2.x compatibility
    
    import re
    
    regex = r".+[0-9]\.?.+?([A-Z][a-z].*)"
    
    test_str = ("\\n1.1\\n \\nSCOPE\\n \\nThis PTS specifies the\\n \\nrequirements \\nand recommendations for Classification, Verification \\n\\nFunct\\nions.\\n \\nThe scope includes the following:\\n \\ni.\\n \\nSemi\\n-\\nquantitative SIL classification\\n \\nii.\\n \\nSpurious trip analysis\\n \\niii.\\n \\nProbabilistic and architectural SIL verification\\n \\niv.\\n \\nRecommendations\\n \\nfor SIL gap closure'\n\n"
        "3.1.3\\n \\nDo\\nc\\numentation\\n \\nrequired\\n \\nT\\nh\\ne\\n \\nl\\nat\\ne\\ns\\nt\\n \\nissue\\n \\nof\\n \\nt\\nh\\ne\\n \\nf\\no\\nllo\\nw\\ni\\nng\\n \\ndocume\\nn\\nts\\n \\nshall\\n \\nbe\\n \\nav\\na\\nilab\\nl\\ne\\n \\nto\\n \\nthe\\n \\nte\\na\\nm\\n \\np\\ne\\nrf\\no\\nrm\\ni\\nng\\n \\nt\\nh\\ne \\nc\\nl\\nass\\ni\\nf\\ni\\ncati\\no\\nn:\\n \\ni.\\n \\nMandatory reference document\\n \\na)\\n \\nCause and effect matrices (CEM)\\n \\nb)\\n \\nPiping and Instrument Diagram (P&ID) or Process and utility engineering \\nflow schemes (PEFS)\\n \\nc)\\n \\nHAZOP report\\n \\nd)\\n \\nIPF reliability data\\n \\nii.\\n \\nOther reference document\\n \\na)\\n \\nProcess Flow Diagram (PFD) or Process Fl\\now Scheme (PFS)\\n \\nb)\\n \\nPlant layout drawing\\n \\nc)\\n \\nProcess safeguarding flow schemes (PSFS)\\n \\nd)\\n \\nControl narratives\\n \\ne)\\n \\nInterlocks/ ESD logic diagram\\n \\nf)\\n \\nEquipment layout diagram\\n \\ng)\\n \\nMaintenance and Inspection Data\\n \\nh)\\n \\nPlant historian data\\n \\n \\nT\\nh\\ne\\n \\nl\\ni\\ns\\nt\\n \\na\\nb\\no\\nve\\n \\nis\\n \\nn\\no\\nt\\n \\ne\\nx\\nh\\na\\nu\\nsti\\nv\\ne. Any\\n \\not\\nh\\ne\\nr\\n \\ndo\\nc\\nu\\nm\\ne\\nn\\nt\\ns\\n/ \\nd\\nr\\na\\nw\\nin\\ng\\ns\\n \\nreq\\nu\\nir\\ne\\nd\\n \\nf\\no\\nr\\n \\nt\\nhe \\nc\\nom\\np\\nletion\\n \\no\\nf the\\n \\nIPF\\n \\ns\\nt\\nu\\nd\\ny\\n \\ns\\nh\\na\\nll\\n \\nbe\\n \\nf\\nu\\nr\\nn\\nished\\n \\nas\\n \\na\\nn\\nd\\n \\nw\\nhen\\n \\nre\\nq\\nui\\nr\\ne\\nd\\n.\\n \\n")
    
    subst = "\\1"
    
    # You can manually specify the number of replacements by changing the 4th argument
    result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
    
    if result:
        print (result)
    
    # Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
    

    RegEx

    If this expression wasn't desired, it can be modified/changed in regex101.com.

    RegEx Circuit

    jex.im visualizes regular expressions:

    enter image description here