
Regex to split a summary from a txt file


I'm trying to extract some content from a txt file (converted from a PDF). Note that there may or may not be a space after the label or before the page number, and sometimes there is no \n between a page number and the next X.Y.Z code. Here is a sample:

Summary
...
30. LOT 03 CLOS COUVERT.................... 29 
30.1. LOT 03-1 ETANCHEITE .................... 29 
30.1.1. Travaux divers.................... 29 
30.1.1.1. Boite à eau.................... 29 
30.1.1.2. Descentes d’eaux pluviales en façades .................30 
30.1.1.3. Lanterneau de désenfumage...............30 30.1.1.4. Etanchéité résine..................31 
...

The structure of the summary is:

X.Y.Z. Label ....................... Page_number

Later in the same document, we can find the associated descriptions:

30. LOT 03 CLOS COUVERT (no description here)
30.1. LOT 03-1 ETANCHEITE (no description here)
...
30.1.1.1. Boite à eau 
Composition : -Descente d\’eaux pluviales 
o en zinc naturel, épaisseur 0.80 mm 
o en tle laquée jaune sur les zones d\’enduit jaune -Moignon cylindrique du diamètre de la descente EP -Trop-plein rectangulaire positionné sur face avant etc ...
30.1.1.2. Descentes d\’eaux pluviales en façades 
Fourniture et pose de descentes d\'eaux en zinc extérieures type VM Zinc
...

My use case is to put each "X.Y.Z Label" into a Python dict as a key (taken from the summary only) and associate the matching description with it. The expected output looks like this:

{ '30. LOT 03 CLOS COUVERT' : '',
'30.1. LOT 03-1 ETANCHEITE' : '',
...
'30.1.1.1. Boite à eau' : 'Composition : -Descente d’eaux pluviales 
o en zinc naturel, épaisseur 0.80 mm 
o en tle laquée jaune sur les zones d\’enduit jaune -Moignon cylindrique du diamètre de la descente EP -Trop-plein rectangulaire positionné sur face avant etc ...',
'30.1.1.2. Descentes d\’eaux pluviales en façades' : 'Fourniture et pose de descentes d\'eaux en zinc extérieures type VM Zinc...'}

I tried this regex, which gives the best result I could get, but it is still not good enough:

(\d+[.])(.*)?[.]*\s*\d+

My problems are handling the dots, extracting the label, and the missing \n.

Could you please help me?


Solution

  • You might use 2 capture groups, where the first group is the key (the X.Y.Z number) and the second group is the description (the label text that follows the number).

    If you specifically want to leave the run of dots out of the match:

    \b(\d+(?:\.\d+)*)\.[^\S\n]+([\s\S]*?)\.{2,}\s*\d+\b
    


    Example

    import pprint
    import re

    # Group 1: the X.Y.Z number. Group 2: the label, stopping before the run of dots.
    regex = r"\b(\d+(?:\.\d+)*)\.[^\S\n]+([\s\S]*?)\.{2,}\s*\d+\b"

    # The third line has no \n between page number 30 and the next entry.
    s = ("> Summary\n"
         "30.1.3.1. Boite à eau................................................................................................................................................. 29 \n"
         "30.1.3.2. Descentes d’eaux pluviales en façades ....................................................................................................30 \n"
         "30.1.3.3. Lanterneau de désenfumage.....................................................................................................................30 30.1.3.4. Etanchéité résine.......................................................................................................................................31")

    # findall returns (number, label) tuples, which dict() turns into key/value pairs.
    pprint.pprint(dict(re.findall(regex, s)))
    

    Output

    {'30.1.3.1': 'Boite à eau',
     '30.1.3.2': 'Descentes d’eaux pluviales en façades ',
     '30.1.3.3': 'Lanterneau de désenfumage',
     '30.1.3.4': 'Etanchéité résine'}
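
To make the pattern above easier to read, here is a sketch of the same regex written with re.VERBOSE, one commented part per line. The helper name summary_keys and the sample string are only illustrations, not part of the original answer. The helper joins the number and the stripped label into the 'X.Y.Z. Label' form that the question uses for its dict keys.

```python
import re

# Same pattern as above, split up with re.VERBOSE so each part can be commented.
SUMMARY_ENTRY = re.compile(
    r"""
    \b(\d+(?:\.\d+)*)   # group 1: the number, e.g. 30 or 30.1.1.2
    \.                  # the dot that closes the number
    [^\S\n]+            # one or more whitespace characters, but not a newline
    ([\s\S]*?)          # group 2: the label, matched as short as possible
    \.{2,}              # the run of two or more dots
    \s*\d+\b            # optional whitespace, then the page number
    """,
    re.VERBOSE,
)


def summary_keys(summary: str) -> list[str]:
    """Return keys in the 'X.Y.Z. Label' form (hypothetical helper)."""
    return [
        f"{number}. {label.strip()}"
        for number, label in SUMMARY_ENTRY.findall(summary)
    ]


# Shortened sample in the same shape as the summary from the question.
sample = (
    "30. LOT 03 CLOS COUVERT.................... 29 \n"
    "30.1.1.2. Descentes d’eaux pluviales en façades .................30 \n"
    "30.1.1.3. Lanterneau de désenfumage...............30 "
    "30.1.1.4. Etanchéité résine..................31 \n"
)

keys = summary_keys(sample)
for key in keys:
    print(key)
```

Calling strip() on the label removes the space that sometimes sits between the label and the dots, so the keys come out the same whether or not the PDF conversion left that space in.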
    

    If you want group 2 to keep the dots that precede the page number as well, let it match lazily until a lookahead finds either a page number followed by the start of the next key or, for the last entry, a page number followed by the end of the string.

    \b(\d+(?:\.\d+)*)\.[^\S\n]+([\s\S]*?)\s*\b(?=\d+\s*$|\d+\s+\d+(?:\.\d+)*\.)
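
As a small illustration of this lookahead variant in Python, the sketch below runs it over a shortened sample. The sample string and the variable names are assumptions made for the example. The page number is only checked by the lookahead, so it is not part of either group.

```python
import re

# Lookahead variant: group 2 keeps the dots; the page number is asserted, not consumed.
keep_dots = re.compile(
    r"\b(\d+(?:\.\d+)*)\.[^\S\n]+([\s\S]*?)\s*\b(?=\d+\s*$|\d+\s+\d+(?:\.\d+)*\.)"
)

# Shortened sample; the second line has no \n between page 30 and the next entry.
s = (
    "30.1.3.1. Boite à eau..... 29 \n"
    "30.1.3.3. Lanterneau de désenfumage.....30 30.1.3.4. Etanchéité résine.....31"
)

result = dict(keep_dots.findall(s))
for number, label_with_dots in result.items():
    print(number, repr(label_with_dots))
```

Any whitespace between the dots and the page number is matched by the \s* outside the group, so group 2 ends on the last dot.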
    

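
The question also asks for each key to be associated with the description found later in the document. One possible way to do that, sketched here under assumptions, is to take the keys from the summary with the first pattern and then cut the body text at each heading. The function name build_dict is hypothetical. The sketch assumes that the summary and the body are already available as two separate strings, and that each heading appears in the body verbatim as 'X.Y.Z. Label', in the same order as in the summary.

```python
import re

# First pattern from the answer: number in group 1, label (without dots) in group 2.
ENTRY = re.compile(r"\b(\d+(?:\.\d+)*)\.[^\S\n]+([\s\S]*?)\.{2,}\s*\d+\b")


def build_dict(summary: str, body: str) -> dict[str, str]:
    """Map each 'X.Y.Z. Label' key to the body text under that heading (hypothetical helper)."""
    keys = [f"{num}. {label.strip()}" for num, label in ENTRY.findall(summary)]

    # Locate each heading in the body, searching forward so the order is kept.
    found = []  # (key, start, end) for every heading that was located
    pos = 0
    for key in keys:
        start = body.find(key, pos)
        if start != -1:
            found.append((key, start, start + len(key)))
            pos = start + len(key)

    # Headings that were not found keep an empty description.
    result = {key: "" for key in keys}
    for i, (key, _, end) in enumerate(found):
        next_start = found[i + 1][1] if i + 1 < len(found) else len(body)
        result[key] = body[end:next_start].strip()
    return result


summary = (
    "30. LOT 03 CLOS COUVERT.................... 29 \n"
    "30.1.1.1. Boite à eau.................... 29 \n"
    "30.1.1.2. Descentes d’eaux pluviales en façades .................30 \n"
)
body = (
    "30. LOT 03 CLOS COUVERT\n"
    "30.1.1.1. Boite à eau \n"
    "Composition : -Descente d’eaux pluviales \n"
    "o en zinc naturel, épaisseur 0.80 mm \n"
    "30.1.1.2. Descentes d’eaux pluviales en façades \n"
    "Fourniture et pose de descentes d'eaux en zinc extérieures type VM Zinc\n"
)

result = build_dict(summary, body)
for key, description in result.items():
    print(repr(key), "->", repr(description))
```

If the PDF conversion changes the spacing between the summary and the body headings, the exact str.find lookup will miss them, and a whitespace-tolerant search would be needed instead.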