python regex text markdown text-extraction

How to extract text for "# Heading level 1" (header and its paragraphs) from markdown string/document with python?


I need to extract the text (the header and its paragraphs) under a level 1 header whose title is passed to a Python function. Below is an example of the Markdown text I'm working with:

# My first header

## Nec sic igni ad ad aventi

Lorem markdownum quantumque nunc, fine superi sagittis, haut regalis attollo,
ora inferius, mensor deam? Sedili quoque tauri. Quo limite ducem.

1. Arva fecit partes tosta
2. Insignia est ausae ut ut ait
3. O summa saepe

Sic ipsos, Phlegethontide nisi poterat neque quos tum partes rapitur. Filius
utraque: glande, ut exiles terram fiducia coeunt. Et caelo legit multis,
plangorem altoque; et iamque nec. Sanguine corpora prora quicquid insolida in
Parin: stupet est posses nos mater temptat, gemit num.

# My second header

## Primordia metuam his dixerat talaria cognoscenda

Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
Hyperionis, omnibus aesculus signa medendi. Aspiciunt rigidique finibus ducunt
postquam, huic postera lignum, properent.

- Nostro purgamina capitque longis
- Virtus suo moenibus
- Byblida longum pudibunda referre
- Via in ab vulneribus petita mirantur quamquam
- Et vela
- Nondum sacer meminisse Dircen novas dumque

For example, I need to extract all the text under the header "My second header" from the text above.

I'm trying with a regular expression, but I haven't found a correct pattern to solve my problem.

import re

def findHeader(search):
    r = re.compile(r"the regular expression")  # placeholder: this is the pattern I'm missing
    print(r.findall(text))

findHeader("My second header")

Expected findHeader output:

# My second header

## Primordia metuam his dixerat talaria cognoscenda

Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
Hyperionis, omnibus aesculus signa medendi. Aspiciunt rigidique finibus ducunt
postquam, huic postera lignum, properent.

- Nostro purgamina capitque longis
- Virtus suo moenibus
- Byblida longum pudibunda referre
- Via in ab vulneribus petita mirantur quamquam
- Et vela
- Nondum sacer meminisse Dircen novas dumque

Solution

  • This does the job:

    import re
    
    text = """
    # My first header
    
    ## Nec sic igni ad ad aventi
    
    Lorem markdownum quantumque nunc, fine superi sagittis, haut regalis attollo,
    ora inferius, mensor deam? Sedili quoque tauri. Quo limite ducem.
    
    1. Arva fecit partes tosta
    2. Insignia est ausae ut ut ait
    3. O summa saepe
    
    Sic ipsos, Phlegethontide nisi poterat neque quos tum partes rapitur. Filius
    utraque: glande.
    
    # My second header
    
    ## Primordia metuam his dixerat talaria cognoscenda
    
    Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
    Hyperionis, omnibus aesculus signa medendi.
    
    - Nostro purgamina capitque longis
    - Virtus suo moenibus
    
    # My third header
    
    ## Primordia metuam his dixerat talaria cognoscenda
    
    Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
    postquam, huic postera lignum, properent.
    
    """
    def findHeader(search):
        # (?s) comes first: Python 3.11+ rejects global inline flags placed mid-pattern.
        # re.escape() keeps characters such as . or ( in the header from being read as regex syntax.
        r = re.compile(r"(?s)(?<!#)# " + re.escape(search) + r"(?:(?!(?<!#)# ).)+")
        return r.findall(text)

    print(findHeader("My second header"))


    Output:

    ['# My second header\n\n## Primordia metuam his dixerat talaria cognoscenda\n\nLorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque\nHyperionis, omnibus aesculus signa medendi.\n\n- Nostro purgamina capitque longis\n- Virtus suo moenibus\n\n']


    Explanation:

    r"          # raw string
        (?s)        # inline flag: . also matches newline
        (?<!#)      # negative lookbehind: make sure there is no # just before
        #           # a # and a space
    "           # end string
    +           # concat
        re.escape(search)   # header to be searched, escaped so it is matched literally
    +           # concat
    r"          # raw string
        (?:         # non-capturing group (tempered greedy token)
            (?!         # negative lookahead: make sure that what follows is not:
                (?<!#)      # negative lookbehind: make sure there is no # just before
                #           # a # and a space
            )           # end lookahead
            .           # any character, including newline
        )+          # end group, may appear 1 or more times
    "           # end string
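
    If a regex is not a hard requirement, the same extraction can be sketched as a plain line scan. This is a different technique from the pattern above, not a rewrite of it. The names find_section and sample are made up for illustration, and the sketch assumes that every line starting with "# " is a level 1 header, so a "# comment" line inside a fenced code block would be mistaken for one.

```python
def find_section(markdown_text, title):
    """Return the level-1 section whose heading is exactly `title`, or None."""
    collected = []
    inside = False
    for line in markdown_text.splitlines(keepends=True):
        # Assumption: a level-1 heading is any line starting with one hash and a space.
        if line.startswith("# "):
            if inside:
                break  # the next level-1 heading ends the wanted section
            inside = line[2:].strip() == title
        if inside:
            collected.append(line)
    return "".join(collected) if collected else None


# Hypothetical sample document, much shorter than the one in the question.
sample = "# One\n\ntext a\n\n# Two\n\n## Sub\n\ntext b\n\n# Three\n\ntext c\n"
print(find_section(sample, "Two"))
```

    Unlike the regex, this compares the whole header line with the title, so searching for "My second" would not return the "My second header" section, and it returns a single string (or None) instead of a list.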