I need to extract the text (a header and its paragraphs) that matches a level-1 header string passed to a Python function. Below is an example of the Markdown text I'm working with:
# My first header
## Nec sic igni ad ad aventi
Lorem markdownum quantumque nunc, fine superi sagittis, haut regalis attollo,
ora inferius, mensor deam? Sedili quoque tauri. Quo limite ducem.
1. Arva fecit partes tosta
2. Insignia est ausae ut ut ait
3. O summa saepe
Sic ipsos, Phlegethontide nisi poterat neque quos tum partes rapitur. Filius
utraque: glande, ut exiles terram fiducia coeunt. Et caelo legit multis,
plangorem altoque; et iamque nec. Sanguine corpora prora quicquid insolida in
Parin: stupet est posses nos mater temptat, gemit num.
# My second header
## Primordia metuam his dixerat talaria cognoscenda
Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
Hyperionis, omnibus aesculus signa medendi. Aspiciunt rigidique finibus ducunt
postquam, huic postera lignum, properent.
- Nostro purgamina capitque longis
- Virtus suo moenibus
- Byblida longum pudibunda referre
- Via in ab vulneribus petita mirantur quamquam
- Et vela
- Nondum sacer meminisse Dircen novas dumque
For example, I need to extract all the text under the header "My second header" from the text above.
I'm trying to do this with a regular expression, but I haven't found a correct pattern that solves my problem.
def findHeader(header):
    r = re.compile(r"the regular expression")  # placeholder for the pattern I'm missing
    print(r.findall(text))
findHeader("My second header")
Expected findHeader output:
# My second header
## Primordia metuam his dixerat talaria cognoscenda
Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
Hyperionis, omnibus aesculus signa medendi. Aspiciunt rigidique finibus ducunt
postquam, huic postera lignum, properent.
- Nostro purgamina capitque longis
- Virtus suo moenibus
- Byblida longum pudibunda referre
- Via in ab vulneribus petita mirantur quamquam
- Et vela
- Nondum sacer meminisse Dircen novas dumque
This does the job:
import re
text = """
# My first header

## Nec sic igni ad ad aventi

Lorem markdownum quantumque nunc, fine superi sagittis, haut regalis attollo,
ora inferius, mensor deam? Sedili quoque tauri. Quo limite ducem.

1. Arva fecit partes tosta
2. Insignia est ausae ut ut ait
3. O summa saepe

Sic ipsos, Phlegethontide nisi poterat neque quos tum partes rapitur. Filius
utraque: glande.

# My second header

## Primordia metuam his dixerat talaria cognoscenda

Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
Hyperionis, omnibus aesculus signa medendi.

- Nostro purgamina capitque longis
- Virtus suo moenibus

# My third header

## Primordia metuam his dixerat talaria cognoscenda

Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
postquam, huic postera lignum, properent.
"""
def findHeader(search):
    # (?s) sits at the very start: Python 3.11+ rejects global flags placed mid-pattern.
    # re.escape keeps any regex metacharacters in the header text literal.
    r = re.compile(r"(?s)(?<!#)# " + re.escape(search) + r"(?:(?!(?<!#)# ).)+")
    return r.findall(text)

print(findHeader("My second header"))
Output:
['# My second header\n\n## Primordia metuam his dixerat talaria cognoscenda\n\nLorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque\nHyperionis, omnibus aesculus signa medendi.\n\n- Nostro purgamina capitque longis\n- Virtus suo moenibus\n\n']
Explanation:
r" # raw string
(?s) # flag: . also matches newlines (must be the first thing in the pattern)
(?<!#) # negative lookbehind: make sure there is no # immediately before
# # a # and a space
" # end string
+ # concat
re.escape(search) # header to be searched, with regex metacharacters escaped
+ # concat
r" # raw string
(?: # non-capturing group (tempered greedy token)
(?! # negative lookahead: make sure what follows is not:
(?<!#) # negative lookbehind: make sure there is no # immediately before
# # a # and a space (i.e. the next level-1 header)
) # end lookahead
. # any character, including newline
)+ # end group, may appear 1 or more times
" # end string