Tags: python, regex, text, text-extraction

Regular expression to extract chunks of text from a text file?


I need to extract headings, and the chunk of text beneath each one, from a text file in Python using a regular expression, but I'm finding it difficult.

I converted the PDF to text (an excerpt of the result is included in the edit below).

So far I have been able to get all the numerical headers (12.4.5.4, 12.4.6, 13, 13.1, 13.1.1, 13.1.2) using the following regex:

import re

with open('data/single.txt', encoding='UTF-8') as file:
    for line in file:
        # Match a dotted section number at the start of the line, e.g. 13.1.2
        headings = re.findall(r'^\d+(?:\.\d+)*\.?', line)
        print(headings)

I just don't know how to get the worded part of those headings or the paragraph of text beneath them.

EDIT - Here is the text:

I.S. EN 60601-1:2006&A1:2013&AC:2014&A12:2014

60601-1 © IEC:2005 60601-1 © IEC:2005

– 337 – – 169 –

12.4.5.4 Other ME EQUIPMENT producing diagnostic or therapeutic radiation When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the RISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than for diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3).

Compliance is checked by inspection of the RISK MANAGEMENT FILE.

12.4.6 Diagnostic or therapeutic acoustic pressure When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the RISKS associated with diagnostic or therapeutic acoustic pressure.

Compliance is checked by inspection of the RISK MANAGEMENT FILE.

13 * HAZARDOUS SITUATIONS and fault conditions

13.1 Specific HAZARDOUS SITUATIONS

  • General

13.1.1 When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a time, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the ME EQUIPMENT.

The failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is described in 4.7.

  • Emissions, deformation of ENCLOSURE or exceeding maximum temperature

13.1.2 The following HAZARDOUS SITUATIONS shall not occur: – emission of flames, molten metal, poisonous or ignitable substance in hazardous

quantities;

– deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired; –

temperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when measured as described in 11.1.3; temperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be touched, exceeding the allowable values in Table 23 when measured and adjusted as described in 11.1.3;

– exceeding the allowable values for “other components and materials” identified in Table 22 times 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31. In all other cases, the allowable values of Table 22 apply.

Temperatures shall be measured using the method described in 11.1.3.

The SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of flames, molten metal or ignitable substances, shall not be applied to parts and components where: – The construction or the supply circuit limits the power dissipation in SINGLE FAULT

CONDITION to less than 15 W or the energy dissipation to less than 900 J.


Solution

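  • You could use your pattern, match a space after it, and then match the rest of the line.

    Then repeatedly match every following line that does not start with a heading. Because the pattern uses the ^ anchor, enable the multiline flag (re.MULTILINE in Python) and apply it to the whole text rather than line by line.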

    ^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*
    
    • ^\d+(?:\.\d+)* followed by a space: your pattern to match a heading number, then a literal space
    • .* Match any char except a newline 0+ times
    • (?: Non capturing group
      • \r?\n Match a newline
      • (?! Negative lookahead, assert what is directly to the right is not
        • \d+(?:\.\d+)* followed by a space: the heading pattern
      • ) Close lookahead
      • .* Match any char except a newline 0+ times
    • )* Close the non capturing group and repeat 0+ times to match all the lines
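As a minimal sketch of how the pattern above could be applied in Python: the capture groups, the extract_sections helper, and the sample string are illustrative additions, not part of the original pattern. The sample lines are copied from the text in the question. The pattern is compiled with re.MULTILINE so that ^ matches at the start of each line.

```python
import re

# The answer's pattern with three capture groups added (an assumption made
# for illustration): section number, rest of the heading line, following lines.
HEADING_BLOCK = re.compile(
    r'^(\d+(?:\.\d+)*) (.*)((?:\r?\n(?!\d+(?:\.\d+)* ).*)*)',
    re.MULTILINE,
)


def extract_sections(text):
    """Return (number, heading_line, body) tuples for each numbered block."""
    sections = []
    for match in HEADING_BLOCK.finditer(text):
        number, heading_line, body = match.groups()
        sections.append((number, heading_line.strip(), body.strip()))
    return sections


# A few lines taken from the text in the question.
sample = (
    "12.4.6 Diagnostic or therapeutic acoustic pressure When applicable, "
    "the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the RISKS "
    "associated with diagnostic or therapeutic acoustic pressure.\n"
    "\n"
    "Compliance is checked by inspection of the RISK MANAGEMENT FILE.\n"
    "\n"
    "13 * HAZARDOUS SITUATIONS and fault conditions\n"
    "\n"
    "13.1 Specific HAZARDOUS SITUATIONS\n"
)

for number, heading_line, body in extract_sections(sample):
    print(repr(number), repr(heading_line), repr(body))
```

To run this on the real file, read it as one string (for example with file.read()) instead of iterating line by line, since the pattern has to span several lines. In this text the heading title and the first sentence often share a line, so the second group may hold both. Any body line that happens to begin with a number followed by a space would also be treated as a new heading.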
