Tags: python, regex, text-mining, text-extraction, string-parsing

Python extract paragraph in text file using regex


I am using Python 3.7 and I am trying to extract certain paragraphs from some text files using regex.

Here is a sample of the txt file content.

AREA: OMBEYI MARKET, ST. RITA RAMULA

DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 5.00 P.M.

Ombeyi Mk, Kiliti Mkt, Masogo Mkt, Miwani, Kasongo, Onyango Midika, St. Rita Ramula, Onyalo
Biro, Yawo Pri, Obino, Rutek, Keyo Pri & adjacent customers.

AREA: NYAMACHE FACTORY

DATE: Thursday 25.03.2021, TIME: 830 A.M. - 3.00 P.M.

Nyamache Fact, Suguta, Gionseri, Igare, Kionduso, Nyationgongo, Enchoro, Kebuko, Emenwa, Maji
Mazuri, Borangi & adjacent customers.

AREA: SUNEKA MARKET, RIANA MARKET

DATE: Thursday 25.03.2021, TIME: 8.00 A.M. - 3.00 P.M.

Suneka Mk, Riana Mk, Kiabusura, Gesonso, Chisaro, Sugunana, Nyamira Ndogo & adjacent
customers.

AREA: ITIATI, GITUNDUTI

DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 2.00 P.M.

General China, Gachuiro, Gathuini Pri, Itiati Campus, Kianjugum, Gikore, Kihuri TBC, Gitunduti &
adjacent customers.

Currently I am able to extract the Area, Date and Time using regex:

area_pattern = re.compile("^AREA:((.*))")
date_pattern = re.compile("^DATE:(.*),")
time_pattern = re.compile("TIME:(.*).")

I would like to extract the paragraph that comes after each DATE/TIME line and before the next AREA line; it contains locations separated by commas. The regex should match the following:

1.
Ombeyi Mk, Kiliti Mkt, Masogo Mkt, Miwani, Kasongo, Onyango Midika, St. Rita Ramula, Onyalo
Biro, Yawo Pri, Obino, Rutek, Keyo Pri & adjacent customers.

2.
Nyamache Fact, Suguta, Gionseri, Igare, Kionduso, Nyationgongo, Enchoro, Kebuko, Emenwa, Maji
Mazuri, Borangi & adjacent customers.

3.
Suneka Mk, Riana Mk, Kiabusura, Gesonso, Chisaro, Sugunana, Nyamira Ndogo & adjacent
customers.

4.
General China, Gachuiro, Gathuini Pri, Itiati Campus, Kianjugum, Gikore, Kihuri TBC, Gitunduti &
adjacent customers.

If anyone could suggest a regex for this use case, as well as improvements to my current regexes, I would really appreciate it. Thanks.


Solution

  • You may use this regex, which has a single capture group, with re.findall:

    \nDATE:.*\n*((?:\n.*)+?)(?=\nAREA:|\Z)
    


    RegEx Details:

    • \nDATE:: Match a line break followed by the text DATE:
    • .*\n*: Match rest of the line followed by 0 or more line breaks
    • ((?:\n.*)+?): Capture group 1 captures our text: 1 or more lines of anything, matched lazily until the next condition is satisfied
    • (?=\nAREA:|\Z): Assert that we have a line break followed by AREA: or end of input right ahead of the current position
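
To show how the pieces above fit together, here is a minimal sketch that runs the regex through re.findall on a shortened copy of the sample from the question. The whitespace clean-up step and the names `paragraph_pattern` and `paragraphs` are my own additions. The question also asked about improving the existing patterns, so the second half shows one possible tightening of them; treat it as a suggestion under the assumption that the whole file is read into a single string, not as the only way to do it.

```python
import re

# Shortened copy of the sample from the question (two of the four records).
text = """AREA: OMBEYI MARKET, ST. RITA RAMULA

DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 5.00 P.M.

Ombeyi Mk, Kiliti Mkt, Masogo Mkt, Miwani, Kasongo, Onyango Midika, St. Rita Ramula, Onyalo
Biro, Yawo Pri, Obino, Rutek, Keyo Pri & adjacent customers.

AREA: NYAMACHE FACTORY

DATE: Thursday 25.03.2021, TIME: 830 A.M. - 3.00 P.M.

Nyamache Fact, Suguta, Gionseri, Igare, Kionduso, Nyationgongo, Enchoro, Kebuko, Emenwa, Maji
Mazuri, Borangi & adjacent customers.
"""

# The regex from the answer, unchanged, written as a raw string.
paragraph_pattern = re.compile(r"\nDATE:.*\n*((?:\n.*)+?)(?=\nAREA:|\Z)")

# Group 1 keeps the surrounding line breaks, so collapse every run of
# whitespace to a single space (my addition, not part of the answer).
paragraphs = [" ".join(p.split()) for p in paragraph_pattern.findall(text)]

for number, paragraph in enumerate(paragraphs, start=1):
    print(f"{number}. {paragraph}")

# Assumed improvements to the question's patterns (not from the answer):
# - re.MULTILINE lets ^ match at the start of every line of the whole text
# - [ \t]* skips the blank after the colon
# - [^,]* stops the date at the first comma
# - the trailing unescaped "." is dropped from the TIME pattern, because it
#   consumed the final character and cut the last "." off the captured time
area_pattern = re.compile(r"^AREA:[ \t]*(.*)", re.MULTILINE)
date_pattern = re.compile(r"^DATE:[ \t]*([^,]*),", re.MULTILINE)
time_pattern = re.compile(r"TIME:[ \t]*(.*)")

areas = area_pattern.findall(text)
dates = date_pattern.findall(text)
times = time_pattern.findall(text)
```

Since every record yields exactly one area, date, time and paragraph, `zip(areas, dates, times, paragraphs)` would give one tuple per record, provided the file always follows the layout shown in the question.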