
How can I extract a specific part of a text that is delimited by "----" using regex in Python?


'----
Airport SPQU :S16:20:25.6431  W071:34:22.3800  8338ft
Country Name="Peru"
State Name=""
City Name="Arequipa"
Airport Name="Rodriguez Ballon"
in file: ORBX\FTX_VECTOR\FTX_VECTOR_AEC\scenery\AEC_SPQU.bgl
----
Airport SPRF :S14:15:59.9484  W070:27:59.9997  14419ft
Country Name="Peru"
State Name=""
City Name="San Rafael"
Airport Name="San Rafael"
in file: Scenery\0304\scenery\APX29370.bgl
Start 12 : S14:15:40.9653  W070:28:38.3900  14419ft Hdg: 117.0T, Length 8760ft 
Start 30 : S14:16:18.9314  W070:27:21.6092  14419ft Hdg: 297.0T, Length 8760ft 
0120 Lat -14.261198 Long -70.477715 Alt 14419 Hdg 120 Len 8760 Wid 98
0300 Lat -14.272106 Long -70.455620 Alt 14419 Hdg 300 Len 8760 Wid 98
----
Airport TNCB :N12:08:25.5567  W068:16:34.3503  20ft
Country Name="Netherlands Antilles"
State Name=""
City Name="Bonaire I"
Airport Name="Flamingo"
in file: Scenery\0303\scenery\APX29270.bgl
Start 10 : N12:08:23.2891  W068:17:16.0525  20ft Hdg: 92.0T, Length 9448ft 
Start 28 : N12:08:20.1144  W068:15:43.9767  20ft Hdg: 272.0T, Length 9448ft 
0100 Lat 12.139818 Long -68.288246 Alt 20 Hdg 100 Len 9448 Wid 148
0280 Lat 12.138905 Long -68.261757 Alt 20 Hdg 280 Len 9448 Wid 148
----
Airport TNCC :N12:11:20.0649  W068:57:34.8897  29ft
Country Name="Netherlands Antilles"
State Name=""
City Name="Curacao I"
Airport Name="Willemstad-Hato Intl."
in file: Scenery\0303\scenery\APX29270.bgl
Start 11 : N12:11:30.5607  W068:58:24.9607  29ft Hdg: 102.1T, Length 11186ft 
Start 29 : N12:11:08.2410  W068:56:38.2654  29ft Hdg: 282.1T, Length 11186ft 
0110 Lat 12.191923 Long -68.974129 Alt 29 Hdg 111 Len 11186 Wid 197 ILS 111.90, Flags: GS DME BC
0290 Lat 12.185513 Long -68.943428 Alt 29 Hdg 291 Len 11186 Wid 197
----
Airport TNCE :N17:29:32.4738  W062:58:29.8992  129ft
Country Name="Netherlands Antilles"
State Name=""
City Name="St Eustatius I"
Airport Name="F.D. Roosevelt"
in file: ORBX\FTX_OLC\FTX_VECTOR_FixedAPT\scenery\APT_TNCE.BGL
Start 6 : N17:29:35.1949  W062:59:02.6666  129ft Hdg: 50.3T, Length 4268ft 
Start 24 : N17:30:00.9808  W062:58:30.1439  129ft Hdg: 230.2T, Length 4268ft 
0060 Lat 17.492956 Long -62.984272 Alt 129 Hdg 63 Len 4268 Wid 98
0240 Lat 17.500425 Long -62.974819 Alt 129 Hdg 243 Len 4268 Wid 98
----
Airport TNCM :N18:02:27.0378  W063:06:34.2595  13ft
Country Name="Netherlands Antilles"
State Name=""
City Name="St Maarten I"
Airport Name="Princess Juliana Intl"
in file: Scenery\0303\scenery\APX31250.bgl
Start 9 : N18:02:21.9843  W063:07:08.8215  13ft Hdg: 81.7T, Length 7150ft 
Start 27 : N18:02:31.8322  W063:05:57.8823  13ft Hdg: 261.7T, Length 7150ft 
0090 Lat 18.039392 Long -63.119469 Alt 13 Hdg 95 Len 7150 Wid 148
0270 Lat 18.042223 Long -63.099060 Alt 13 Hdg 275 Len 7150 Wid 148
----'

This is part of my text. I am trying to extract this part:

'----
Airport TNCB :N12:08:25.5567  W068:16:34.3503  20ft
Country Name="Netherlands Antilles"
State Name=""
City Name="Bonaire I"
Airport Name="Flamingo"
in file: Scenery\0303\scenery\APX29270.bgl
Start 10 : N12:08:23.2891  W068:17:16.0525  20ft Hdg: 92.0T, Length 9448ft 
Start 28 : N12:08:20.1144  W068:15:43.9767  20ft Hdg: 272.0T, Length 9448ft 
0100 Lat 12.139818 Long -68.288246 Alt 20 Hdg 100 Len 9448 Wid 148
0280 Lat 12.138905 Long -68.261757 Alt 20 Hdg 280 Len 9448 Wid 148
----'

I tried this regex pattern, but it matches from the beginning of the text up to the end of the part I want:

----.+?TNCB.+?----

As I said, the match runs from the very beginning of the text to the end of the expected result. The pattern correctly stops at the first "----" after "TNCB", but it does not start at the last "----" before "TNCB". How can I fix that, so that the match begins at the "----" immediately preceding "TNCB"?

import re

airport_tuple =  ('TNCB','RPUJ','00IS','WALQ')

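def read_text():
    # Read the file as a single string so the original line breaks stay intact.
    # Joining readlines() with " " would prefix every line after the first with
    # a space, and the pattern's "^----\nAirport" would then never match.
    with open("symbols.txt", "r") as f:
        return f.read()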

def main():
    text = read_text()
    print(re.findall(r"(?m)^----\n(Airport\s+TNCB.*(?:\n.*)*?)\n----", text))
    

if __name__ == "__main__":
    main()

Solution

  • You can use

    (?m)^----\n(Airport\s+TNCB.*(?:\n.*)*?)\n----
    


    Details:

    • (?m)^ - start of a line ((?m) is equivalent to re.M/re.MULTILINE)
    • ----\n - ---- and a newline
    • (Airport\s+TNCB.*(?:\n.*)*?) - Group 1:
      • Airport\s+TNCB - Airport, one or more whitespace characters, TNCB
      • .* - the rest of the line
      • (?:\n.*)*? - zero or more occurrences (as few as possible) of a newline and then the rest of the line
    • \n---- - a newline and ---- substring.
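The breakdown above can be checked against the original attempt from the question. The sketch below is only an illustration: the sample text is a shortened stand-in for the real data, and it assumes the original pattern was run with re.DOTALL (re.S), since otherwise `.` could not cross line breaks at all. A lazy `.+?` only limits how far the match extends to the right; the engine still starts at the leftmost `----` from which a match is possible, which is why the original attempt swallows the earlier records.

```python
import re

# Shortened stand-in for the question's text (two records only).
text = (
    "----\n"
    "Airport SPRF :S14:15:59.9484  W070:27:59.9997  14419ft\n"
    'City Name="San Rafael"\n'
    "----\n"
    "Airport TNCB :N12:08:25.5567  W068:16:34.3503  20ft\n"
    'City Name="Bonaire I"\n'
    "----\n"
)

# Original attempt (re.S assumed): the match begins at the first "----" in the
# text, and the lazy ".+?" keeps expanding across records until it meets "TNCB".
loose = re.search(r"----.+?TNCB.+?----", text, re.S)

# Pattern from the answer: "Airport TNCB" must directly follow a "----" line,
# so the match can only begin at the delimiter right before that record.
strict = re.search(r"(?m)^----\n(Airport\s+TNCB.*(?:\n.*)*?)\n----", text)

print(loose.group(0).count("Airport"))  # how many records the loose match spans
print(strict.group(1))                  # only the TNCB record
```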

    In Python, you can use

    re.findall(r'^----\n(Airport\s+TNCB.*(?:\n.*)*?)\n----', text, re.M)
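The question's script also defines a tuple of several airport codes, so the same idea may need to cover more than one code. The following is a minimal sketch under stated assumptions, not part of the original answer: the helper name `extract_airport_blocks` is hypothetical, and the sample text is a shortened stand-in for the real file. It swaps the trailing `\n----` for a lookahead so that the closing delimiter is not consumed and can still open the next record, which matters when two requested airports are adjacent.

```python
import re

def extract_airport_blocks(text, codes):
    """Return {code: block} for each requested airport code found in text.

    Hypothetical helper for illustration; assumes `codes` is non-empty.
    """
    # Escape each code so it is matched literally, then build an alternation.
    alternation = "|".join(re.escape(code) for code in codes)
    # (?=\n----) checks for the closing delimiter without consuming it, so the
    # same "----" line can act as the opening delimiter of the next record.
    pattern = re.compile(
        rf"^----\n(Airport\s+({alternation})\b.*(?:\n.*)*?)(?=\n----)",
        re.M,
    )
    return {m.group(2): m.group(1) for m in pattern.finditer(text)}

# Shortened stand-in for the question's text (three records).
sample = (
    "----\n"
    "Airport SPRF :S14:15:59.9484  W070:27:59.9997  14419ft\n"
    'City Name="San Rafael"\n'
    "----\n"
    "Airport TNCB :N12:08:25.5567  W068:16:34.3503  20ft\n"
    'City Name="Bonaire I"\n'
    "----\n"
    "Airport TNCC :N12:11:20.0649  W068:57:34.8897  29ft\n"
    'City Name="Curacao I"\n'
    "----\n"
)

blocks = extract_airport_blocks(sample, ("TNCB", "TNCC", "RPUJ"))
for code, block in blocks.items():
    print(code)
    print(block)
```

Codes that do not occur in the text (RPUJ in this sample) are simply absent from the returned dictionary.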