
Enclose Multiline text inside quotes for a specific Regular Expression Capture Group


My input is text from different sources and the only consistent thing is that they all contain a code, latitude and longitude. This is sometimes followed by useful notes.

From this input, the aim is to produce a CSV format with the header row: Code,Name,Latitude,Longitude,Notes,URL,Comments

Only the Code, Latitude, Longitude and any Notes are required.

Using PCRE2, the substitution string is Code,,Latitude,Longitude,"Notes",,

$1,,$2,$3,"$4",,

My RE is almost doing what I want

(?mi)^.*?((?:\bGC)[A-Z0-9-]{1,10}).*?([N|S]\s?\d{1,2}°?\s+\d{1,2}\.\d{1,3}'?).*?([E|W]\s?\d{1,3}°?\s+\d{1,2}\.\d{1,3}'?)\s(.*)

You can check my work so far in regex101.

Focusing on part of the output in the example

GC8DH0G,,N 50° 50.456',W 001° 10.456',"Text to capture",,
Including line breaks and text, until the next GCcode
GC123GHF,,N 50 50.789,W 001 10.789,"etc.",,

The only additional requirement is to include all the multilines of text that follow the Longitude, up until the following GC code, also enclosed in quoted capture group 4.

So the above output would become

GC8DH0G,,N 50° 50.456',W 001° 10.456',"Text to capture
Including line breaks and text, until the next GCcode",,
GC123GHF,,N 50 50.789,W 001 10.789,"etc.",,

That is, all the text, including line breaks (\n or \r\n), enclosed in quotes.
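To reproduce the gap outside regex101, here is a minimal Python sketch of the current behaviour. The sample input is an assumption reconstructed from the output excerpt above (the real input is only available through the regex101 link). Python's re module stands in for PCRE2, so the replacement uses \1 where the question uses $1, and ° is written as \u00b0.

```python
import re

# The pattern from the question; only "°" is spelled as \u00b0.
PATTERN = re.compile(
    r"(?mi)^.*?((?:\bGC)[A-Z0-9-]{1,10}).*?"
    r"([N|S]\s?\d{1,2}\u00b0?\s+\d{1,2}\.\d{1,3}'?).*?"
    r"([E|W]\s?\d{1,3}\u00b0?\s+\d{1,2}\.\d{1,3}'?)\s(.*)"
)

# Hypothetical input, reconstructed from the output excerpt in the question.
SAMPLE = (
    "GC8DH0G N 50\u00b0 50.456' W 001\u00b0 10.456' Text to capture\n"
    "Including line breaks and text, until the next GCcode\n"
    "GC123GHF N 50 50.789 W 001 10.789 etc."
)


def substitute(text):
    # \1..\4 are Python's spelling of $1..$4 in the substitution string.
    return PATTERN.sub(r'\1,,\2,\3,"\4",,', text)


print(substitute(SAMPLE))
```

Because . does not match a line break, group 4 stops at the end of the first line, so the second line of the note is left outside the quotes.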


Solution

  • Try this pattern, which uses PCRE free-spacing mode and subroutine definitions. I borrowed your Latitude and Longitude definitions (changing ° to \x{00b0}, as suggested by @Reilas).

    (?x) # free-spacing mode for better readability
    (?(DEFINE) # define named subpatterns; nothing in this block matches on its own
      (?P<code_pattern> # GC code pattern
        GC # GC ...
        # ... followed by capital letters, digits and "-", 1 through 10 times
        [A-Z0-9-]{1,10}
      ) # end code_pattern
      # pattern to match lat and long locations
      (?P<location_pattern>
        \s? # optional whitespace character
        \d{1,3} # 1 to 3 digits (degrees)
        \x{00b0}? # optional degree sign "°" (U+00B0)
        \s+ # one or more whitespace characters
        \d{1,2} # 1 or 2 digits (minutes)
        \. # literal "."
        \d{1,3} # 1 to 3 digits (decimal part of the minutes)
        '? # optional minutes mark
      ) # end location_pattern
      (?P<latitude_pattern>
        [NS] # "N" or "S"
        (?P>location_pattern)
      )
      # longitude, analogous to latitude
      (?P<longitude_pattern>
        [EW] # "E" or "W"
        (?P>location_pattern)
      )
      # lookahead asserting that the rest of the current line contains no code
      (?P<check_for_code>
        (?!.*(?P>code_pattern))
      )
      # pattern to match notes
      (?P<notes_pattern>
        # a note is only eligible if the rest of the line contains no code
        (?P>check_for_code)
        .* # matches the rest of the line
        (?: # possibly ...
          \n # ... match following lines ...
          (?P>check_for_code) # ... as long as they contain no code
          .* # matches the whole line
        )*
      ) # end notes_pattern
    ) # end definitions
    
    #### ACTUAL PATTERN ####
    
    # match unwanted initial characters so they can be replaced by an empty string
    .* 
    (?P<Code>(?P>code_pattern)) # capture Code
    .*? # followed (lazily) by anything
    (?P<Latitude>(?P>latitude_pattern)) # capture Latitude
    .*? # followed (lazily) by anything
    (?P<Longitude>(?P>longitude_pattern)) # capture Longitude
    \s? # followed by optional whitespace
    (?P<Notes>(?P>notes_pattern))? # capture Notes (optional)
    

    replace with:

    $Code,,$Latitude,$Longitude,"$Notes",,\n
    

    See https://regex101.com/r/2a7AG7/latest
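For readers working outside PCRE, the same idea can be sketched in Python. This is an approximation and not the pattern above verbatim: Python's re module has no (?(DEFINE)...) block or (?P>name) subroutine calls, so the named fragments are composed as ordinary strings instead. The constant names, the to_csv_rows helper and the sample input are all assumptions made for illustration; the helper also strips surrounding whitespace from the notes and returns rows without the trailing \n.

```python
import re

# Fragments mirroring the DEFINE block, joined by string composition.
CODE = r"GC[A-Z0-9-]{1,10}"
LOCATION = r"\s?\d{1,3}\u00b0?\s+\d{1,2}\.\d{1,3}'?"
LATITUDE = "[NS]" + LOCATION
LONGITUDE = "[EW]" + LOCATION
NO_CODE = rf"(?!.*{CODE})"  # rest of the current line holds no code
NOTES = rf"{NO_CODE}.*(?:\n{NO_CODE}.*)*"  # lines up to the next code line

RECORD = re.compile(
    rf".*(?P<Code>{CODE})"
    rf".*?(?P<Latitude>{LATITUDE})"
    rf".*?(?P<Longitude>{LONGITUDE})"
    rf"\s?(?P<Notes>{NOTES})?"
)


def to_csv_rows(text):
    """Return one CSV row per record; notes may span several lines."""
    rows = []
    for m in RECORD.finditer(text):
        # Notes is optional, so the group can be None.
        notes = (m.group("Notes") or "").strip()
        rows.append(
            f'{m["Code"]},,{m["Latitude"]},{m["Longitude"]},"{notes}",,'
        )
    return rows


# Hypothetical input, reconstructed from the output shown in the question.
sample = (
    "GC8DH0G N 50\u00b0 50.456' W 001\u00b0 10.456' Text to capture\n"
    "Including line breaks and text, until the next GCcode\n"
    "GC123GHF N 50 50.789 W 001 10.789 etc.\n"
)
print("\n".join(to_csv_rows(sample)))
```

Without the case-insensitive flag, "GCcode" in the note does not count as a code, so the second line stays inside the quoted Notes field. Notes that themselves contain double quotes would still need escaping before the rows are valid CSV.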

    If you need to get rid of empty lines in the CSV, further substitute ^$\n with an empty string (with multiline mode enabled, so that ^ and $ match at line boundaries).

    See https://regex101.com/r/Mj6rH7/latest
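The clean-up step can be sketched the same way in Python. The function name is hypothetical; (?m) turns on multiline mode so that ^ and $ match at every line boundary and not only at the start and end of the text.

```python
import re


def drop_empty_lines(csv_text):
    # An empty line is a line start, immediately a line end, then the newline.
    return re.sub(r"(?m)^$\n", "", csv_text)


print(drop_empty_lines('A,,1,2,"x",,\n\nB,,3,4,"y",,\n'))
```

Empty lines appear because the replacement string appends \n while the line break that originally separated two records is not part of the match and so stays in the text.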