
How to Exit Sectional Pattern Matching in Sed if Any One of Several Other Section Headers is Encountered?


I'm using sed to inline edit a specific entry in a specific section of an open-standard multi-section, space-separated file which encodes certain numeric constants.

I have a working expression to do this, but I additionally want it to bail out if it reaches another section heading without finding a match for the inner pattern, as the sections could in theory according to the standard be out of order and the label/pattern I'm looking for could match other sections of the file.

In an abstracted version of the file specification, each section begins with a header drawn from a list of keyword strings, e.g. PLANES, THE_TRAINS, AN_AUTOMOBILE, BUSES, SUBMARINES. To be identified, the header keyword string must be at the start of the line and must be followed by a whitespace character (space or tab). There may be additional space-separated, section-specific parameters on that line or the next line, although most sections don't have them. Blank lines are ignored, and thus may be used to improve readability, but cannot be assumed. Anything after '!' or '*' is assumed to be a comment. Within a section, a set of constants for some given combination of N common attribute keywords (e.g. small, medium, big, huge) is defined by the numeric constants (e.g. ## or ##.###) that follow. The attribute keywords are used across multiple sections, but can't be guaranteed to be found within a specific section.

An example is:

*
* Header comments
*

PLANES
!
! COMMENTS
!
BIG MEDIUM ##.### ##.###
BIG SMALL ##.### ##.###
...
SMALL SMALL ##.### ##.###
THE_TRAINS
!
! COMMENTS
!
MEDIUM MEDIUM SMALL ##.### ##.### ! COMMENT STUFF
MEDIUM SMALL SMALL ##.### ##.### ! COMMENT STUFF
...
BIG BIG BIG ##.### ##.###



AN_AUTOMOBILE 0.1 shift red
!
! COMMENTS
SMALL SMALL SMALL SMALL ##.### ##.### ##.###
SMALL MEDIUM SMALL SMALL ##.### ##.### ##.### ! COMMENT STUFF
...
BIG BIG BIG SMALL ##.### ##.### ##.###

BUSES
SMALL ##.### ##.### ## !
MEDIUM ##.### ##.### ## !
...
LARGE ##.### ##.### ## !
SUBMARINES
SMALL ##.### ##.### ## !
MEDIUM ##.### ##.### ## !
...
HUGE ##.### ##.### ## !

Anything after a * or a ! is considered a comment in the file standard.

Sections are defined by encountering a keyword followed by a space. After that there may be section-specific variables (see shift in the edited example), but eventually every section has a list of numeric constants preceded by some set of N identifiers, which are common to all the sections.

White space between sections or between lines within a section is arbitrary and can be added for readability, but can not be assumed.

If the ordering is the same as my current file, the pattern:

sed -i '/SUBMARINES/{:keep_reading;n;
    /^MEDIUM.*$/!bkeep_reading
    s/^MEDIUM.*$/DERP/
    }' file.dat

...works.

In case my intended action is unclear from the above expression: my goal is to replace some pattern (e.g. ^MEDIUM.*$) within the subsection headed by some given keyword (e.g. SUBMARINES[ \t]). In the example I simply replace the whole matching line with DERP. In the real implementation I'd do an implementation-specific substitution, but I already know how to do that, and its details are irrelevant to the question of how to use sed's built-in micro-language to reach that line while exiting if another subsection is encountered before a match is found in the target subsection.

But again, it will likely break if the sections are out of order: if I try to replace HUGE in BUSES, where it doesn't occur, sed will continue on into the next subsection, SUBMARINES, and replace the HUGE line there.

How do I bail out if I encounter any of these other section headings/subheadings (i.e. PLANES, BUSES, AN_AUTOMOBILE, and THE_TRAINS) after I've encountered the given section-heading keyword followed by a space/tab (i.e. SUBMARINES[ \t])?

This would prevent replacing the line beginning with HUGE in SUBMARINES when my intent was to only replace the line beginning with HUGE if it was found in BUSES.


Edit 1:

I think something like:

sed -i '/BUSES/{:keep_reading;n; /^HUGE.*$/!bkeep_reading /PLANES/\|/THE_TRAINS/\|/AN_AUTOMOBILE/\|/SUBMARINES/q s/^HUGE.*$/DERP/g }' file.dat

... could work, but that expression gives the error:

sed: -e expression #1, char 60: unknown command: `\'


Edit 2:

I have a semi-working solution:

sed -i '/BUSES/{:keep_reading;n; /^PLANES[ \t]\|^THE_TRAINS[ \t]\|^AN_AUTOMOBILE[ \t]\|^BUSES[ \t]/q; /^HUGE.*$/!bkeep_reading;  s/^HUGE.*$/DERP/g; }' file.dat

But I realize now that both of my previous solutions would actually delete any lines after HUGE when inline editing. I didn't realize this because the label I was matching happened to be the last line in the file.

The above pattern exits correctly, but truncates the remainder of the file. This seems like a simple fix -- how do I leave the rest of the file as is?


Also, given this additional syntax, is there a better tool to use from the command line (e.g. Perl, Python, etc.)?


Solution

    Before finding the comment after the answer by Kenavoz

    If you want to change lines that start LBL_B1 to read DERP in blocks beginning SUB_HEADING_II or SUB_HEADING_IV (but not SUB_HEADING_III), then this does the job in any version of sed (though it doesn't overwrite the original file):

     sed '/^SUB_HEADING_I[IV]$/,/^$/ s/^LBL_B1.*/DERP/'
    

    For lines in the range of subheading II or IV (I used coincidental compactness of notation) up to a blank line (or EOF), replace any instance of LBL_B1 at the start of a line (plus anything after it) with DERP.
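
To see the effect of that range, here is a small self-contained check. The sample data is made up for illustration (it only reuses the placeholder names above) and assumes each block ends with a blank line.

```shell
# Hypothetical sample: two blocks, each terminated by a blank line.
sample() {
    printf '%s\n' \
        'SUB_HEADING_II' 'LBL_A1 1.0' 'LBL_B1 2.0' '' \
        'SUB_HEADING_III' 'LBL_B1 3.0' ''
}

# Only the LBL_B1 line inside the SUB_HEADING_II block is rewritten;
# the one under SUB_HEADING_III lies outside every matched range.
result=$(sample | sed '/^SUB_HEADING_I[IV]$/,/^$/ s/^LBL_B1.*/DERP/')
printf '%s\n' "$result"
```

The LBL_B1 line under SUB_HEADING_III comes through untouched because the bracket expression [IV] matches exactly one character, so SUB_HEADING_III never opens a range.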

    If the sub-headings are more diverse, then:

    sed -e '/^SUB_HEADING_IV$/,/^$/ s/^LBL_B1.*/DERP/' \
        -e '/^DIVERSITY_REIGNS$/,/^$/ s/^LBL_B1.*/DERP/'
    

    If you activate extended regular expressions (-r in GNU sed, -E in BSD or Mac OS X sed), then you could use (BSD notation, but the only difference here is -E vs -r):

    sed -E '/^(SUB_HEADING_IV|DIVERSITY_REIGNS)$/,/^$/ s/^LBL_B1.*/DERP/'
    

    This assumes that there are no comments on the sub-heading lines. If comments are possible, you have to work harder on the regex identifying the start lines:

    sed -E '/^(SUB_HEADING_IV|DIVERSITY_REIGNS)( *!.*)?$/,/^$/ s/^LBL_B1.*/DERP/'
    

    I'm not clear if * can be used to start a 'tail comment'; if so, replace ! with [!*].

    After finding the comment after the answer by Kenavoz

    The subheading is distinguished by some small set of keywords. The file format specifies that white space lines are simply ignored, so you can't count on them being there or not being there. To make matters a bit more confusing one of the subheading keywords does have stuff after it (sort of like general settings for that group of things). But the basic rule of thumb is the section starts as soon as a line beginning with a particular keyword followed by a space is encountered and ends when another keyword followed by a space is encountered or the EOF is encountered.

    Given the revised specification for the start of the next section, you need the extended regex capability (or support for \| alternation in the basic regexes), and you will need to replace the /^$/ notation for the end of a section with an alternative such as:

     sed -E '/^(SUB_HEADING_II|SUB_HEADING_IV)$/,/^(SUB_HEADING_I|SUB_HEADING_II|SUB_HEADING_III|SUB_HEADING_IV)$/ {
             s/^LBL_B1.*/DERP/; }'
    

    The semicolon is required by BSD sed; GNU sed doesn't mind whether it is present or not. If there are more than about 4 sub-headings, I'd probably 'generate' the end marker using a Bash array:

    SH=( "SUB_HEADING_I" "THE_AUTOMOBILE" "A_SUBMARINE" "SUB_HEADING_II"
         "TRANSVERSE_COGITATION" "DIAMETRICALLY_OPPOSED" "SUB_HEADING_III"
         "CODSWALLOP" "SUB_HEADING_IV"
       )
    EH="$(IFS="|"; echo "/^(${SH[*]})\$/")"
    sed -E '/^(SUB_HEADING_II|SUB_HEADING_IV)( *[!*].*)?$/,'"$EH"' s/^LBL_B1.*/DERP/'
    

    Note that the use of ${SH[*]} rather than ${SH[@]} is crucial to this working, as is the semicolon.
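
A quick way to convince yourself of both points is to build the marker three ways and compare. This is only a demonstration with a shortened, made-up keyword list; it needs Bash because of the array.

```shell
#!/usr/bin/env bash
SH=( "PLANES" "BUSES" "SUBMARINES" )

# "${SH[*]}" yields ONE word, with elements joined by the first character of IFS.
star="$(IFS="|"; echo "/^(${SH[*]})\$/")"

# "${SH[@]}" yields one word per element; echo then joins them with spaces.
at="$(IFS="|"; echo "/^(${SH[@]})\$/")"

# Without the semicolon, IFS is only set in echo's environment, AFTER the
# arguments have already been expanded using the old (default) IFS.
nosemi="$(IFS="|" echo "/^(${SH[*]})\$/")"

echo "$star"      # /^(PLANES|BUSES|SUBMARINES)$/
echo "$at"        # /^(PLANES BUSES SUBMARINES)$/
echo "$nosemi"    # /^(PLANES BUSES SUBMARINES)$/
```

Only the first form produces a usable alternation for sed -E.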

    There is one (probably major) problem with this. Once a subsection heading has been used to mark the end of a previous section, it cannot be used as the start of another subsection, so if you need to edit two consecutive subsections, you again have to work harder. Depending on your portability requirements, I'd look at awk or Perl or Python, probably. It's easier to manage this sort of work in those languages than in sed. If the blank lines (or other fixed end-of-subsection marker) were required, then sed is able to handle the process well.
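
As a rough illustration of the awk route, here is a minimal sketch. It assumes the five keywords from the question are the complete header list, that a header is a keyword at the start of a line followed by white space, a comment, or nothing, and that the label is matched as a plain prefix of the line. The function name edit_sections and its argument order are made up for this example.

```shell
# Sketch only: edit_sections is a hypothetical helper, not part of any tool.
# usage: edit_sections "SECTION [SECTION...]" LABEL REPLACEMENT FILE
# The edited text goes to standard output; FILE itself is not modified.
edit_sections() {
    awk -v targets="$1" -v label="$2" -v repl="$3" '
        BEGIN {
            # Assumed complete list of section keywords.
            n = split("PLANES THE_TRAINS AN_AUTOMOBILE BUSES SUBMARINES", kw, " ")
            for (i = 1; i <= n; i++) is_header[kw[i]] = 1
            m = split(targets, tg, " ")
            for (i = 1; i <= m; i++) is_target[tg[i]] = 1
        }
        {
            # Drop any trailing comment, then take the first word, but only
            # if the line does not start with white space.
            line = $0
            sub(/[!*].*/, "", line)
            nf = split(line, word)
            first = (nf > 0 && line !~ /^[ \t]/) ? word[1] : ""
        }
        first in is_header {
            # Every header both closes the previous section and decides
            # whether the new one is a target, so adjacent targets work.
            in_target = (first in is_target)
            print
            next
        }
        in_target && index($0, label) == 1 { print repl; next }
        { print }
    ' "$4"
}
```

For example, edit_sections "BUSES SUBMARINES" HUGE DERP file.dat would rewrite HUGE lines in either section and nowhere else. Because the section state is recomputed on every header line, a missing label in one section never leaks into the next.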

    Of course, if you merely need the script to work on your one machine, or on one set of machines that all have essentially the same setup (the same version of sed on each), you can use the platform-specific idiosyncrasies to suit yourself. If you work in multiple environments, it helps to be aware when you are using a platform-specific feature. It may still be the correct thing to do, as long as you're aware of the issues that you will face when moving to other environments (or, at least, that there will be issues to face). Then it won't catch you by surprise, and you will do testing before trying to use the code in production on the new environment.


    After another update to the main question

    …and some code in comments…

    You had a problem with recognizing the section headers due to spaces, and the EH (end header was my mnemonic, though it's not particularly good) was not allowing for optional material after the heading keyword. I think this code works correctly.

    script.sh

    SH=( "PLANES" "THE_TRAINS" "AN_AUTOMOBILE" "BUSES" "SUBMARINES" )
    EH="$(IFS="|"; echo "/^(${SH[*]})([ !*].*)?\$/")"
    sed -E '/^BUSES([ !*].*)?$/,'"$EH"' s/^HUGE.*/DERP/' data
    
    SH=( "PLANES" "THE_TRAINS" "AN_AUTOMOBILE" "BUSES" "SUBMARINES" )
    EH="$(IFS="|"; echo "/^(${SH[*]})([ !*].*)?\$/")"
    sed -E '/^SUBMARINES([ !*].*)?$/,'"$EH"' s/^HUGE.*/DERP/' data
    

    The SH and EH lines are supposed to be the same in both command sequences. The marginally interesting part is the sed script. In each case, the start pattern is a keyword followed by ([ !*].*)?$, which matches either nothing, or a blank, ! or * followed by anything (parameters or a comment) up to the end of the line. The same regex fragment is used after the list of sub-section heading keywords in the assignment to EH, and hence in the second part of the range in sed.

    Example run:

    $ bash -x script.sh
    + '[' -f /etc/bashrc ']'
    + . /etc/bashrc
    ++ '[' -z '' ']'
    ++ return
    + alias 'r=fc -e -'
    + SH=("PLANES" "THE_TRAINS" "AN_AUTOMOBILE" "BUSES" "SUBMARINES")
    ++ IFS='|'
    ++ echo '/^(PLANES|THE_TRAINS|AN_AUTOMOBILE|BUSES|SUBMARINES)([ !*].*)?$/'
    + EH='/^(PLANES|THE_TRAINS|AN_AUTOMOBILE|BUSES|SUBMARINES)([ !*].*)?$/'
    + sed -E '/^BUSES([ !*].*)?$/,/^(PLANES|THE_TRAINS|AN_AUTOMOBILE|BUSES|SUBMARINES)([ !*].*)?$/ s/^HUGE.*/DERP/' data
    *
    * Header comments
    *
    
    PLANES
    !
    ! COMMENTS
    !
    BIG MEDIUM ##.### ##.###
    BIG SMALL ##.### ##.###
    ...
    SMALL SMALL ##.### ##.###
    THE_TRAINS
    !
    ! COMMENTS
    !
    MEDIUM MEDIUM SMALL ##.### ##.### ! COMMENT STUFF
    MEDIUM SMALL SMALL ##.### ##.### ! COMMENT STUFF
    ...
    BIG BIG BIG ##.### ##.###
    
    
    
    AN_AUTOMOBILE 0.1 shift red
    !
    ! COMMENTS
    SMALL SMALL SMALL SMALL ##.### ##.### ##.###
    SMALL MEDIUM SMALL SMALL ##.### ##.### ##.### ! COMMENT STUFF
    ...
    BIG BIG BIG SMALL ##.### ##.### ##.###
    
    BUSES
    SMALL ##.### ##.### ## !
    MEDIUM ##.### ##.### ## !
    ...
    LARGE ##.### ##.### ## !
    SUBMARINES
    SMALL ##.### ##.### ## !
    MEDIUM ##.### ##.### ## !
    ...
    HUGE ##.### ##.### ## !
    + SH=("PLANES" "THE_TRAINS" "AN_AUTOMOBILE" "BUSES" "SUBMARINES")
    ++ IFS='|'
    ++ echo '/^(PLANES|THE_TRAINS|AN_AUTOMOBILE|BUSES|SUBMARINES)([ !*].*)?$/'
    + EH='/^(PLANES|THE_TRAINS|AN_AUTOMOBILE|BUSES|SUBMARINES)([ !*].*)?$/'
    + sed -E '/^SUBMARINES([ !*].*)?$/,/^(PLANES|THE_TRAINS|AN_AUTOMOBILE|BUSES|SUBMARINES)([ !*].*)?$/ s/^HUGE.*/DERP/' data
    *
    * Header comments
    *
    
    PLANES
    !
    ! COMMENTS
    !
    BIG MEDIUM ##.### ##.###
    BIG SMALL ##.### ##.###
    ...
    SMALL SMALL ##.### ##.###
    THE_TRAINS
    !
    ! COMMENTS
    !
    MEDIUM MEDIUM SMALL ##.### ##.### ! COMMENT STUFF
    MEDIUM SMALL SMALL ##.### ##.### ! COMMENT STUFF
    ...
    BIG BIG BIG ##.### ##.###
    
    
    
    AN_AUTOMOBILE 0.1 shift red
    !
    ! COMMENTS
    SMALL SMALL SMALL SMALL ##.### ##.### ##.###
    SMALL MEDIUM SMALL SMALL ##.### ##.### ##.### ! COMMENT STUFF
    ...
    BIG BIG BIG SMALL ##.### ##.### ##.###
    
    BUSES
    SMALL ##.### ##.### ## !
    MEDIUM ##.### ##.### ## !
    ...
    LARGE ##.### ##.### ## !
    SUBMARINES
    SMALL ##.### ##.### ## !
    MEDIUM ##.### ##.### ## !
    ...
    DERP
    $
    

    Some portability notes

    These were originally comments to a now-deleted answer.

    Things like alternation with \| are not universal across versions of sed. See the POSIX specification of sed and its link to Basic Regular Expressions for a standard (lowest common denominator) definition of sed. Note that -i (and -r, -E, and \|) are not standard. The \| notation is not (documented as) supported in BSD sed as meaning alternation.

    You can activate extended regular expressions with -E, and then plain | means alternation, but you then have to remember that the other backslash sequences (\(, \{ and the closing \) and \}) lose their backslash (with -E, the backslashed form means the literal character rather than the special meaning).
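
A small side-by-side comparison may make the backslash swap concrete. The input strings are arbitrary; the only assumption is a sed that accepts -E.

```shell
# The same substitution written in basic and in extended syntax.
bre=$(printf 'abab\n' | sed 's/\(ab\)\{2\}/X/')    # BRE: \( \) \{ \} are the special forms
ere=$(printf 'abab\n' | sed -E 's/(ab){2}/X/')     # ERE: bare ( ) { } are the special forms
lit=$(printf '(ab)\n' | sed -E 's/\(ab\)/X/')      # ERE: \( and \) match literal parentheses
printf '%s %s %s\n' "$bre" "$ere" "$lit"           # prints: X X X
```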

    The semantics of the -i option are different between GNU and BSD. The only portable notation between the two has the form -i.bak (make a backup with extension .bak; the name used is selectable, but it must be a non-empty string such as .bak). To edit in place with no backup in GNU sed, you use -i with no extension attached; in BSD sed, you use -i '' (a separate argument that is the empty string). A non-empty suffix can be attached (-i.bak) or detached (-i .bak) in BSD sed; GNU sed requires that it is attached.
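
If you want in-place editing that behaves the same on both, one option is to always use the attached-suffix form and delete the backup afterwards. A minimal sketch follows; sed_inplace is a made-up helper name, and it assumes the file name does not begin with a dash.

```shell
# Hypothetical helper: edit FILE in place with SCRIPT on GNU or BSD sed.
# usage: sed_inplace SCRIPT FILE
sed_inplace() {
    # -i.bak (suffix attached) is accepted by both implementations;
    # the backup is removed only if sed itself succeeded.
    sed -i.bak -e "$1" "$2" && rm -f "$2.bak"
}
```

For example, sed_inplace 's/^HUGE.*/DERP/' file.dat edits file.dat and leaves no file.dat.bak behind on success. If sed fails, the backup (when one was created) stays in place so the original can be recovered.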