regex awk sed grep gawk

Delete all lines which don't match a pattern


I am looking for a way to delete all lines that do not follow a specific pattern (from a txt file).

Pattern of the lines I need to keep:

x//x/x/x/5/x/

x could be any amount of characters, numbers or special characters.

5 is always a combination of alphanumeric - 5 characters - e.g. Xf1Lh, and it always appears after the 5th forward slash.

/ are actual forward slashes.

Input:

abc//a/123/gds:/4AdFg/f3dsg34/
y35sdf//x/gd:df/j5je:/x/x/x
yh//x/x/x/5Fsaf/x/
45wuhrt//x/x/dsfhsdfs54uhb/
5ehys//srt/fd/ab/cde/fg/x/x

Desired output:

abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/


Solution

  • grep selects lines according to a regular expression and your x//x/x/x/5/x/ just needs minor changes to make it into a regular expression:

    $ grep -E '.*//.*/.*/.*/[[:alnum:]]{5}/.*/' file
    abc//a/123/gds:/4AdFg/f3dsg34/
    yh//x/x/x/5Fsaf/x/
    

    Explanation:

    • "x could be any amount of characters, numbers or special characters". In a regular expression that is .* where . means any character and * means zero or more of the preceding character (which in this case is .).

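    • "5 is always a combination of alphanumeric - 5 characters". In POSIX regular expressions, [[:alnum:]] means any alphanumeric character and {5} means exactly five of the preceding item. Unlike [A-Za-z0-9], [[:alnum:]] follows the current locale, so in a UTF-8 locale it also matches non-ASCII letters.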
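Before deleting anything, it can help to check which lines the pattern would drop. The following is a minimal sketch, assuming the sample input from the question is saved in a scratch file (the name sample.txt is made up for illustration). It uses grep's -v option to invert the match and -c to count lines.

```shell
# Recreate the question's sample input in a scratch directory (file name is an assumption).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.txt" <<'EOF'
abc//a/123/gds:/4AdFg/f3dsg34/
y35sdf//x/gd:df/j5je:/x/x/x
yh//x/x/x/5Fsaf/x/
45wuhrt//x/x/dsfhsdfs54uhb/
5ehys//srt/fd/ab/cde/fg/x/x
EOF

pattern='.*//.*/.*/.*/[[:alnum:]]{5}/.*/'

# Lines that match the pattern (these are kept).
kept=$(grep -E "$pattern" "$tmpdir/sample.txt")

# -v inverts the match: these are the lines that would be deleted.
dropped=$(grep -vE "$pattern" "$tmpdir/sample.txt")

# -c counts lines instead of printing them.
dropped_count=$(grep -vcE "$pattern" "$tmpdir/sample.txt")

printf 'kept:\n%s\n\ndropped (%s):\n%s\n' "$kept" "$dropped_count" "$dropped"
```

Comparing the two lists against the desired output is a quick sanity check that the regex says what you meant.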

    Possible improvements

    One issue is how x should be interpreted. In the above, x was allowed to be any character. As triplee points out, however, another reasonable interpretation is that x should be any character except /. In that case:

    grep -E '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' file
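The sample input does not show the difference between the two interpretations, so here is a small sketch with a made-up line (not from the question) that has extra path components before the 5-character field. Because . also matches /, the looser regex accepts it, while the [^/]* version does not.

```shell
# Hypothetical test line: the 5-character field comes after the 7th slash, not the 5th.
line='a//b/c/d/e/f/Xf1Lh/g/'

# Loose version: .* can swallow slashes, so the extra components slip through.
loose=$(printf '%s\n' "$line" | grep -cE '.*//.*/.*/.*/[[:alnum:]]{5}/.*/' || true)

# Strict version: [^/]* cannot cross a slash, so the field position is enforced.
# grep -c prints 0 and exits non-zero when nothing matches, hence the "|| true".
strict=$(printf '%s\n' "$line" | grep -cE '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' || true)

echo "loose matches: $loose, strict matches: $strict"
```

If the 5-character field really must sit right after the 5th slash, the stricter form is probably the one you want.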
    

    Also, we might want this regex to match only complete lines. In that case, we can either surround the regex with ^ and $ or we can use grep's -x option:

    grep -xE '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' file
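Note that grep only prints the matching lines; the file itself is left unchanged. To actually delete the other lines from the file, one common approach is to write the output to a temporary file and move it over the original. The sketch below assumes a scratch file called data.txt (a made-up name) and also shows a sed command that should give the same result, using # as the regex delimiter because the pattern itself is full of slashes.

```shell
# Scratch copy of the question's sample input (path and file name are assumptions).
tmpdir=$(mktemp -d)
file="$tmpdir/data.txt"
cat > "$file" <<'EOF'
abc//a/123/gds:/4AdFg/f3dsg34/
y35sdf//x/gd:df/j5je:/x/x/x
yh//x/x/x/5Fsaf/x/
45wuhrt//x/x/dsfhsdfs54uhb/
5ehys//srt/fd/ab/cde/fg/x/x
EOF

# sed alternative: delete (d) every line that does NOT (!) match the anchored regex.
# \#...# selects a custom delimiter so the slashes need no escaping.
sed_out=$(sed -E '\#^[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/$#!d' "$file")

# grep cannot edit in place: write to a temp file, then replace the original
# only if grep succeeded (that is, at least one line matched).
grep -xE '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' "$file" > "$file.tmp" \
  && mv "$file.tmp" "$file"

cat "$file"
```

The && guards against replacing the file with an empty one when nothing matches; whether that is the behaviour you want depends on your data.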