
Regex for minimum number of appearances of capturing group on a line


I have a text file where each line consists of a series of numbers separated by spaces, followed by a word. Each number uses only the digits 1 through 6, and the digits within each number are unique and in ascending order. The word at the end of each line is not important.

For example:

2356 345 12345 4 4 1 6 gripped
12346 2 2346 123456 2356 56 245 12346 13456 12456 misidentifies
1256 345 24 12456 12356 123456 12356 356 1256 5 26 swine

would all be valid lines within my file.

I need to write a grep command which uses a regex to match all lines which contain at least 8 numbers which have a 1 or a 6. That means the line 346 1245 136 23456 5 1356 123456 5 123456 123456 octettes is a match (346, 1245, 136, 23456, 1356, 123456, 123456, 123456 are 8 numbers) but the line 1 236 145 23 16 4 12356 4 3 packers is not a match (1, 236, 145, 16, 12356 are only 5 numbers).

Note: the regex does not have to match the full line. grep returns all lines with a match present somewhere so the only important part is having the minimum 8 matches.

I have constructed this regex: ((?:(?:123456)|(?:1[2-5]*)|(?:[2-5]*6)) ) It matches every number that meets the condition and does not count 123456 twice. My issue is now with counting the number of occurrences. A {8,} would be sufficient if all of the matching numbers were adjacent, but sometimes there are one or more non-matching numbers in between (e.g. 134 4 245 1245).

I have tried a lot of things including putting [2-5]{0,5}, [2-5]* or .* in the matching group to be repeated (with a {8,}) but nothing seemed to work. They are either not matching correctly or giving a catastrophic backtracking error.

I am pretty new to regex so I might have misunderstood how some things work. I know I need to modify my capturing group for my {8,} quantifier to work but I do not know how.

Regex101 link with more examples and my current (partial) solution here.


Solution

  • If there are single spaces only, you might use

    ^(?:(?:[2345]+ )*[2345]*[16][1-6]* ){7}(?:[2345]+ )*[2345]*[16]
    

    The pattern matches

    • ^ Start of string
    • (?: Non capture group to repeat as a whole
      • (?:[2345]+ )* Optionally match numbers without 1 or 6
      • [2345]*[16][1-6]* Match a number with 1 or 6
    • ){7} Close the non capture group and repeat 7 times
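    • (?:[2345]+ )*[2345]*[16] The 8th match: optionally skip numbers without 1 or 6, then match a number up to its first 1 or 6 (the rest of the line does not need to be matched)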

    Regex demo
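
As a quick sanity check of the breakdown above, here is a minimal Python sketch that runs the single-space pattern against the two lines from the question and compares the result with a plain token count. The helper names `regex_matches` and `count_directly` are made up for this illustration; Python's `re` module is used only because it accepts this pattern unchanged, not because the answer depends on it.

```python
import re

# The single-space pattern from the answer, copied verbatim.
PATTERN = re.compile(
    r"^(?:(?:[2345]+ )*[2345]*[16][1-6]* ){7}(?:[2345]+ )*[2345]*[16]"
)


def regex_matches(line: str) -> bool:
    """Return True if the pattern finds at least 8 numbers containing a 1 or a 6."""
    return PATTERN.search(line) is not None


def count_directly(line: str) -> int:
    """Reference count without a regex (hypothetical helper).

    Splits on whitespace, drops the trailing word, and counts the
    numbers that contain a 1 or a 6.
    """
    numbers = line.split()[:-1]
    return sum(1 for n in numbers if "1" in n or "6" in n)


SAMPLES = [
    "346 1245 136 23456 5 1356 123456 5 123456 123456 octettes",
    "1 236 145 23 16 4 12356 4 3 packers",
]

for line in SAMPLES:
    # Print the regex verdict next to the direct count for comparison.
    print(regex_matches(line), count_directly(line), line)
```

The first sample should be reported as a match with a direct count of 8, the second as a non-match with a count of 5, which agrees with the counts given in the question.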

    Example using grep -E matching 1 or more spaces or tabs:

    grep -E "^(([2345]+[[:blank:]]+)*[2345]*[16][1-6]*[[:blank:]]+){7}([2345]+[[:blank:]]+)*[2345]*[16]" file
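
To illustrate why the separator is widened from a single space, the sketch below compares the single-space pattern with a blank-tolerant one on a line that contains a double space and a tab. This is an approximation, not the grep command itself: Python's `re` does not support the POSIX class `[[:blank:]]`, so it is replaced here by `[ \t]` (space or tab), and the capturing groups are written as non-capturing ones. The function names and the sample line are invented for the example.

```python
import re

# Single-space pattern from the answer.
SINGLE_SPACE = re.compile(
    r"^(?:(?:[2345]+ )*[2345]*[16][1-6]* ){7}(?:[2345]+ )*[2345]*[16]"
)

# Assumed Python equivalent of the grep -E pattern:
# [[:blank:]]+ is written as [ \t]+ because re lacks POSIX classes.
BLANKS = re.compile(
    r"^(?:(?:[2345]+[ \t]+)*[2345]*[16][1-6]*[ \t]+){7}"
    r"(?:[2345]+[ \t]+)*[2345]*[16]"
)


def matches_single_space(line: str) -> bool:
    """True if the single-space pattern matches the line."""
    return SINGLE_SPACE.search(line) is not None


def matches_blanks(line: str) -> bool:
    """True if the blank-tolerant pattern matches the line."""
    return BLANKS.search(line) is not None


# Same numbers as the matching example in the question, but separated
# by a double space and a tab in two places (made-up variation).
line = "346  1245\t136 23456 5 1356 123456 5 123456 123456 octettes"

print(matches_single_space(line))  # the double space breaks this pattern
print(matches_blanks(line))        # runs of spaces and tabs are accepted
```

On this line the single-space pattern should fail while the blank-tolerant one should match, since the eight qualifying numbers are still present and only the separators differ.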
    

    Example using grep -P matching 1 or more horizontal whitespace characters:

    grep -P "^(?>(?:[2345]+\h+)*+[2345]*[16][1-6]*\h+){7}(?:[2345]+\h+)*+[2345]*[16]" file