Search code examples
regexregex-lookarounds

regex: matching an entire line with an exact number of repeating tokens


I'm trying to write a regex that can match ONLY those lines where every white-space separated token occurs exactly twice, regardless of order.

For example, the entirety of the following lines should match:

1 1 2 2
100 10 10 100
A B B A 
HELLO HELLO

The following lines should NOT match:

hello hello hello
1 1 22
1001

Although I'm able to match individual groups of repetitions in a given line, using the regex (\d+)(?=.*(\1)), I am struggling to match entire lines using ^$. My guess is that when I use the lookahead, this is creating an infinite loop where we constantly look at every token (including the repetitions) and expect a repeat later in the string, although I am not sure how to resolve this. Any ideas? Thanks!

[EDIT]: To add some details, based on questions from the comments: Obviously it is fairly trivial to implement this as a function in most programming languages. However,I was originally looking to implement this as a regex, because I was trying to match certain records in a database. Thus, this regex was intended to be embedded as a CASE statement in a SQL query, which in my mind would've been a nice way make the selection.

Given the apparent complexity of such a regex, it seems that creating a function is the way to go, and thus pretty much any of the following answers would be fine solutions, circumstances depending.


Solution

  • You could create a function to split the string and count the occurrences of each word in it, returning true if all the words have a count of 2:

    create function all_pairs(v text) returns bool as $$
      with counts as (
        select count(*) as c
        from unnest(string_to_array(v, ' ')) as vals(val)
        group by val
      ),
      arr as (
        select array_agg(c) as cc
        from counts
      )
      select 2 = all(cc)
      from arr
    $$ language sql
    

    Then you can simply call the function to test your string. For example:

    with values as (
       select '1 1 2 2' as v union all
       select '100 10 10 100' union all
       select 'A B B A' union all 
       select 'HELLO HELLO' union all
       select 'hello hello hello' union all
       select '1 1 22' union all
       select '1001'
    )
    select * 
    from values
    where all_pairs(v)
    

    Output:

    v
    1 1 2 2
    100 10 10 100
    A B B A
    HELLO HELLO
    

    Demo on dbfiddle.uk