Wrong behavior in grep with @s


I was writing a small wrapper for nullmailer when I noticed what is, IMHO, unwanted behavior in grep. In particular, I noticed something strange with @ characters.

It splits strings containing @ and produces output that I consider wrong.

TL;DR

E-mail addresses have rules to follow (e.g. RFC 2822), so I will use a deliberately simplified, incorrect regular expression for them, just to keep things a bit shorter. Note that this does not change the problem I am asking about.

I am using e-mail addresses in this post, but the problem obviously applies to any string with at least one @ in it.

I wrote a small script to help me explain what I "found":

#!/bin/bash

funct1() {

  arr=([email protected] [email protected])
  regex="[[:alnum:]]*@[[:alpha:]]*\.[[:alpha:]]\{2,\}"
  for dest in ${arr[@]}; do
    printf "%s\n" "$dest" | grep -o -e "$regex"
  done
}
funct2() {
  arr=([email protected] [email protected])
  regex="[[:alpha:]]*@[[:alpha:]]*\.[[:alpha:]]\{2,\}"
  for dest in ${arr[@]}; do
    printf "%s\n" "$dest" | grep -o -e "$regex"
  done
}

funct3(){
  arr=(local1@[email protected] local2@[email protected])
  regex="[[:alpha:]]*@[[:alpha:]]*@[[:alpha:]]*\.[[:alpha:]]\{2,\}"
  for dest in ${arr[@]}; do
    printf "%s\n" "$dest" | grep -o -e "$regex"
  done
}

funct4(){
  arr=(local1@[email protected] local2@[email protected])
  regex="[[:alpha:]]*@[[:alnum:]]*@[[:alpha:]]*\.[[:alpha:]]\{2,\}"
  for dest in ${arr[@]}; do
    printf "%s\n" "$dest" | grep -o -e "$regex"
  done
}

printf "One @, all parts of regex right:\n"
funct1
printf "One @, first part of regex wrong:\n"
funct2
printf "Two @, first and second part of regex wrong:\n"
funct3
printf "Two @, first part of regex wrong:\n"
funct4
exit 0

To better understand the problem, I used two types of strings: [email protected] and local1@[email protected]. It seems to me that grep does not behave correctly with strings containing at least one @.

The output is:

One @, all parts of regex right:
[email protected]
[email protected]

One @, first part of regex wrong:
@domain.tld
@domain.tld

Two @, first and second part of regex wrong:

Two @, first part of regex wrong:
@[email protected]
@[email protected]

funct1 has a regular expression that matches the entire strings, so there is no problem: all of them are printed.

funct2 has a regular expression that matches only the part of each string from @ to the end, so I would expect no output, because the expression is wrong; instead, I get the second part of the strings...

That is why I decided to add a second @ to the strings and run some more tests.

funct3 matches only the part of each string from the second @ to the end, so I would expect no output at all because of the mistake in the regex. OK, no output.

funct4, instead, has a regular expression that matches only the part of each string from the first @ to the end, so I would expect it to show nothing; instead, I get the output from the first @ onward, just as in funct2.

Except for funct1, I shouldn't get any output at all, am I right?

Why does grep break the result at the first @?

I consider it unwanted behavior because this way the result will consist of strings that don't match my expression entirely.

Am I missing something?

EDIT: deleted the undefined-behavior tag


Solution

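  • Your regex has issues; grep is working as designed. The pattern is not anchored and [[:alpha:]]* can match zero characters, so grep -o prints only the part of the line that does match, starting at the @. You could also simply count the number of @ characters as a test. Personally, I would create a boolean function like this: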

    #!/bin/bash
    
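    # -- is email address valid ? --
    # Returns 0 (true) when the whole argument matches the pattern, 1 otherwise.
    function isEmailValid() {
          # grep -E replaces the deprecated egrep; the trailing $ rejects trailing garbage.
          printf '%s\n' "$1" | grep -Eq '^([A-Za-z]+[A-Za-z0-9]*([._-]?[A-Za-z]+[A-Za-z0-9]*){1,})@(([A-Za-z]+[A-Za-z0-9]*)+([._-]?([A-Za-z]+[A-Za-z0-9]*)+){1,})+\.([A-Za-z]{2,})+$'
    }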
    
    
    if isEmailValid "_#@[email protected]" ;then
            echo "VALID "
    else
            echo "INVALID"
    fi
    
    
    if isEmailValid "[email protected]" ;then
            echo "VALID "
    else
            echo "INVALID"
    fi
    

    Or more simply:

    function isEmailValid() {
          # Anchored at both ends so the whole argument has to match.
          local regex='^([A-Za-z]+[A-Za-z0-9]*([._-]?[A-Za-z]+[A-Za-z0-9]*){1,})@(([A-Za-z]+[A-Za-z0-9]*)+([._-]?([A-Za-z]+[A-Za-z0-9]*)+){1,})+\.([A-Za-z]{2,})+$'
          [[ "${1}" =~ $regex ]]
    }
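
To see why grep prints the part starting at @, it may help to compare an unanchored match with a whole-line match. The sketch below reuses the shape of the funct2 regex from the question. The sample addresses are placeholders made up for illustration, and match_anywhere, match_whole_line and count_at are hypothetical helper names, not part of grep or of the original script. It also shows one possible way to count @ characters, as suggested above.

```shell
#!/bin/bash

# Placeholder inputs (assumptions for this sketch, not the question's data).
samples=('local1@domain.tld' 'abc@domain.tld' 'a@b@domain.tld')

# Same shape as the regex in funct2 (BRE syntax, as used by plain grep).
regex='[[:alpha:]]*@[[:alpha:]]*\.[[:alpha:]]\{2,\}'

# Unanchored: the match may start anywhere in the line, and [[:alpha:]]*
# may match zero characters, so -o prints just the matching substring.
match_anywhere() {
  printf '%s\n' "$1" | grep -o -e "$regex"
}

# -x (line-regexp): the pattern must match the whole line, so nothing is
# printed when only a part of the line would match.
match_whole_line() {
  printf '%s\n' "$1" | grep -x -e "$regex"
}

# Count '@' characters by deleting every other character and measuring
# what is left.
count_at() {
  local only_at=${1//[^@]/}
  printf '%s\n' "${#only_at}"
}

for s in "${samples[@]}"; do
  printf '%s: anywhere=[%s] whole-line=[%s] at-signs=%s\n' \
    "$s" "$(match_anywhere "$s")" "$(match_whole_line "$s")" "$(count_at "$s")"
done
```

For local1@domain.tld, the unanchored version prints @domain.tld, because the digit 1 prevents a match from starting earlier, while the -x version prints nothing. Anchoring the pattern with ^ and $ should have the same effect as -x here.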