java, regex, regex-greedy

Using regex to filter a bunch of email addresses in text with some specific conditions


I'm experimenting with regex and I'm trying to filter out a bunch of email addresses that are embedded in some source text. The filtering is based on two specific conditions:

  1. Every email starts with abc

  2. It follows the regular email pattern, which includes an @ followed by a . and ends specifically in com

Source:

sajgvdaskdsdsds[email protected]sdksdhkshdsdk[email protected]wdgjkasdsdad

Pattern1 = "abc[\w\W]*[@][\w]*\.com"

code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args)
    {
        boolean found = false;
        String source = "[email protected]@gmail.comwdgjkasdsdad";


        String pattern1 = "abc[\\w\\W]*[@][\\w]*\\.com";

        Pattern p1 = Pattern.compile(pattern1);
        Matcher m1 = p1.matcher(source);
        System.out.println("Source:\t" + source);
        System.out.println("Exprsn:\t" + m1.pattern());
        while (m1.find())
        {
            found = true;
            System.out.println("Pos: " + m1.start() + "\tFound: " + m1.group());
        }
        System.out.println();
        if(!found)
        {
            System.out.println("Nothing found!");
        }

    }

}

I'm expecting this output:

Pos: 15 Found: [email protected]

Pos: 48 Found: [email protected]

But I'm getting:

Pos: 15 Found: [email protected]@gmail.com

If I use Pattern2, abc[\\w]*[@][\\w]*\\.com, I get the expected output. However, an email address can contain non-word characters after abc and before @ (for example: [email protected]).

Hence Pattern2 doesn't work with non-word characters, so I went with [\\w\\W]* instead of [\\w]*.

I also tried Pattern3, abc[\\w\\W]*[@][\\w]*\\.com[^.], and it still doesn't work.

Please help me: where am I going wrong?


Solution

  • Regex quantifiers are greedy by default, meaning that they will grab as much of the string as they can. [\w\W]* can match any character, including @, so it runs past every intervening @ and only gives back enough to match the very last one.

    Either use the reluctant form of the quantifiers (e.g. *? instead of *), or just simplify the expression:

    abc[^@]*@[^.]+\.com
    

    [^@]* will take as many characters that aren't @ as it can find, so it stops at the first @. Similarly, [^.]+ will match everything up to the first dot.

    Alternatively, you can use reluctant quantifiers:

    abc.*?@.*?\.com
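
As a rough illustration of the difference, here is a minimal sketch that runs the greedy pattern from the question and the two patterns above against a made-up source string. The addresses abc.def@example.com and abc123@gmail.com, the class name EmailFilterDemo and the helper findAll are all assumptions for illustration; they are not taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailFilterDemo {

    // Hypothetical helper: collects every match of regex in source, in order.
    static List<String> findAll(String regex, String source) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(source);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up source text; the two embedded addresses are placeholders.
        String source = "sajgvdaskdsdsdsabc.def@example.com"
                + "sdksdhkshdsdkabc123@gmail.comwdgjkasdsdad";

        // Greedy: [\w\W]* swallows the first address and stops at the last @.
        // Prints: [abc.def@example.comsdksdhkshdsdkabc123@gmail.com]
        System.out.println(findAll("abc[\\w\\W]*@\\w*\\.com", source));

        // Negated classes: [^@]* cannot cross an @, [^.]+ cannot cross a dot.
        // Prints: [abc.def@example.com, abc123@gmail.com]
        System.out.println(findAll("abc[^@]*@[^.]+\\.com", source));

        // Reluctant quantifiers: .*? expands only as far as needed.
        // Prints: [abc.def@example.com, abc123@gmail.com]
        System.out.println(findAll("abc.*?@.*?\\.com", source));
    }
}
```

One difference between the two fixes: because [^.]+ cannot cross a dot, the negated-class pattern will not match an address whose domain has more than one dot before com (such as abc@mail.example.com), while the reluctant pattern will.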