java regex regex-greedy regex-alternation

Priority of alternatives in regex matching


I wrote some Java code to split a string into an array of strings. First I split the string using the regex pattern "\\,\\,|\\,", and then I split it using the pattern "\\,|\\,\\,". Why is the output of the first different from the output of the second?

public class Test2 {
    public static void main(String[] args){

        String regex1 = "\\,\\,|\\,";
        String regex2 = "\\,|\\,\\,"; 

        String a  = "20140608,FT141590Z0LL,0608103611018634TCKJ3301000000018667,3000054789,IDR1742630000001,80507,1000,6012,TCKJ3301,6.00E+12,ID0010015,WADORI PURWANTO,,3000054789";
        String ss[] = a.split(regex1); 

        int index = 0; 
        for(String m : ss){
            System.out.println((index++)+ ": "+m+"|"); 
        }
    }
} 

Output when using regex1:

0: 20140608|
1: FT141590Z0LL|
2: 0608103611018634TCKJ3301000000018667|
3: 3000054789|
4: IDR1742630000001|
5: 80507|
6: 1000|
7: 6012|
8: TCKJ3301|
9: 6.00E+12|
10: ID0010015|
11: WADORI PURWANTO|
12: 3000054789|

And when using regex2:

0: 20140608|
1: FT141590Z0LL|
2: 0608103611018634TCKJ3301000000018667|
3: 3000054789|
4: IDR1742630000001|
5: 80507|
6: 1000|
7: 6012|
8: TCKJ3301|
9: 6.00E+12|
10: ID0010015|
11: WADORI PURWANTO|
12: |
13: 3000054789|

I would like an explanation of how the regex engine handles this situation.


Solution

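  • How regex works: the engine reads the input from left to right and, at each position, tries the alternatives in the order they are written, taking the first one that matches. ",|,," is therefore equivalent to ",", because the first alternative always succeeds and ",," is never tried. The two consecutive commas are matched as two separate delimiters, which is where the empty string at index 12 comes from.

    ",,|," is equivalent to ",,?": the two-comma alternative is tried first, and the single comma is only the fallback.

    However, you should use ",,?" instead so there's no backtracking into a second alternative.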
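The behaviour described above can be checked with a small Java sketch. The class name AlternationOrder, the helper firstMatch, and the shortened input "a,,b" are made up for illustration and are not part of the original question; the patterns are written without the backslashes because a comma is not a regex metacharacter, so "," and "\\," should behave the same.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrder {

    // Hypothetical helper: returns the text consumed by the first match
    // of regex in input, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the record in the question: two fields
        // separated by a double comma.
        String input = "a,,b";

        // Same alternatives in both orders, plus the suggested ",,?" form.
        for (String regex : new String[] {",|,,", ",,|,", ",,?"}) {
            System.out.println(regex
                    + " first match: \"" + firstMatch(regex, input) + "\""
                    + ", split: " + Arrays.toString(input.split(regex)));
        }
    }
}
```

With ",|,," the first match is a single comma and the split produces an empty element between a and b; with ",,|," and ",,?" the first match is the double comma and the split produces only a and b.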