Search code examples
javaregexcryptographyxor

Regex to find reoccurring number sets in a set of a numbers


Given a set of numbers is it possible for regex to find subsets of numbers that occur of length N more than once, preferably on a loop variable N. I currently have something that finds more than single occurrences, but this returns too much noise. I want it to find sets of length N in a loop, that decrements N from large sets to small.

The seemingly arbitrary sequence of numbers is a byte array of characters converted to a string of numbers, the sets I want to catch are the possible keys for an XOR encoded file.

Given the encoded text is sufficiently long, there may be a time when N spaces are xor'd with a key of length N, which reproduces the key in roughly plaintext. I have tested this, for example:

"            " ^ "ThisIsTheKey" produces roughly "tHISiStHEkEY"

Current regex(java engine):

    String regex = "(\\d+)\\1";
    Pattern patt = Pattern.compile(regex);
    Matcher matcher = patt.matcher(sToDecode);           
    while (matcher.find())                              
    {               
        System.out.println("Repeated substring: " + matcher.group(1));
    } 

Given: 737568797372696810068791021116868686873696868657376791001117268681067368686868736865736810169686872687972686568689876796869726874749911010194687265796810111086696511099688368688369868984896876708580849586987885681111109978697865767372737668676968796870797899110101110107736868726569697978736868657394707570661101011101079878991101101026968736879686572100736868766968736879686572100736867681107968657210073686876696873687968657210073686876696873687968101110107981007368687669687368796865721007368687669681006872689968796865721007368687669687368796865721007368687673666910772100736868766968736879686572100736868766810011073687968657210073686876696873687767696868711109911010168657210073686876696873687968657210073686876696873687968657210073681111107368796865721007368687669687368796865721007368687669687299110101686572100736868766968736879686572100681056899687968657210073686876696873687968657210073686876696873687310111010772100736868766968736879686572100736868766968737368102111110736879686572100...

This will find the following reoccuring subsets:

...
Repeated substring: 736879686572100736868766968
Repeated substring: 1
Repeated substring: 0
Repeated substring: 68
Repeated substring: 6
Repeated substring: 0
Repeated substring: 68
Repeated substring: 686572100736868766968736879
Repeated substring: 1
Repeated substring: 657210073686876696873687968
...

Please let me if regex can be changed so it will only return:

Repeated substring: 736879686572100736868766968
Repeated substring: 686572100736868766968736879
Repeated substring: 657210073686876696873687968

Solution

  • Using + will match from one to many numbers, that's why you get all theses short substrings. If you want to add a constraint on the length, just change it for {n,m} where 0<=n<m (one of it can be blank).

    To get the groups of 3 and more repeating numbers, use:

    (\d{3,})\1