Search code examples
regexstringalgorithmgrammar-induction

Automatically built regex expressions that fit set of strings


We have written the system to analyse log messages from the large network. The system takes log messages from lots of different network elements, and analyses it by regex expressions. For example user may have written two rules:

^cron/script\.sh.*
.*script\.sh [0-9]+$

In this case only logs that match given patterns will be selected. The reason of the filtering is that there may be really lots of log messages, up to 1 GB per day.

Now the main part of my question. Since there is lots of network elements, and several types of them, and every one of them has different parameters in path... Is there any way to automatically generate set of regexes that will somehow group the logs? The system can learn on historical data, e.g. from the last week. Generated regex must not be very accurate, it is supposed to be the hint for user to add such new rule into system.

I was thinking about unsupervised machine learning to divide input into groups and then in each group find proper regex. Is there any other way, maybe faster or better? And, last but not least, how to find regex that matches all strings in obtained group? (Non-trivial, so .* is not the answer.)


Edit After some thinking I'll try to simplify the problem. Suppose I have already grouped logs. I'd like to find (at most) three largest substrings (at least one) common to all the strings in set. For example:

Set of strings:
cron/script1.sh -abc 1243 all
cron/script2.sh 1
bin/script1.sh -asdf 15

Obtained groups:
/script
.sh 

Now I can build some simple regex by concatenating these groups with .*?. In this example it would be .*?(/script).*?(\.sh ).*?. It seems to be simpler solution.


Solution

  • OK, we'll try to break this down into manageable steps.

      1. For each substring w in s1, in order of non-increasing length,
      2.  assume w is a substring of the other sM
      3.  for each string of the other sN,
      4.   if w is not a substring of sN, disprove assumption and break
      5.  if the assumption held, save w
      6.  if you've found three w that work, break
      7. You have recorded between 0 and 3 w that work.
    

    Note that not all sets of strings are guaranteed to have common substrings (except the empty string). In the worst case, assume s1 is the longest string. There are O(n^2) substrings of s1 (|s1| = n) and it takes O(n) to compare to each of m other strings... so the asymptotic complexity is, I believe, O(n^2 * nm)... even though the algorithm is naive, this should be pretty manageable (polynomial, after all, and quadratic at that).

    The transformation to e.g. C code should be straightforward... use a sliding window with a decrementing length loop to get substrings of s1, and then use linear searchers to find matches in the other strings.

    I'm sure there are smarter / asymptotically better ways of doing this, but any algorithm will have to look at all characters in all strings, so O(nm)... may not be completely right here.