Suppose I have an error log and I wish to get a count of each type of error. I have already performed a naive count by grouping by error message; however, many of the messages contain context-specific information, so messages caused by the same bug cannot simply be grouped by their exact text.
Some examples, where the italicised segments vary per instance of error:
I would like to group all such messages using some kind of fuzzy matching. I understand the Levenshtein distance algorithm is useful for this type of processing, but the raw distance alone is not enough because it is not normalised against the strings' lengths (an edit distance of 30 is far less significant in a string of 1000 characters than in one of 100).
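For illustration, this is roughly the kind of normalised measure I mean (a minimal sketch; dividing by the longer string's length is just one possible normalisation):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for completely different ones."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```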
So my aim is to iterate over a list of messages and produce a fuzzily matched count. A side issue may be generating a consistent key for each group of matched messages. How would I go about this?
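Something along these lines is what I have in mind, reusing the `similarity` helper from the sketch above (a rough greedy one-pass approach; the 0.8 threshold is an arbitrary placeholder, and the first message seen in each group doubles as its key):

```python
from collections import Counter

def group_messages(messages, threshold=0.8):
    """Greedy grouping: each message joins the first existing group
    whose representative it resembles, else it starts a new group."""
    counts = Counter()
    keys = []  # representative message per group
    for msg in messages:
        for key in keys:
            if similarity(msg, key) >= threshold:
                counts[key] += 1
                break
        else:
            keys.append(msg)
            counts[msg] = 1
    return counts

msgs = [
    "Timeout connecting to host db01",
    "Timeout connecting to host db07",
    "Null reference in OrderService.Submit",
]
print(group_messages(msgs))
# e.g. Counter({'Timeout connecting to host db01': 2,
#               'Null reference in OrderService.Submit': 1})
```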
I would give q-gram distance a try. The similarity of two strings is then determined by the number of N-grams they have in common. N has to be large enough that an N-gram represents a relevant detail; N=4 might be a good starting point.
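A minimal sketch of what I mean, assuming "in common" is read as a multiset intersection of the two strings' q-gram counts:

```python
from collections import Counter

def qgrams(s: str, n: int = 4) -> Counter:
    """Multiset of all length-n substrings of s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def qgram_similarity(a: str, b: str, n: int = 4) -> float:
    """Shared q-grams, normalised by the larger q-gram count."""
    ga, gb = qgrams(a, n), qgrams(b, n)
    total = max(sum(ga.values()), sum(gb.values()))
    return sum((ga & gb).values()) / total if total else 1.0
```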
Further string distances are derived from the concept of N-grams, e.g. cosine and Jaccard distance.
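For instance, both can be computed directly from the `qgrams` helper above (a sketch, not a tuned implementation):

```python
import math

def jaccard_distance(a: str, b: str, n: int = 4) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the sets of n-grams."""
    sa, sb = set(qgrams(a, n)), set(qgrams(b, n))
    if not sa and not sb:
        return 0.0
    return 1 - len(sa & sb) / len(sa | sb)

def cosine_distance(a: str, b: str, n: int = 4) -> float:
    """1 - cosine of the angle between the n-gram count vectors."""
    ga, gb = qgrams(a, n), qgrams(b, n)
    dot = sum(ga[g] * gb[g] for g in ga)
    norm = (math.sqrt(sum(v * v for v in ga.values()))
            * math.sqrt(sum(v * v for v in gb.values())))
    return (1 - dot / norm) if norm else 1.0
```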
This text explains different types of string distance algorithms in the context of R.