Search code examples
parsingartificial-intelligencepattern-recognition

String pattern recognition without training set


I have multiple strings, which are created based on a few (mostly) known variables and a few unknown templates. I'd like to know what those templates were to extract the variable parts from these strings. After that I can relatively easily infer the meaning of each substring, so only the pattern recognition is the question here. For example:

"76 (q) h"
"a x q y 123"
"c x e y 73"
"3 (e) z"
...

# pattern recognition: examples -> templates
"{1} x {2} y {3}"
"{1} ({2}) {3}"

# clusters based on template type
"{1} x {2} y {3}" -> ["a x q y 123", "c x e y 73", ...]
"{1} ({2}) {3}" -> ["76 (q) h", "3 (e) z", ...]

# inference: substrings -> extracted variables
"76 (q) h" -> ["76", "q", "h"] -> {x: "h", y: "q", z: 76}
"a x q y 123" -> ["a", "q", "123"] -> {x: "a", y: "q", z: 123}
"c x e y 73" -> ["c", "e", "73"] -> {x: "c", y: "e", z: 73}
"3 (e) z" -> ["3", "e", "z"] -> {x: "z", y: "e", z: 3}

I have found a similar question: Intelligent pattern matching in string, but in my case there is no way to train the parser with positives. Any idea how to solve this?


Solution

  • It turned out what I need is called sequential pattern mining. There are many algorithms for example SPADE, PrefixSpan, CloSpan, BIDE, etc. What I need is an algorithm, which works with gaps too, or an algorithm which finds the frequent substrings which I can concatenate with wildcards. Selecting the proper pattern from the found frequent closed patterns is far from obvious, I am still working on it, but I am a lot closer now than 2 months ago.