I have multiple strings, which are created based on a few (mostly) known variables and a few unknown templates. I'd like to know what those templates were to extract the variable parts from these strings. After that I can relatively easily infer the meaning of each substring, so only the pattern recognition is the question here. For example:
"76 (q) h"
"a x q y 123"
"c x e y 73"
"3 (e) z"
...
# pattern recognition: examples -> templates
"{1} x {2} y {3}"
"{1} ({2}) {3}"
# clusters based on template type
"{1} x {2} y {3}" -> ["a x q y 123", "c x e y 73", ...]
"{1} ({2}) {3}" -> ["76 (q) h", "3 (e) z", ...]
# inference: substrings -> extracted variables
"76 (q) h" -> ["76", "q", "h"] -> {x: "h", y: "q", z: 76}
"a x q y 123" -> ["a", "q", "123"] -> {x: "a", y: "q", z: 123}
"c x e y 73" -> ["c", "e", "73"] -> {x: "c", y: "e", z: 73}
"3 (e) z" -> ["3", "e", "z"] -> {x: "z", y: "e", z: 3}
I have found a similar question: Intelligent pattern matching in string, but in my case there is no way to train the parser with positives. Any idea how to solve this?
It turned out what I need is called sequential pattern mining. There are many algorithms for example SPADE, PrefixSpan, CloSpan, BIDE, etc. What I need is an algorithm, which works with gaps too, or an algorithm which finds the frequent substrings which I can concatenate with wildcards. Selecting the proper pattern from the found frequent closed patterns is far from obvious, I am still working on it, but I am a lot closer now than 2 months ago.