Working on something similar to Solr's WordDelimiterFilter, but not in Java.
Want to split words into tokens like this:
P90X = P, 90, X (split on letter/number boundary)
TotallyCromulentWord = Totally, Cromulent, Word (split on lowercase/uppercase boundary)
TransAM = Trans, AM
I'm looking for a general solution, not one specific to the examples above, preferably in a regex flavour that doesn't support lookbehind. If necessary I can use PL/Perl, which does support lookbehind.
Found a few answers on SO, but they all seemed to use lookbehind.
Things to split on:
My main concern is 1 and 2.
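That's not something I'd like to do without lookbehind, but for the challenge, here is a JavaScript solution that you should be able to convert easily to whatever language you need. It repeatedly matches one token anchored at the end of the string, then chops that token off and tries again: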
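function split(s) {
    var match;
    var result = [];
    // Match one token at the END of the string: an uppercase run, a
    // (capitalised) lowercase word, a digit run, or a run of other characters.
    while ((match = s.match(/([A-Z]+|[A-Z]?[a-z]+|[0-9]+|([^a-zA-Z0-9])+)$/))) {
        if (!match[2]) {
            // Group 2 is only set for non-alphanumeric runs; don't return those.
            result.unshift(match[1]);
        }
        // Strip the matched token off the end and continue with the rest.
        s = s.substring(0, s.length - match[1].length);
    }
    return result;
}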
Demo:
P90X [ 'P', '90', 'X' ]
TotallyCromulentWord [ 'Totally', 'Cromulent', 'Word' ]
TransAM [ 'Trans', 'AM' ]
URLConverter [ 'URL', 'Converter' ]
Abc.DEF$012 [ 'Abc', 'DEF', '012' ]
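If your regex flavour has negative lookahead (older JavaScript engines had lookahead long before lookbehind), the same token classes can probably be collected in one left-to-right pass instead of peeling tokens off the end. This is only a sketch under that assumption: `splitForward` is a made-up name, and the lookahead `(?![a-z])` is there so that an uppercase run gives up its last letter when that letter starts a capitalised word (the `C` in `URLConverter`).

```javascript
// Hypothetical forward-scanning variant; assumes negative lookahead is available.
function splitForward(s) {
    // Alternatives, tried in order at each position:
    //   [A-Z]?[a-z]+      a lowercase word, optionally capitalised
    //   [A-Z]+(?![a-z])   an uppercase run not followed by a lowercase letter
    //   [0-9]+            a digit run
    // Anything else (punctuation, whitespace) is skipped by the global match.
    return s.match(/[A-Z]?[a-z]+|[A-Z]+(?![a-z])|[0-9]+/g) || [];
}

console.log(splitForward("P90X"));         // [ 'P', '90', 'X' ]
console.log(splitForward("TransAM"));      // [ 'Trans', 'AM' ]
console.log(splitForward("URLConverter")); // [ 'URL', 'Converter' ]
console.log(splitForward("Abc.DEF$012"));  // [ 'Abc', 'DEF', '012' ]
```

It gives the same results as the demo above for those inputs. The `|| []` is needed because `match` with the `g` flag returns `null` rather than an empty array when nothing matches.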