I've been struggling to write a single-line regular expression that splits out everything I need. I really want to exhaust all my options before I resort to making a second pass over my data.
Currently I'm using this regular expression to split a single line of ASCII data into fragments:
line = 'setAttr -s 2 ".iog[0].og"'
re.findall(r'("[^"\\]*(?:\\.[^"\\]*)*"|[^\s();]+)', line)
// Result: ['setAttr', '-s', '2', '".iog[0].og"']
What I really want is to capture just the text inside the quotes (without the quotes themselves), along with all of the regular words, numbers, and flags:
// Result: ['setAttr', '-s', '2', '.iog[0].og']
I know this seems silly, but performance is make-or-break for this code. Using shlex is currently out of the question since it takes far too long to process thousands of lines of data.
Does anybody know of such an expression?
You can capture the parts you need with two capturing groups and then concatenate them:
r'"([^"\\]*(?:\\.[^"\\]*)*)"|([^\s();]+)'
   ^                      ^  ^         ^
This works because only one of the two capturing groups takes part in any given match. re.findall returns an empty string for the other group, so concatenating the pair yields the matched token:
["{}{}".format(x,y) for x, y in re.findall(r'"([^"\\]*(?:\\.[^"\\]*)*)"|([^\s();]+)', line)]
See the Python demo
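Since speed matters here, a minimal sketch of the same idea with the pattern compiled once and reused for every line might look like the following. The names TOKEN_RE and tokenize are made up for illustration, and I have not benchmarked this against your data, so treat it as a starting point rather than a guaranteed speedup.

```python
import re

# Hypothetical names (TOKEN_RE, tokenize), used only for this illustration.
# Group 1: contents of a double-quoted string (backslash escapes allowed).
# Group 2: a bare token (no whitespace, parentheses, or semicolons).
TOKEN_RE = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"|([^\s();]+)')

def tokenize(line):
    # findall yields one (quoted, bare) tuple per match; the group that did
    # not take part in the match is '', so joining the pair gives the token.
    return [quoted + bare for quoted, bare in TOKEN_RE.findall(line)]

line = 'setAttr -s 2 ".iog[0].og"'
print(tokenize(line))  # ['setAttr', '-s', '2', '.iog[0].og']
```

Note that escaped quotes inside a string are returned as written (the backslash is kept), and an empty quoted string comes back as an empty token.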