I am trying to extract some data from the following examples:
The results I'd like, respectively, are:
I am happy to do this in multiple passes or with an expression grammar, though I don't think the latter will really help.
I'm having trouble using lookaheads and lookbehinds to grab that data while excluding things like "11-mill" and "XY-2822". I am able to exclude those matches, but I end up truncating good results for other matches.
What is the best way to go about this?
My current regex is
/(?:(\d+)[b\b\/-])([b\d\b]*)[^a-z]/i
which captures the letter 'b' (which is okay) but does not capture 34b in the final example.
I'm not sure what your exact requirements/formats are, but you can try this:
/(?:\G(?!^)[-\/]|^(?:.*[^\d\/-])?)\K\d++(?![-\/]\D)/
http://rubular.com/r/WJqcCNe2pr
details:
# two possible starts:
(?:                    # next occurrences
    \G                 # anchor for the position after the previous match
    (?!^)              # not at the start of the line
    [-\/]
  |                    # first occurrence
    ^
    (?:.*[^\d\/-])?    # (note the greedy quantifier here,
                       # to obtain the last result of the line)
)
\K                     # discards the characters matched so far from the whole match
\d++                   # one or more digits, with a possessive quantifier to forbid backtracking
(?![-\/]\D)            # not followed by a hyphen or a slash and then a non-digit
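As a rough illustration of how the pattern behaves with String#scan, here is a minimal Ruby sketch. The pattern is copied from above; the helper name extract_numbers and the sample strings are assumptions made up for this example (only "11-mill" and "XY-2822" come from the question), so adjust them to your real data.

```ruby
# Pattern from the answer above. \G makes the second and later matches
# start exactly where the previous one ended, and \K drops the prefix.
NUMBERS = /(?:\G(?!^)[-\/]|^(?:.*[^\d\/-])?)\K\d++(?![-\/]\D)/

# Hypothetical helper: returns every number the pattern finds in one line.
def extract_numbers(line)
  line.scan(NUMBERS)
end

p extract_numbers("Foo 12/34b")  # hypothetical input
p extract_numbers("11-mill")     # should be excluded
p extract_numbers("XY-2822")     # should be excluded
```

Because the pattern has no capture groups, scan returns a flat array of strings, one per number.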
You can improve the pattern if you replace (?:.*[^\d\/-])? with [^-\d\/\n]*+(?>[-\d\/]+[^-\d\/\n]+)* (remove the \n if you work line by line). The goal of this change is to limit backtracking, which then occurs atomic group by atomic group instead of character by character as in the first version.
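To check that the rewritten prefix does not change the results, a quick comparison along these lines may help. It is only a sketch: the constant names and the sample lines are assumptions, and the \n is dropped from the character classes because each sample is a single line.

```ruby
# First version: greedy .* that backtracks character by character.
SIMPLE_START = /(?:\G(?!^)[-\/]|^(?:.*[^\d\/-])?)\K\d++(?![-\/]\D)/

# Second version: possessive run plus atomic groups (no \n in the
# classes, since every sample below is a single line).
ATOMIC_START = /(?:\G(?!^)[-\/]|^[^-\d\/]*+(?>[-\d\/]+[^-\d\/]+)*)\K\d++(?![-\/]\D)/

# Hypothetical sample lines, not taken from the question.
SAMPLES = ["Foo 12/34b", "11-mill", "XY-2822", "Lot 5-6-7"]

SAMPLES.each do |line|
  puts "#{line.inspect}: #{line.scan(SIMPLE_START).inspect} vs #{line.scan(ATOMIC_START).inspect}"
end
```

Both columns should agree for these inputs; the difference is in how much work the engine does on long lines, not in what is matched.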
Perhaps you can replace the negative lookahead with a positive lookahead of this kind: (?=[-\/]\d|b|$)
Another version here.
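The positive lookahead is stricter than the negative one: a number is kept only when it is followed by a separator and a digit, by the letter b, or by the end of the line. The sketch below contrasts the two on made-up inputs; the constant names and sample strings are assumptions for illustration.

```ruby
# Negative lookahead: reject only "separator followed by a non-digit".
NEGATIVE = /(?:\G(?!^)[-\/]|^(?:.*[^\d\/-])?)\K\d++(?![-\/]\D)/

# Positive lookahead: require a separator plus digit, a 'b', or end of line.
POSITIVE = /(?:\G(?!^)[-\/]|^(?:.*[^\d\/-])?)\K\d++(?=[-\/]\d|b|$)/

# Hypothetical inputs.
["Foo 12/34b", "12 apples", "11-mill"].each do |line|
  puts "#{line.inspect}: negative=#{line.scan(NEGATIVE).inspect} positive=#{line.scan(POSITIVE).inspect}"
end
```

With these inputs, a number followed by a space and more text is accepted by the negative version but rejected by the positive one, so pick whichever matches your formats.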