I'm extracting fields from BibTeX and have a small problem: the values may or may not be wrapped in curly brackets. Please find the example text below:
@article{Roxas_2011, title={Social Desirability Bias in Survey Research on Sustainable Development in Small Firms: an Exploratory Analysis of Survey Mode Effect}, volume={21}, ISSN={1099-0836}, url={http://dx.doi.org/10.1002/bse.730}, DOI={10.1002/bse.730}, number={4}, journal={Business Strategy and the Environment}, publisher={Wiley}, author={Roxas, Banjo and Lindsay, Val}, year={2011}, month=sep, pages={223–235} }
As you can see, every field except month has the form x={y}, so a simple pattern (PHP preg_match with the mUg flags):
[\s,]+(.*)={(.*[^}])}
Does the trick for everything except month=sep.
If I try using ", " as the delimiter, it apparently also splits the authors. Can you please help me? :)
Thanks :)
You can use
[\s,]+(.*?)=(?|{([^{}]*)}|(\w+))
Note that you should not use any flags with this regex (you may use the `s` flag to make `.` match line break chars, and the `u` flag to make `\w` and `\s` match all Unicode word/whitespace chars, if you need that).
Details
- `[\s,]+` - one or more whitespace chars and/or commas
- `(.*?)` - Group 1: any zero or more chars other than line break chars, as few as possible
- `=` - a `=` char
- `(?|{([^{}]*)}|(\w+))` - a branch reset group matching:
  - `{([^{}]*)}` - a `{` char, any zero or more chars other than `{` and `}` captured into Group 2, and a `}` char
  - `|` - or
  - `(\w+)` - Group 2: one or more word chars.
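To show how the two groups can be consumed, here is a minimal sketch that runs the pattern over the entry from the question with preg_match_all and collects the fields into an associative array. The variable names ($bibtex, $pattern, $fields) are only illustrative, and it assumes the values contain no nested braces, since `[^{}]*` stops at the first `{` or `}`.

```php
<?php
// Sample entry from the question (the page range uses an en dash).
$bibtex = '@article{Roxas_2011, title={Social Desirability Bias in Survey Research on Sustainable Development in Small Firms: an Exploratory Analysis of Survey Mode Effect}, volume={21}, ISSN={1099-0836}, url={http://dx.doi.org/10.1002/bse.730}, DOI={10.1002/bse.730}, number={4}, journal={Business Strategy and the Environment}, publisher={Wiley}, author={Roxas, Banjo and Lindsay, Val}, year={2011}, month=sep, pages={223–235} }';

// The pattern from the answer, used without any flags.
$pattern = '/[\s,]+(.*?)=(?|{([^{}]*)}|(\w+))/';

$fields = [];
if (preg_match_all($pattern, $bibtex, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // $m[1] is the field name; $m[2] is the value, whether it was
        // braced or bare, because the branch reset group reuses Group 2.
        $fields[$m[1]] = $m[2];
    }
}

print_r($fields);
```

Because of the branch reset group, the value is always in index 2, so the loop does not need to check which alternative matched.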