I have the following text:
Attorney General William Barr said the volume of information compromised was “staggering” and the largest breach in U.S. history.“This theft not only caused significant financial damage to Equifax but invaded the privacy of many, millions of Americans and imposed substantial costs and burdens on them as they had to take measures to protect themselves from identity theft,” said Mr. Barr.
I want to match the text inside quotation marks, but only when the quote is at least 5 words long; shorter quotes should be ignored.
Currently, I am using the following regex:
(?<=[\\“|\\"])[A-Za-z0-9\.\-][A-Za-z\s,:\\’]+(?=[\”|\"])
However, this also matches the quote “staggering”, which is only 1 word long and should be ignored.
I realize I could accomplish this by repeating this part of the regex 5 times:
[A-Za-z\s,:\\’]+[A-Za-z\s,:\\’]+[A-Za-z\s,:\\’]+[A-Za-z\s,:\\’]+[A-Za-z\s,:\\’]+
However, I am wondering if there is a shorter and more concise way to achieve this. Perhaps by forcing the \s in [] to appear at least 5 times?
Thanks
You need to "unroll" the character class by taking the whitespace pattern out of it and using a pattern like [<chars>]+(?:\s+[<chars>]+){4,}, that is, one word followed by four or more whitespace-separated words, which gives at least five words in total. Note that you should not use lookarounds here because " can be both a leading and a trailing marker, and that may result in unwanted matches. Use a capturing group instead and access its value via matcher.group(1).
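To illustrate the point about lookarounds, here is a minimal sketch. The class name QuoteDelimiterDemo, the helper findAll and the sample sentence are made up for this illustration, and the simplified [^"]+ body stands in for the real word pattern. Because lookarounds do not consume the quote characters, the closing " of one quotation can serve as the opening " of the next match, so the text between two quotations is reported as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteDelimiterDemo {
    // Hypothetical helper: collects the requested group (0 = whole match) of every hit.
    static List<String> findAll(String regex, int group, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "He said \"yes\" and then \"no\" again";

        // Lookarounds leave the quotes unconsumed, so the text between
        // the two quotations is matched too.
        System.out.println(findAll("(?<=\")[^\"]+(?=\")", 0, input));
        // [yes,  and then , no]

        // Consuming both quotes pairs them up; Group 1 holds the content.
        System.out.println(findAll("\"([^\"]+)\"", 1, input));
        // [yes, no]
    }
}
```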
You may use
String regex = "[“\"]([A-Za-z0-9.-][A-Za-z,:’]*(?:\\s+[A-Za-z0-9.-][A-Za-z,:’]*){4,})[”\"]";
Then, just grab the Group 1 value:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String line = "Attorney General William Barr said the volume of information compromised was “staggering” and the largest breach in U.S. history.“This theft not only caused significant financial damage to Equifax but invaded the privacy of many, millions of Americans and imposed substantial costs and burdens on them as they had to take measures to protect themselves from identity theft,” said Mr. Barr.";
String regex = "[“\"]([A-Za-z0-9.-][A-Za-z,:’]*(?:\\s+[A-Za-z0-9.-][A-Za-z,:’]*){4,})[”\"]";
Matcher m = Pattern.compile(regex).matcher(line);
List<String> res = new ArrayList<>();
while (m.find()) {
    // Group 1 is the quote text without the surrounding quotation marks
    res.add(m.group(1));
}
// Prints only the long quote; “staggering” is skipped
System.out.println(res);
Pattern details
- [“"] - “ or "
- ([A-Za-z0-9.-][A-Za-z,:’]*(?:\s+[A-Za-z0-9.-][A-Za-z,:’]*){4,}) - Group 1:
  - [A-Za-z0-9.-][A-Za-z,:’]* - an ASCII alphanumeric, . or -, and then 0+ ASCII letters, commas, colons or ’ chars
  - (?:\s+[A-Za-z0-9.-][A-Za-z,:’]*){4,} - four or more occurrences of
    - \s+ - 1+ whitespaces
    - [A-Za-z0-9.-][A-Za-z,:’]* - an ASCII alphanumeric, . or -, and then 0+ ASCII letters, commas, colons or ’ chars
- [”"] - " or ”
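If the minimum word count needs to vary, the {4,} quantifier can be derived from it: N words means one word followed by at least N - 1 whitespace-separated words. The sketch below assumes that reading. The class MinWordQuotes, its methods quotePattern and quotes, and the sample sentence are hypothetical names introduced only for this example. The curly quotes are written as Unicode escapes so the source does not depend on the file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MinWordQuotes {
    // Same word pattern as in the answer; \u2019 is the ’ character.
    private static final String WORD = "[A-Za-z0-9.-][A-Za-z,:\u2019]*";

    // Builds the pattern for quotes of at least minWords words:
    // one word plus (minWords - 1) or more whitespace-separated words.
    static Pattern quotePattern(int minWords) {
        if (minWords < 1) {
            throw new IllegalArgumentException("minWords must be >= 1");
        }
        String regex = "[\u201C\"](" + WORD
                + "(?:\\s+" + WORD + "){" + (minWords - 1) + ",})[\u201D\"]";
        return Pattern.compile(regex);
    }

    // Returns the Group 1 value (quote text without quotation marks) of every match.
    static List<String> quotes(String text, int minWords) {
        List<String> res = new ArrayList<>();
        Matcher m = quotePattern(minWords).matcher(text);
        while (m.find()) {
            res.add(m.group(1));
        }
        return res;
    }

    public static void main(String[] args) {
        String text = "She called it \u201Cstaggering\u201D and "
                + "\u201Cone of the largest breaches ever\u201D today";
        System.out.println(quotes(text, 5)); // [one of the largest breaches ever]
        System.out.println(quotes(text, 1)); // [staggering, one of the largest breaches ever]
    }
}
```

With minWords set to 5 this produces the same {4,} quantifier as the regex above.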