java regex localization parentheses

How to find parentheses in a string independently of the locale?


I need to find the first complete pair of parentheses in a Java String and, if it is non-nested, return its content. The current issue is that parentheses may be represented by different characters in different locales/languages.

My first idea was, of course, to use regular expressions. But besides the fact that it seems quite difficult (at least to me) to make sure that a match contains no nested parentheses when something like "\((.*)\)" is used, there seems to be no class of parenthesis-like characters available in Java's java.util.regex.Pattern.
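
To make the nesting concern concrete, here is a minimal sketch for ASCII parentheses only. It contrasts the greedy pattern quoted above with a negated character class; the class and method names are made up for illustration. Note that the negated class does not reject nested input, it simply matches the innermost pair.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonNestedAsciiDemo {

    // Greedy version from the text: spans from the first '(' to the last ')'.
    static final Pattern GREEDY = Pattern.compile("\\((.*)\\)");

    // Negated class: the content may contain neither '(' nor ')',
    // so a single match can never include another pair.
    static final Pattern FLAT = Pattern.compile("\\(([^()]*)\\)");

    // Returns the content of the first match of the given pattern, if any.
    static Optional<String> firstGroup(final Pattern pattern, final String line) {
        final Matcher matcher = pattern.matcher(line);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(final String[] args) {
        final String line = "shortName1 (LongName 1) and (more)";
        System.out.println(firstGroup(GREEDY, line).orElse("<none>")); // LongName 1) and (more
        System.out.println(firstGroup(FLAT, line).orElse("<none>"));   // LongName 1
    }
}
```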

Thus, I tried to solve the problem more imperatively, but stumbled across the issue that the data I need to process is in different languages, and there are different parenthesis characters depending on the locale. Western: (), Chinese (Locale "zh"): （）

package main;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

public class FindParentheses {

    static public Set<String> searchNames(final String string) throws IOException {
        final Set<String> foundName = new HashSet<>();
        final BufferedReader stringReader = new BufferedReader(new StringReader(string));
        for (String line = stringReader.readLine(); line != null; line = stringReader.readLine()) {
            final int indexOfFirstOpeningBrace = line.indexOf('(');
            if (indexOfFirstOpeningBrace > -1) {
                final String afterFirstOpeningParenthesis = line.substring(indexOfFirstOpeningBrace + 1);
                final int indexOfNextOpeningParenthesis = afterFirstOpeningParenthesis.indexOf('(');
                final int indexOfNextClosingParenthesis = afterFirstOpeningParenthesis.indexOf(')');
                /*
                 * If the following condition is fulfilled, there is a simple braced expression
                 * after the found product's short name. Otherwise, there may be an additional
                 * nested pair of braces, or the closing brace may be missing, in which cases the
                 * expression is rejected as a product's long name.
                 */
                if (indexOfNextClosingParenthesis > 0
                    && (indexOfNextClosingParenthesis < indexOfNextOpeningParenthesis
                        || indexOfNextOpeningParenthesis < 0)) {
                    final String content = afterFirstOpeningParenthesis.substring(0, indexOfNextClosingParenthesis);
                    foundName.add(content);
                }
            }
        }
        return foundName;
    }

    public static void main(final String args[]) throws IOException {
        for (final String foundName : searchNames(
            "Something meaningful: shortName1 (LongName 1).\n" +
                "Localization issue here: shortName2 （保险丝2）. This one should be found, too.\n" +
                "Easy again: shortName3 (LongName 3).\n" +
            "Yet more random text...")) {
            System.out.println(foundName);
        }
    }

}

The second name, which is enclosed in Chinese (fullwidth) parentheses, is not found, but should be. Of course I could match those characters as an additional special case, but as my project uses 23 languages, including Korean and Japanese, I would prefer a solution that finds any pair of parentheses.
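
One way to avoid listing special cases, sketched here under stated assumptions: Java's regex engine does accept Unicode general categories, and `\p{Ps}` ("Punctuation, open") and `\p{Pe}` ("Punctuation, close") cover the fullwidth parentheses as well as the ASCII ones. The caveats are that these categories also include square brackets and braces, and that the pattern does not check that the closing character belongs to the opening one. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeCategoryDemo {

    // \p{Ps} = any opening bracket-like character, \p{Pe} = any closing one.
    // The content in between must not contain either kind.
    static final Pattern ANY_BRACKET_PAIR =
        Pattern.compile("\\p{Ps}([^\\p{Ps}\\p{Pe}]+)\\p{Pe}");

    // Collects the content of every flat bracket pair in the text.
    static List<String> contents(final String text) {
        final List<String> found = new ArrayList<>();
        final Matcher matcher = ANY_BRACKET_PAIR.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(final String[] args) {
        // \uFF08 and \uFF09 are the fullwidth left and right parentheses.
        System.out.println(contents(
            "shortName1 (LongName 1).\nshortName2 \uFF08\u4FDD\u9669\u4E1D2\uFF09."));
    }
}
```

If only parentheses should count, the categories are too broad and an explicit character list is the safer choice.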


Solution

  • Emma's answer links to Brian Campbell's list of all Unicode brackets. I used it to enumerate all relevant characters, as Wiktor Stribiżew suggested; in my case, all parentheses are of interest.

    In addition, I preferred to make sure that only matching parentheses are considered, which led me to this ugly regular expression in Java:

    public static final String ANY_PARENTHESES = "\\([^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+\\)|⁽[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+⁾|₍[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+₎|❨[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+❩|❪[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+❫|⟮[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+⟯|⦅[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+⦆|⸨[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+⸩|﴾[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+﴿|︵[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+︶|﹙[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+﹚|（[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+）|｟[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+｠";
    

    which I actually constructed with the following code:

        public static final char LEFT_PARENTHESIS = '\u0028', // (
            SUPERSCRIPT_LEFT_PARENTHESIS = '\u207D', // ⁽
            SUBSCRIPT_LEFT_PARENTHESIS = '\u208D', // ₍
            MEDIUM_LEFT_PARENTHESIS_ORNAMENT = '\u2768', // ❨
            MEDIUM_FLATTENED_LEFT_PARENTHESIS_ORNAMENT = '\u276A', // ❪
            MATHEMATICAL_LEFT_FLATTENED_PARENTHESIS = '\u27EE', // ⟮
            LEFT_WHITE_PARENTHESIS = '\u2985', // ⦅
            LEFT_DOUBLE_PARENTHESIS = '\u2E28', // ⸨
            ORNATE_LEFT_PARENTHESIS = '\uFD3E', // ﴾
            PRESENTATION_FORM_FOR_VERTICAL_LEFT_PARENTHESIS = '\uFE35', // ︵
            SMALL_LEFT_PARENTHESIS = '\uFE59', // ﹙
            FULLWIDTH_LEFT_PARENTHESIS = '\uFF08', // （
            FULLWIDTH_LEFT_WHITE_PARENTHESIS = '\uFF5F'; // ｟
    
        public static final char RIGHT_PARENTHESIS = '\u0029', // )
            SUPERSCRIPT_RIGHT_PARENTHESIS = '\u207E', // ⁾
            SUBSCRIPT_RIGHT_PARENTHESIS = '\u208E', // ₎
            MEDIUM_RIGHT_PARENTHESIS_ORNAMENT = '\u2769', // ❩
            MEDIUM_FLATTENED_RIGHT_PARENTHESIS_ORNAMENT = '\u276B', // ❫
            MATHEMATICAL_RIGHT_FLATTENED_PARENTHESIS = '\u27EF', // ⟯
            RIGHT_WHITE_PARENTHESIS = '\u2986', // ⦆
            RIGHT_DOUBLE_PARENTHESIS = '\u2E29', // ⸩
            ORNATE_RIGHT_PARENTHESIS = '\uFD3F', // ﴿
            PRESENTATION_FORM_FOR_VERTICAL_RIGHT_PARENTHESIS = '\uFE36', // ︶
            SMALL_RIGHT_PARENTHESIS = '\uFE5A', // ﹚
            FULLWIDTH_RIGHT_PARENTHESIS = '\uFF09', // ）
            FULLWIDTH_RIGHT_WHITE_PARENTHESIS = '\uFF60'; // ｠
    
        public static final String NO_PARENTHESES = "[^\\" + LEFT_PARENTHESIS + SUPERSCRIPT_LEFT_PARENTHESIS
            + SUBSCRIPT_LEFT_PARENTHESIS + MEDIUM_LEFT_PARENTHESIS_ORNAMENT + MEDIUM_FLATTENED_LEFT_PARENTHESIS_ORNAMENT
            + MATHEMATICAL_LEFT_FLATTENED_PARENTHESIS + LEFT_WHITE_PARENTHESIS + LEFT_DOUBLE_PARENTHESIS
            + ORNATE_LEFT_PARENTHESIS + PRESENTATION_FORM_FOR_VERTICAL_LEFT_PARENTHESIS + SMALL_LEFT_PARENTHESIS
            + FULLWIDTH_LEFT_PARENTHESIS + FULLWIDTH_LEFT_WHITE_PARENTHESIS + "\\" + RIGHT_PARENTHESIS
            + SUPERSCRIPT_RIGHT_PARENTHESIS + SUBSCRIPT_RIGHT_PARENTHESIS + MEDIUM_RIGHT_PARENTHESIS_ORNAMENT
            + MEDIUM_FLATTENED_RIGHT_PARENTHESIS_ORNAMENT + MATHEMATICAL_RIGHT_FLATTENED_PARENTHESIS
            + RIGHT_WHITE_PARENTHESIS + RIGHT_DOUBLE_PARENTHESIS + ORNATE_RIGHT_PARENTHESIS
            + PRESENTATION_FORM_FOR_VERTICAL_RIGHT_PARENTHESIS + SMALL_RIGHT_PARENTHESIS + FULLWIDTH_RIGHT_PARENTHESIS
            + FULLWIDTH_RIGHT_WHITE_PARENTHESIS + "]+";
    
        public static final String PARENTHESES = "\\" + LEFT_PARENTHESIS + NO_PARENTHESES + "\\" + RIGHT_PARENTHESIS;
    
        public static final String SUPERSCRIPT_PARENTHESES =
            "" + SUPERSCRIPT_LEFT_PARENTHESIS + NO_PARENTHESES + SUPERSCRIPT_RIGHT_PARENTHESIS;
    
        public static final String SUBSCRIPT_PARENTHESES =
            "" + SUBSCRIPT_LEFT_PARENTHESIS + NO_PARENTHESES + SUBSCRIPT_RIGHT_PARENTHESIS;
    
        public static final String MEDIUM_PARENTHESES_ORNAMENT =
            "" + MEDIUM_LEFT_PARENTHESIS_ORNAMENT + NO_PARENTHESES + MEDIUM_RIGHT_PARENTHESIS_ORNAMENT;
    
        public static final String MEDIUM_FLATTENED_PARENTHESES_ORNAMENT =
            "" + MEDIUM_FLATTENED_LEFT_PARENTHESIS_ORNAMENT + NO_PARENTHESES + MEDIUM_FLATTENED_RIGHT_PARENTHESIS_ORNAMENT;
    
        public static final String MATHEMATICAL_FLATTENED_PARENTHESES =
            "" + MATHEMATICAL_LEFT_FLATTENED_PARENTHESIS + NO_PARENTHESES + MATHEMATICAL_RIGHT_FLATTENED_PARENTHESIS;
    
        public static final String WHITE_PARENTHESES =
            "" + LEFT_WHITE_PARENTHESIS + NO_PARENTHESES + RIGHT_WHITE_PARENTHESIS;
    
        public static final String DOUBLE_PARENTHESES =
            "" + LEFT_DOUBLE_PARENTHESIS + NO_PARENTHESES + RIGHT_DOUBLE_PARENTHESIS;
    
        public static final String ORNATE_PARENTHESES =
            "" + ORNATE_LEFT_PARENTHESIS + NO_PARENTHESES + ORNATE_RIGHT_PARENTHESIS;
    
        public static final String PRESENTATION_FORM_FOR_VERTICAL_PARENTHESES =
            "" + PRESENTATION_FORM_FOR_VERTICAL_LEFT_PARENTHESIS + NO_PARENTHESES
            + PRESENTATION_FORM_FOR_VERTICAL_RIGHT_PARENTHESIS;
    
        public static final String SMALL_PARENTHESES =
            "" + SMALL_LEFT_PARENTHESIS + NO_PARENTHESES + SMALL_RIGHT_PARENTHESIS;
    
        public static final String FULLWIDTH_PARENTHESES =
            "" + FULLWIDTH_LEFT_PARENTHESIS + NO_PARENTHESES + FULLWIDTH_RIGHT_PARENTHESIS;
    
        public static final String FULLWIDTH_WHITE_PARENTHESES =
            "" + FULLWIDTH_LEFT_WHITE_PARENTHESIS + NO_PARENTHESES + FULLWIDTH_RIGHT_WHITE_PARENTHESIS;
    
        public static final char XOR = '|'; // regex alternation operator; the first alternative that matches wins
    
        public static final String ANY_PARENTHESES = PARENTHESES
            + XOR + SUPERSCRIPT_PARENTHESES
            + XOR + SUBSCRIPT_PARENTHESES
            + XOR + MEDIUM_PARENTHESES_ORNAMENT
            + XOR + MEDIUM_FLATTENED_PARENTHESES_ORNAMENT
            + XOR + MATHEMATICAL_FLATTENED_PARENTHESES
            + XOR + WHITE_PARENTHESES
            + XOR + DOUBLE_PARENTHESES
            + XOR + ORNATE_PARENTHESES
            + XOR + PRESENTATION_FORM_FOR_VERTICAL_PARENTHESES
            + XOR + SMALL_PARENTHESES
            + XOR + FULLWIDTH_PARENTHESES
            + XOR + FULLWIDTH_WHITE_PARENTHESES;
    

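    Note, however, that it does not reject nested parentheses: because the content between the delimiters may not contain any parenthesis, nested input is not skipped but yields a match on its innermost pair.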
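
If nested input should be rejected outright, as the question asks, a plain scan may be easier than a regular expression. The following is a minimal sketch, not the poster's method: it reuses the 13 code point pairs from the constants above, stored as two parallel strings, and returns the content of the first pair only when that pair is flat, non-empty, and closed by the matching character. The class and method names are hypothetical.

```java
import java.util.Optional;

class FirstFlatPair {

    // Opening characters and, at the same index, their closing counterparts
    // (the same 13 pairs as the constants above, written as escapes).
    static final String OPEN =
        "\u0028\u207D\u208D\u2768\u276A\u27EE\u2985\u2E28\uFD3E\uFE35\uFE59\uFF08\uFF5F";
    static final String CLOSE =
        "\u0029\u207E\u208E\u2769\u276B\u27EF\u2986\u2E29\uFD3F\uFE36\uFE5A\uFF09\uFF60";

    static Optional<String> firstFlatContent(final String line) {
        int start = -1;          // index of the first opening character, -1 if none yet
        char expectedClose = 0;  // closing character that belongs to that opener
        for (int i = 0; i < line.length(); i++) {
            final char c = line.charAt(i);
            final int openIndex = OPEN.indexOf(c);
            if (start < 0) {
                // Still looking for the first opener; stray closers are ignored.
                if (openIndex >= 0) {
                    start = i;
                    expectedClose = CLOSE.charAt(openIndex);
                }
            } else if (openIndex >= 0) {
                return Optional.empty(); // a second opener means the pair is nested
            } else if (CLOSE.indexOf(c) >= 0) {
                // First closer after the opener: accept only a matching,
                // non-empty pair.
                if (c == expectedClose && i > start + 1) {
                    return Optional.of(line.substring(start + 1, i));
                }
                return Optional.empty();
            }
        }
        return Optional.empty(); // no opener at all, or the pair is never closed
    }

    public static void main(final String[] args) {
        System.out.println(firstFlatContent("shortName1 (LongName 1).").orElse("<rejected>"));
        System.out.println(firstFlatContent("nested (a (b) c)").orElse("<rejected>"));
    }
}
```

Keeping the pairs in two index-aligned strings means that supporting another parenthesis style only requires appending one character to each string.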