How can I parse tables written in Jira markup if a cell has a line break?

I need to parse tables written in Jira Markup. So far I've managed to parse tables that are simple arrangements of headers and cells (cells containing either text or images or a combination of both).

I'm kind of stuck when it comes to cells with line breaks. My method does the following:

    String JIRA_TABLE_REGEX = "(\\|\\|.*\\|\\|(\\n|\\r\\n|\\r))*(\\|.*\\|(\\n|\\r\\n|\\r)?)+";
    Pattern TABLE_PATTERN = Pattern.compile(JIRA_TABLE_REGEX);
    Matcher matcher = TABLE_PATTERN.matcher(input);

    // TODO - fix rows with new lines
    while (matcher.find()) {
        String jiraTable = matcher.group();
        // Split the input string into rows
        String[] rowArray = jiraTable.split("(\\n|\\r\\n|\\r)");
        int size = 0;
        List<String> headers = new ArrayList<>();
        List<String> cells = new ArrayList<>();

        for (String row : rowArray) {
            // If the row starts with "||" it's a header row
            if (row.startsWith("||")) {
                headers.addAll(List.of(row.substring(2, row.length() - 2).split("\\|\\|")));
                size = headers.size();
            } else if (row.startsWith("|")) {
                cells.addAll(List.of(row.substring(1, row.length() - 1).split("\\|")));
                if (size == 0) {
                    size = cells.size();
                }
            }
        }
    }

The regex breaks down as follows:

\\|\\|.*\\|\\| - a table header row.
(\\n|\\r\\n|\\r) - a line break.
(...)* - zero or more header rows, each followed by a line break.
\\|.*\\| - a table content row.
(\\n|\\r\\n|\\r)? - zero or one line break.
(...)+ - one or more content rows.

That code works for tables such as:

||Heading 1 Table 1||Heading 2 Table 1||
|_*BOLD AND ITALIC*_|_Italic_|
|Normal key|Normal Value|
|Third row, *just half BOLD*| |

The result contains every cell in the table.

However, it fails miserably with:

||Heading 1 Table 2||Heading 2 Table 2||
|Col A1|Col A2|
|SECOND ROW|Second row too|
|Row with image|!image-2023-04-24-17-51-07-167.png|width=359,height=253!|
|Row with image and text |smol  !image-2023-05-02-12-42-16-942.png|width=347,height=231! kitten|
|Row with text and new lines|Text
  
 New Line
  
 More text
 Much more text|

The result is missing the last cell, the one that contains the line breaks.

UPDATE 1

I tried the suggested approach ("split the input into rows with \|\n instead"), changing both the split regex and the regex that matches the table:

    private static final String JIRA_TABLE_REGEX_ALTERNATIVE = "(\\|\\|.*\\|\\|(\\n))*(\\|.*\\|(\\n)?)+";

...

    rowArray = jiraTable.split("\\|\\n");

It still leaves out the last cell with all the line breaks.
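
For reference, here is a minimal sketch of why the last cell gets lost, assuming the cause is that `.` does not match line terminators by default in `java.util.regex`. The row pattern `\|.*\|` needs both pipes on the same line, so the match stops at the last pipe before the first line break inside the cell. The class name `DotLineBreakDemo` and the shortened table are made up for illustration; the regex is the original one from above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotLineBreakDemo {
    // The table regex from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile(
        "(\\|\\|.*\\|\\|(\\n|\\r\\n|\\r))*(\\|.*\\|(\\n|\\r\\n|\\r)?)+");

    // Returns the first table match, or null if nothing matches.
    static String firstMatch(String input) {
        Matcher m = ORIGINAL.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical table: the last cell spans two lines.
        String table = "||H1||H2||\n"
            + "|A1|A2|\n"
            + "|Multi|Text\n"
            + " More text|\n";
        // The match ends at "|Multi|"; "Text\n More text" is never captured.
        System.out.println(firstMatch(table));
    }
}
```

Because the matched text itself no longer contains the multi-line cell, changing only the `split` call afterwards cannot bring it back.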

Solution

Sorry, in my original answer I didn't realize that you are, of course, processing a full Jira page and not just the text fragments you gave as examples.

I am not sure if it will work in every case, but I tested your code and, after the regular expressions drove me crazy, I think I came up with a possible solution:

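// Needs java.util.ArrayList, java.util.Arrays, java.util.List,
// java.util.regex.Matcher and java.util.regex.Pattern.
String input1 = "||Heading 1 Table 1||Heading 2 Table 1||\n" +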
    "|_*BOLD AND ITALIC*_|_Italic_|\n" +
    "|Normal key|Normal Value|\n" +
    "|Third row, *just half BOLD*| |";

String input2 = "||Heading 1 Table 2||Heading 2 Table 2||\n" +
    "|Col A1|Col A2|\n" +
    "|SECOND ROW|Second row too|\n" +
    "|Row with image|!image-2023-04-24-17-51-07-167.png|width=359,height=253!|\n" +
    "|Row with image and text |smol  !image-2023-05-02-12-42-16-942.png|width=347,height=231! kitten|\n" +
    "|Row with text and new lines|Text\n" +
    "  \n" +
    " New Line\n" +
    "  \n" +
    " More text\n" +
    " Much more text|\n";

String input = input1 + "\nA bunch of text.\n" + input2;

String JIRA_TABLE_REGEX = "\\|\\|(.|\\R[^\\|])+\\|\\|\\R((\\|(?!\\|))(.|\\R[^\\|])+(\\|(?!\\|)\\R?))+";
Pattern TABLE_PATTERN = Pattern.compile(JIRA_TABLE_REGEX);
Matcher matcher = TABLE_PATTERN.matcher(input);

// Number the tables as they are found
int nt = 0;
while (matcher.find()) {
  System.out.printf("Table #%d\n", ++nt);
  String jiraTable = matcher.group();
  // Split the input string into rows
  String[] rowArray = jiraTable.split("(\\|(\\|)?)\\R");
  int size = 0;
  List<String> headers = new ArrayList<>();
  List<String> cells = new ArrayList<>();

  for (String row : rowArray) {
    // If the row starts with "||" it's a header row
    if (row.startsWith("||")) {
      headers.addAll(Arrays.asList(row.substring(2).split("\\|\\|")));
      size = headers.size();
    } else if (row.startsWith("|")) {
      cells.addAll(Arrays.asList(row.substring(1).split("\\|")));
      if (size == 0) {
        size = cells.size();
      }
    }
  }

  System.out.println("Headers:" + headers);
  System.out.println("Cells:" + cells);
}

Basically it is your original code with two small changes:

On one hand, the full table regex has been modified as follows:

String JIRA_TABLE_REGEX = "\\|\\|(.|\\R[^\\|])+\\|\\|\\R((\\|(?!\\|))(.|\\R[^\\|])+(\\|(?!\\|)\\R?))+";

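Note that we use the regex term \R as a simplification to match any kind of line terminator. The sub-expression (.|\\R[^\\|]) lets a row continue across a line break as long as the next line does not start with |, which is what keeps a multi-line cell inside its row.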
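
As a small illustration of `\R` (available since Java 8), the sketch below splits a string that mixes `\n`, `\r\n` and `\r`. The class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

class LinebreakMatcherDemo {
    // Splits text at any line terminator; \R treats "\r\n" as a single break.
    static List<String> splitLines(String text) {
        return Arrays.asList(text.split("\\R"));
    }

    public static void main(String[] args) {
        // Three different terminators, one pattern.
        System.out.println(splitLines("a\nb\r\nc\rd")); // prints [a, b, c, d]
    }
}
```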

On the other hand, we split the table into rows by searching for one (content row) or two (header row) | characters followed by any kind of line terminator:

String[] rowArray = jiraTable.split("(\\|(\\|)?)\\R");

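I also changed the substring calls used when processing each row: the split already consumes the trailing | or ||, so only the leading delimiter has to be stripped. Otherwise the last characters of each row would be lost.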
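
To see the row split in isolation, here is a minimal sketch on a shortened, made-up table (the class name `RowSplitDemo` is hypothetical too). A line break inside a cell is not preceded by |, so it does not end the row.

```java
class RowSplitDemo {
    // Splits a matched table into rows at "|" or "||" followed by a line break.
    static String[] splitRows(String jiraTable) {
        return jiraTable.split("(\\|(\\|)?)\\R");
    }

    public static void main(String[] args) {
        String table = "||H1||H2||\n"
            + "|A1|Text\n"
            + " More text|\n";
        for (String row : splitRows(table)) {
            // Brackets make the embedded line break in the second row visible.
            System.out.println("[" + row + "]");
        }
    }
}
```

This yields two rows, `||H1||H2` and `|A1|Text\n More text`, each without its trailing delimiter. One limitation to keep in mind: a cell line that happens to end with | right before an inner line break would be cut there.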

Please be aware that I only tested the two examples you provided; you may need to tweak the regex further to handle tables without headers and other variations.

Running the full code above on both tables produces the following output:

Table #1
Headers:[Heading 1 Table 1, Heading 2 Table 1]
Cells:[_*BOLD AND ITALIC*_, _Italic_, Normal key, Normal Value, Third row, *just half BOLD*,  ]
Table #2
Headers:[Heading 1 Table 2, Heading 2 Table 2]
Cells:[Col A1, Col A2, SECOND ROW, Second row too, Row with image, !image-2023-04-24-17-51-07-167.png, width=359,height=253!, Row with image and text , smol  !image-2023-05-02-12-42-16-942.png, width=347,height=231! kitten, Row with text and new lines, Text
  
 New Line
  
 More text
 Much more text]
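
Note that in the output above each image cell is split in two, because the | inside `!image.png|width=...,height=...!` is treated as a cell separator. If that matters for your use case, one possible way around it is to scan each row and skip over image markup before splitting. The sketch below is only a heuristic under my own assumptions: the `IMAGE` pattern (a `!`, a file name without spaces, a `|`, attributes, a closing `!`) is a guess based on the two examples, not the official Jira grammar, and `ImageAwareCellSplitter` and `splitCells` are made-up names. It expects a row whose leading | has already been removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImageAwareCellSplitter {
    // Assumed shape of an image with attributes: !name|attributes!
    private static final Pattern IMAGE = Pattern.compile("![^!|\\s]+\\|[^!|]*!");

    // Splits a row at "|" but keeps pipes inside image markup in the cell.
    static List<String> splitCells(String row) {
        List<String> cells = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Matcher m = IMAGE.matcher(row);
        int i = 0;
        while (i < row.length()) {
            m.region(i, row.length());
            if (m.lookingAt()) {
                // Image markup starts here: copy it whole, pipe included.
                current.append(m.group());
                i = m.end();
            } else if (row.charAt(i) == '|') {
                // A pipe outside image markup ends the current cell.
                cells.add(current.toString());
                current.setLength(0);
                i++;
            } else {
                current.append(row.charAt(i));
                i++;
            }
        }
        cells.add(current.toString());
        return cells;
    }

    public static void main(String[] args) {
        System.out.println(splitCells(
            "Row with image|!image-2023-04-24-17-51-07-167.png|width=359,height=253!"));
    }
}
```

To try it in the code above, `Arrays.asList(row.substring(1).split("\\|"))` could be replaced with `splitCells(row.substring(1))`. Unlike `String.split`, this sketch keeps trailing empty cells, and text such as `a!b|c!` would be mistaken for an image, so treat it as a starting point only.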