
How can I parse tables written in Jira markup if a cell has a line break?


I need to parse tables written in Jira Markup. So far I've managed to parse tables that are simple arrangements of headers and cells (cells containing either text or images or a combination of both).

I'm kind of stuck when it comes to cells with line breaks. My method does the following:

    String JIRA_TABLE_REGEX = "(\\|\\|.*\\|\\|(\\n|\\r\\n|\\r))*(\\|.*\\|(\\n|\\r\\n|\\r)?)+";
    Pattern TABLE_PATTERN = Pattern.compile(JIRA_TABLE_REGEX);
    Matcher matcher = TABLE_PATTERN.matcher(input);

    // TODO - fix rows with new lines
    while (matcher.find()) {
        String jiraTable = matcher.group();
        // Split the input string into rows
        String[] rowArray = jiraTable.split("(\\n|\\r\\n|\\r)");
        int size = 0;
        List<String> headers = new ArrayList<>();
        List<String> cells = new ArrayList<>();

        for (String row : rowArray) {
            // If the row starts with "||" it's a header row
            if (row.startsWith("||")) {
                headers.addAll(List.of(row.substring(2, row.length() - 2).split("\\|\\|")));
                size = headers.size();
            } else if (row.startsWith("|")) {
                cells.addAll(List.of(row.substring(1, row.length() - 1).split("\\|")));
                if (size == 0) {
                    size = cells.size();
                }
            }
        }
    }

The regex breaks down as follows:

  1. \\|\\|.*\\|\\| - table header row.
  2. (\\n|\\r\\n|\\r) - line breaks.
  3. )* - zero or more header rows.
  4. \\|.*\\| - table content row.
  5. ? - zero or one line break.
  6. )+ - one or more content rows.

That code works for tables such as:

||Heading 1 Table 1||Heading 2 Table 1||
|_*BOLD AND ITALIC*_|_Italic_|
|Normal key|Normal Value|
|Third row, *just half BOLD*| |

giving every cell in the table as the result,

but it fails miserably with:

||Heading 1 Table 2||Heading 2 Table 2||
|Col A1|Col A2|
|SECOND ROW|Second row too|
|Row with image|!image-2023-04-24-17-51-07-167.png|width=359,height=253!|
|Row with image and text |smol  !image-2023-05-02-12-42-16-942.png|width=347,height=231! kitten|
|Row with text and new lines|Text
  
 New Line
  
 More text
 Much more text|

with the result missing the last cell.

UPDATE 1

I tried the suggested approach ("split the input into rows with \|\n instead"), changing both the split regex and the regex that matches the table:

    private static final String JIRA_TABLE_REGEX_ALTERNATIVE = "(\\|\\|.*\\|\\|(\\n))*(\\|.*\\|(\\n)?)+";

...

rowArray = jiraTable.split("\\|\\n");

It still leaves out the last cell with all the line breaks.
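
A likely reason the split change makes no difference: by default `.` does not match line terminators in Java, so the table regex itself stops matching inside the multi-line cell, and the text after the first line break never reaches the split. The sketch below runs the original table regex on a small made-up table (the class name, helper name, and sample data are illustrative, not from the question):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OriginalRegexDemo {
    // The table regex from the question, unchanged.
    static final Pattern TABLE = Pattern.compile(
        "(\\|\\|.*\\|\\|(\\n|\\r\\n|\\r))*(\\|.*\\|(\\n|\\r\\n|\\r)?)+");

    // Hypothetical helper: returns the first table match, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = TABLE.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The last cell spans two lines.
        String table = "||H1||H2||\n|a|b|\n|c|line 1\nline 2|\n";
        // The match ends right after "|c|"; "line 1\nline 2|" is not part of it.
        System.out.println(firstMatch(table).replace("\n", "\\n"));
    }
}
```

Because the cell content is already missing from `matcher.group()`, no choice of split regex can bring it back; the table regex has to be allowed to cross line breaks inside a cell.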


Solution

  • Sorry, in my original answer I didn't realize that you are processing a full Jira page, not only the text fragments you gave as examples.

    I am not sure it will work in every case, but I tested your code and, after the regular expressions drove me crazy, I think I came up with a possible solution:

    String input1 = "||Heading 1 Table 1||Heading 2 Table 1||\n" +
        "|_*BOLD AND ITALIC*_|_Italic_|\n" +
        "|Normal key|Normal Value|\n" +
        "|Third row, *just half BOLD*| |";
    
    String input2 = "||Heading 1 Table 2||Heading 2 Table 2||\n" +
        "|Col A1|Col A2|\n" +
        "|SECOND ROW|Second row too|\n" +
        "|Row with image|!image-2023-04-24-17-51-07-167.png|width=359,height=253!|\n" +
        "|Row with image and text |smol  !image-2023-05-02-12-42-16-942.png|width=347,height=231! kitten|\n" +
        "|Row with text and new lines|Text\n" +
        "  \n" +
        " New Line\n" +
        "  \n" +
        " More text\n" +
        " Much more text|\n";
    
    String input = input1 + "\nA bunch of text.\n" + input2;
    
    String JIRA_TABLE_REGEX = "\\|\\|(.|\\R[^\\|])+\\|\\|\\R((\\|(?!\\|))(.|\\R[^\\|])+(\\|(?!\\|)\\R?))+";
    Pattern TABLE_PATTERN = Pattern.compile(JIRA_TABLE_REGEX);
    Matcher matcher = TABLE_PATTERN.matcher(input);
    
    // Process every table found in the input
    int nt = 0;
    while (matcher.find()) {
      System.out.printf("Table #%d\n", ++nt);
      String jiraTable = matcher.group();
      // Split the input string into rows
      String[] rowArray = jiraTable.split("(\\|(\\|)?)\\R");
      int size = 0;
      List<String> headers = new ArrayList<>();
      List<String> cells = new ArrayList<>();
    
      for (String row : rowArray) {
        // If the row starts with "||" it's a header row
        if (row.startsWith("||")) {
          headers.addAll(Arrays.asList(row.substring(2).split("\\|\\|")));
          size = headers.size();
        } else if (row.startsWith("|")) {
          cells.addAll(Arrays.asList(row.substring(1).split("\\|")));
          if (size == 0) {
            size = cells.size();
          }
        }
      }
    
      System.out.println("Headers:" + headers);
      System.out.println("Cells:" + cells);
    }
    

    Basically it is your original code with two small changes:

    • On one hand, the full table regex has been modified as follows:
    String JIRA_TABLE_REGEX = "\\|\\|(.|\\R[^\\|])+\\|\\|\\R((\\|(?!\\|))(.|\\R[^\\|])+(\\|(?!\\|)\\R?))+";
    

    Note that we use the regex term \R (available since Java 8) as a simplification to match any kind of line terminator.

    • On the other hand, we split the table into rows at one (content row) or two (header row) | characters followed by any kind of line terminator:
    String[] rowArray = jiraTable.split("(\\|(\\|)?)\\R");
    

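    I also modified the substring calls used to process each row: the split now consumes the trailing | or ||, so only the leading delimiter has to be stripped, and trimming the end as well would cut off the last character of the final cell.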
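
To see the row split and the adjusted substring calls in isolation, here is a minimal sketch on a made-up table with mixed line terminators. The class name, the `parseRows` helper, and the sample data are assumptions for illustration; the split and substring expressions are the ones from the code above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowSplitDemo {
    // Hypothetical helper: splits one matched table into rows, then each row into cells.
    static List<List<String>> parseRows(String table) {
        List<List<String>> rows = new ArrayList<>();
        // Split at "|" or "||" followed by any line terminator (\R covers \n, \r\n, \r).
        for (String row : table.split("(\\|(\\|)?)\\R")) {
            if (row.startsWith("||")) {
                // Header row: only the leading "||" is left to strip.
                rows.add(Arrays.asList(row.substring(2).split("\\|\\|")));
            } else if (row.startsWith("|")) {
                // Content row: only the leading "|" is left to strip.
                rows.add(Arrays.asList(row.substring(1).split("\\|")));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String table = "||H1||H2||\r\n|a|b|\n|c|line 1\nline 2|\n";
        // The line break inside the last cell is not preceded by "|", so it does not split the row.
        System.out.println(parseRows(table));
    }
}
```

The line break inside a cell survives because the split only fires when a line terminator directly follows a pipe.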

    Please be aware that I only tested the two examples you provided; you may need to tweak the regex further to deal with tables without headers, etc.
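
As one possible tweak for tables without headers, the header part of the regex can be wrapped in an optional group. The variant below is a sketch, not part of the tested solution: the class name, the `findTables` helper, and the sample page are hypothetical, and it has only been reasoned through on this small input. It also uses non-capturing groups, which is a stylistic choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderlessTableDemo {
    // Same structure as the regex above, but the header row is optional (assumed variant).
    static final Pattern TABLE = Pattern.compile(
        "(?:\\|\\|(?:.|\\R[^|])+\\|\\|\\R)?"                 // optional header row
        + "(?:\\|(?!\\|)(?:.|\\R[^|])+\\|(?!\\|)\\R?)+");    // one or more content rows

    // Hypothetical helper: returns every table found in the page text.
    static List<String> findTables(String page) {
        List<String> tables = new ArrayList<>();
        Matcher m = TABLE.matcher(page);
        while (m.find()) {
            tables.add(m.group());
        }
        return tables;
    }

    public static void main(String[] args) {
        String page = "Intro text.\n"
            + "|a|b|\n|c|line 1\nline 2|\n"   // table without a header row
            + "Outro.\n"
            + "||H1||H2||\n|x|y|\n";          // table with a header row
        for (String table : findTables(page)) {
            System.out.println("Table: " + table.replace("\n", "\\n"));
        }
    }
}
```

As with the original regex, text between tables that itself contains a | character could still be pulled into the preceding row, so this remains a starting point rather than a complete parser.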

    Running the example produces the following output:

    Table #1
    Headers:[Heading 1 Table 1, Heading 2 Table 1]
    Cells:[_*BOLD AND ITALIC*_, _Italic_, Normal key, Normal Value, Third row, *just half BOLD*,  ]
    Table #2
    Headers:[Heading 1 Table 2, Heading 2 Table 2]
    Cells:[Col A1, Col A2, SECOND ROW, Second row too, Row with image, !image-2023-04-24-17-51-07-167.png, width=359,height=253!, Row with image and text , smol  !image-2023-05-02-12-42-16-942.png, width=347,height=231! kitten, Row with text and new lines, Text
      
     New Line
      
     More text
     Much more text]
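
Note that in the output above the image cells are split in two (`!image-2023-04-24-17-51-07-167.png` and `width=359,height=253!`), because the | inside the image markup is treated as a cell separator. One way to keep such cells together, sketched here under the assumption that image markup with attributes always has the form `!name|attributes!` with no whitespace in the name, is to collect cells with a matcher instead of `split`. The class and helper names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CellSplitDemo {
    // A cell is a run of either image markup with attributes ("!name|attrs!")
    // or any single character other than "|". The image alternative is tried first.
    static final Pattern CELL = Pattern.compile("(?:![^!|\\s]+\\|[^!|]*!|[^|])+");

    // Hypothetical helper: splits one row (with or without its leading "|") into cells.
    static List<String> splitCells(String row) {
        List<String> cells = new ArrayList<>();
        Matcher m = CELL.matcher(row);
        while (m.find()) {
            cells.add(m.group());
        }
        return cells;
    }

    public static void main(String[] args) {
        System.out.println(splitCells("|Row with image|!image.png|width=359,height=253!"));
        System.out.println(splitCells("|Hello! world|next"));
    }
}
```

An exclamation mark followed by whitespace is treated as plain text, but a cell such as `Wow!great|x!` would still be misread as image markup, so the pattern is a heuristic. Completely empty cells are skipped by this approach, which only matters if the input can contain them.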