I need to parse tables written in Jira Markup. So far I've managed to parse tables that are simple arrangements of headers and cells (cells containing either text or images or a combination of both).
I'm kind of stuck when it comes to cells with line breaks. My method does the following:
String JIRA_TABLE_REGEX = "(\\|\\|.*\\|\\|(\\n|\\r\\n|\\r))*(\\|.*\\|(\\n|\\r\\n|\\r)?)+";
Pattern TABLE_PATTERN = Pattern.compile(JIRA_TABLE_REGEX);
Matcher matcher = TABLE_PATTERN.matcher(input);
// TODO - fix rows with new lines
while (matcher.find()) {
String jiraTable = matcher.group();
// Split the input string into rows
String[] rowArray = jiraTable.split("(\\n|\\r\\n|\\r)");
int size = 0;
List<String> headers = new ArrayList<>();
List<String> cells = new ArrayList<>();
for (String row : rowArray) {
// If the row starts with "||" it's a header row
if (row.startsWith("||")) {
headers.addAll(List.of(row.substring(2, row.length() - 2).split("\\|\\|")));
size = headers.size();
} else if (row.startsWith("|")) {
cells.addAll(List.of(row.substring(1, row.length() - 1).split("\\|")));
if (size == 0) {
size = cells.size();
}
}
}
}
The regex being:
- \\|\\|.*\\|\\| - table header row.
- (\\n|\\r\\n|\\r) - line breaks.
- )* - zero or more header rows.
- \\|.*\\| - table content row.
- ? - zero or one line break.
- )+ - one or more content rows.
That code works for tables such as:
||Heading 1 Table 1||Heading 2 Table 1||
|_*BOLD AND ITALIC*_|_Italic_|
|Normal key|Normal Value|
|Third row, *just half BOLD*| |
giving every cell in the table as a result, but it fails miserably with:
||Heading 1 Table 2||Heading 2 Table 2||
|Col A1|Col A2|
|SECOND ROW|Second row too|
|Row with image|!image-2023-04-24-17-51-07-167.png|width=359,height=253!|
|Row with image and text |smol !image-2023-05-02-12-42-16-942.png|width=347,height=231! kitten|
|Row with text and new lines|Text
New Line
More text
Much more text|
where the result is missing the last cell.
I tried the suggested approach of splitting the input into rows with \|\n instead, changing both the split regex and the regex that matches the table:
private static final String JIRA_TABLE_REGEX_ALTERNATIVE = "(\\|\\|.*\\|\\|(\\n))*(\\|.*\\|(\\n)?)+";
...
rowArray = jiraTable.split("\\|\\n");
It still leaves out the last cell with all the line breaks.
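The behaviour can be reproduced in isolation. The following is a minimal sketch, not part of the original method; the class and method names are made up for illustration. It assumes the relevant detail is that, with Java's default flags, `.` does not match line terminators, so `\\|.*\\|` cannot reach the closing `|` of a cell that spans several lines:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotLineBreakDemo {
    // Hypothetical helper: returns the first match of the content-row
    // pattern "\|.*\|" in the input, or null if there is none.
    static String firstRowMatch(String input, int flags) {
        Matcher m = Pattern.compile("\\|.*\\|", flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String row = "|Row with text and new lines|Text\nNew Line\nMuch more text|";
        // Default flags: '.' stops at the first line terminator,
        // so only the first cell of the row is matched.
        System.out.println(firstRowMatch(row, 0));
        // DOTALL: '.' also matches line terminators, so the match
        // runs up to the last '|' of the input.
        System.out.println(firstRowMatch(row, Pattern.DOTALL));
    }
}
```

DOTALL is used here only to make the difference visible. On a full page it would let `.*` run across unrelated text up to the last `|`, so it is probably not a fix on its own.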
Sorry, in my original answer I didn't realize that you are, of course, processing a full Jira page and not only the text fragments you gave as examples.
I am not sure it will work in every case, but I tested your code and, after the regular expressions drove me crazy, I think I came up with a possible solution:
String input1 = "||Heading 1 Table 1||Heading 2 Table 1||\n" +
"|_*BOLD AND ITALIC*_|_Italic_|\n" +
"|Normal key|Normal Value|\n" +
"|Third row, *just half BOLD*| |";
String input2 = "||Heading 1 Table 2||Heading 2 Table 2||\n" +
"|Col A1|Col A2|\n" +
"|SECOND ROW|Second row too|\n" +
"|Row with image|!image-2023-04-24-17-51-07-167.png|width=359,height=253!|\n" +
"|Row with image and text |smol !image-2023-05-02-12-42-16-942.png|width=347,height=231! kitten|\n" +
"|Row with text and new lines|Text\n" +
" \n" +
" New Line\n" +
" \n" +
" More text\n" +
" Much more text|\n";
String input = input1 + "\nA bunch of text.\n" + input2;
String JIRA_TABLE_REGEX = "\\|\\|(.|\\R[^\\|])+\\|\\|\\R((\\|(?!\\|))(.|\\R[^\\|])+(\\|(?!\\|)\\R?))+";
Pattern TABLE_PATTERN = Pattern.compile(JIRA_TABLE_REGEX);
Matcher matcher = TABLE_PATTERN.matcher(input);
// TODO - fix rows with new lines
int nt = 0;
while (matcher.find()) {
System.out.printf("Table #%d\n", ++nt);
String jiraTable = matcher.group();
// Split the input string into rows
String[] rowArray = jiraTable.split("(\\|(\\|)?)\\R");
int size = 0;
List<String> headers = new ArrayList<>();
List<String> cells = new ArrayList<>();
for (String row : rowArray) {
// If the row starts with "||" it's a header row
if (row.startsWith("||")) {
headers.addAll(Arrays.asList(row.substring(2).split("\\|\\|")));
size = headers.size();
} else if (row.startsWith("|")) {
cells.addAll(Arrays.asList(row.substring(1).split("\\|")));
if (size == 0) {
size = cells.size();
}
}
}
System.out.println("Headers:" + headers);
System.out.println("Cells:" + cells);
}
Basically it is your original code with two small changes:
- The regex that matches a table now lets a header or cell continue across a line break, as long as the next line does not start with |:
String JIRA_TABLE_REGEX = "\\|\\|(.|\\R[^\\|])+\\|\\|\\R((\\|(?!\\|))(.|\\R[^\\|])+(\\|(?!\\|)\\R?))+";
Note that we use the regex term \R as a simplification to match any kind of line terminator.
- The rows are split on one or two | characters followed by any kind of line terminator:
String[] rowArray = jiraTable.split("(\\|(\\|)?)\\R");
I also modified the substring calls used in the processing of every row to avoid losing some end characters.
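The row split can be checked on its own. This is a small sketch with a made-up class name, helper, and sample table; it only shows what the split expression does to a table whose last cell contains a line break:

```java
import java.util.Arrays;
import java.util.List;

public class RowSplitDemo {
    // Hypothetical helper: splits a matched table into rows on "|" or "||"
    // followed by a line terminator (\R covers \n, \r\n, \r and others).
    static List<String> splitRows(String table) {
        return Arrays.asList(table.split("(\\|(\\|)?)\\R"));
    }

    public static void main(String[] args) {
        // Mixed line terminators and one multi-line cell.
        String table = "||H1||H2||\n|a|b|\r\n|c|line 1\nline 2|\n";
        for (String row : splitRows(table)) {
            // Brackets make the row boundaries visible.
            System.out.println("[" + row + "]");
        }
    }
}
```

Because the split consumes the trailing `|` or `||` of each row together with the line terminator, the rows no longer end with a delimiter. That is why the row processing uses `substring(2)` and `substring(1)` rather than also trimming characters from the end.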
Please be aware that I only tested the two examples you provided; you may need to tweak the regex a bit more to deal with tables without headers, etc.
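As one possible starting point for tables without headers, the header part of the expression can be wrapped in an optional group. This is only a sketch: the class, constant, and method names are invented for illustration, and it has been checked against nothing beyond the short sample string in `main`, so treat it as an assumption rather than a finished pattern:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalHeaderDemo {
    // Same building blocks as the regex above, but the header row is
    // wrapped in an optional non-capturing group (assumption, lightly tested).
    static final Pattern TABLE = Pattern.compile(
            "(?:\\|\\|(?:.|\\R[^\\|])+\\|\\|\\R)?"                  // optional header row
            + "(?:\\|(?!\\|)(?:.|\\R[^\\|])+\\|(?!\\|)\\R?)+");     // one or more content rows

    // Hypothetical helper: returns every table found in the input.
    static List<String> findTables(String input) {
        List<String> tables = new ArrayList<>();
        Matcher m = TABLE.matcher(input);
        while (m.find()) {
            tables.add(m.group());
        }
        return tables;
    }

    public static void main(String[] args) {
        String input = "Intro\n|a|b|\n|c|d|\n\nText\n||H1||H2||\n|x|y|\n";
        for (String table : findTables(input)) {
            System.out.println("Table:\n" + table);
        }
    }
}
```

The matched tables can then go through the same row split and cell split as before. Surrounding text that itself contains `|` characters may still confuse the pattern, so more test pages would be needed before relying on it.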
Running the example provides the following output:
Table #1
Headers:[Heading 1 Table 1, Heading 2 Table 1]
Cells:[_*BOLD AND ITALIC*_, _Italic_, Normal key, Normal Value, Third row, *just half BOLD*, ]
Table #2
Headers:[Heading 1 Table 2, Heading 2 Table 2]
Cells:[Col A1, Col A2, SECOND ROW, Second row too, Row with image, !image-2023-04-24-17-51-07-167.png, width=359,height=253!, Row with image and text , smol !image-2023-05-02-12-42-16-942.png, width=347,height=231! kitten, Row with text and new lines, Text
New Line
More text
Much more text]