Tags: java, regex, insert, stringbuilder

Dynamically inserting characters into a StringBuilder and Java Matcher


I have the following scenario:

I have a one-line flat file. The line is structured so that it has a header followed by the corresponding data. It looks something like this:

HEADER1 data data data data data HEADER2 data data HEADER3 data HEADER4 data ....

I have to convert this one-liner to a format where each header is on a separate line, along with its data. So, it should look like this:

HEADER1 data data data data data
HEADER2 data data 
HEADER3 data

The "HEADER" itself follows a consistent pattern in its length and in the type of characters it can use, so I figured a Java regex Pattern and a Matcher would be the way to go.

I am using a StringBuilder because it has an insert() method, which I use to insert a line separator.

The problem I am having is that there is always a line at the end of my newly created file (the one with the line separator inserts) that consists of several headers, i.e. they don't get broken onto new lines. The reason seems to be that as soon as Matcher.find() reaches a match whose start index lies outside the Matcher's region, execution leaves the loop where the new line is inserted.

This behavior is very inconsistent. I have flat files that are fairly short (about 50 lines), where the problem does not appear. Then I have flat files that are 20K bytes/characters, where the problem does appear.

It seems that when the Matcher runs Matcher.find(), it goes off the initial data (region) that was supplied when the one-liner was read. Let's say the Matcher's region is from 0 to 19688. But as I insert System.lineSeparator(), the StringBuilder grows by 2 characters (\r\n) with every insert, while the Matcher's region stays the same.
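To make that suspicion concrete, here is a minimal sketch that reproduces the effect on a tiny input. The header pattern HDR\d and the fixed "\r\n" separator are assumptions made up for the demo; they are not the real AL3 header regex or the real data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StaleRegionDemo {
    // Hypothetical stand-in for the real header regex, which is not shown here.
    static final Pattern HEADER = Pattern.compile("HDR\\d");

    /** Repeats the insert-while-matching loop from the question with a fixed "\r\n". */
    static String naiveSplit(String oneLiner) {
        StringBuilder line = new StringBuilder(oneLiner);
        Matcher matcher = HEADER.matcher(line);
        while (matcher.find()) {
            // Each insert shifts the remaining text right by two characters,
            // but the matcher's region end is not updated.
            line.insert(matcher.start(), "\r\n");
        }
        return line.toString();
    }

    public static void main(String[] args) {
        StringBuilder line = new StringBuilder("HDR1 a HDR2 b HDR3");
        Matcher matcher = HEADER.matcher(line);
        line.insert(0, "\r\n");
        // The builder grew to 20 characters; the region end is still 18.
        System.out.println(matcher.regionEnd() + " vs " + line.length());

        // Line breaks are shown as '|' so the result fits on one line.
        System.out.println(naiveSplit("HDR1 a HDR2 b HDR3").replace("\r\n", "|"));
    }
}
```

With this input, the last header has been pushed past the original region end by the time the matcher gets to it, so it stays on the same line as HDR2. A short file with few headers shifts the text only slightly, which may be why the problem is easier to see on large files.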

I have tried using Matcher.reset() or modifying the Matcher's region, as suggested here: Replace text in StringBuilder via regex

How do I deal with this issue in the most efficient and correct way? Thanks

P.S. The regex is not the problem. My regex matches every single header I have in the one-liner. I just thought I'd point that out to avoid discussing the regex itself.

Here is my code:

    BufferedReader br = new BufferedReader(new FileReader(Constants.SOURCE_LOCATION + fileName));
    try {
        String origLine = br.readLine();
        StringBuilder line = null;

        while (origLine != null) {
            line = new StringBuilder(origLine);
            Pattern pattern = Pattern.compile(Constants.AL3GROUP_REGEX_PATTERN);
            Matcher matcher = pattern.matcher(line);

            while (matcher.find()) {
                line.insert(matcher.start(), System.lineSeparator());
            }

            origLine = br.readLine();
        }

        converterFileContents = line.toString();

        PrintWriter writer = new PrintWriter("sample\\output.txt");
        writer.println(converterFileContents);
        writer.close();

        System.out.println(converterFileContents);
    } finally {
        br.close();
    }

Solution

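  • Try replaceAll on the String instead of inserting into the StringBuilder while the Matcher is still iterating over it. The pattern matches the space in front of each header and replaces it with a line break, keeping the header itself via $1 (HEADER\\d+ stands in for your actual header regex):

        str = str.replaceAll(" (HEADER\\d+)", "\r\n$1");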
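For completeness, a sketch of the replaceAll idea as a small runnable class, plus a variant that keeps the StringBuilder for anyone who prefers inserting in place. The class name, method names, and the HEADER\d+ pattern are placeholders chosen for illustration, not part of the original code. The in-place variant relies on Matcher.find(int), which resets the matcher (so it sees the builder's current length) before searching from the given index, and it assumes the header pattern never matches an empty string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HeaderSplitter {
    // Hypothetical header pattern; substitute the real header regex here.
    static final Pattern HEADER = Pattern.compile("HEADER\\d+");
    // The same pattern preceded by the space that gets replaced by the line break.
    static final Pattern SPACE_THEN_HEADER = Pattern.compile(" (" + HEADER.pattern() + ")");

    /** Replaces the space before each header with the separator in one pass. */
    static String splitWithReplaceAll(String oneLiner, String separator) {
        return SPACE_THEN_HEADER.matcher(oneLiner)
                .replaceAll(Matcher.quoteReplacement(separator) + "$1");
    }

    /** Inserts the separator before each header, re-syncing the matcher every time. */
    static String splitInPlace(String oneLiner, String separator) {
        StringBuilder line = new StringBuilder(oneLiner);
        Matcher matcher = HEADER.matcher(line);
        int from = 0;
        // find(int) resets the matcher first, so the grown builder is fully searched.
        while (matcher.find(from)) {
            int start = matcher.start();
            int end = matcher.end();
            if (start > 0) { // no line break in front of the very first header
                line.insert(start, separator);
                end += separator.length(); // the header just moved to the right
            }
            from = end; // continue after the header that was just handled
        }
        return line.toString();
    }

    public static void main(String[] args) {
        String input = "HEADER1 data data HEADER2 data HEADER3 data";
        System.out.println(splitWithReplaceAll(input, System.lineSeparator()));
        System.out.println(splitInPlace(input, System.lineSeparator()));
    }
}
```

The two methods differ slightly: replaceAll swaps the space before each header for the separator, while the in-place version leaves that space at the end of the previous line. For a single pass over a 20K-character line, the replaceAll version is probably the simpler choice, since repeated insert() calls each shift the rest of the buffer.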