Is there an existing (not hand-written) way to replace all occurrences of a byte sequence/string inside a byte array? I need to create a byte array containing platform-dependent text (Linux uses line feed, Windows uses carriage return + line feed). I know this can be implemented manually, but I am looking for an out-of-the-box solution. Note that these byte arrays are large and that I am processing a large number of them, so the solution needs to perform well.
My current approach:
var byteArray = resourceLoader.getResource("classpath:File.txt").getInputStream().readAllBytes();
byteArray = new String(byteArray)
.replaceAll((schemeModel.getOsType() == SystemTypes.LINUX) ? "\r\n" : "\n",
(schemeModel.getOsType() == SystemTypes.LINUX) ? "\n" : "\r\n"
).getBytes(StandardCharsets.UTF_8);
This approach does not perform well because it creates new Strings and uses a regex to find occurrences. I know that a manual implementation would have to look at sequences of bytes because of the two-byte Windows line ending, and would therefore also need to reallocate the array whenever its length changes.
Apache Commons Lang contains ArrayUtils, which has the method byte[] removeAllOccurrences(byte[] array, byte element). Is there any third-party library with a similar method for replacing all occurrences of a byte sequence/string inside a byte array?
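For comparison, the manual approach described above can be sketched in a single pass over the bytes. This is only an illustration, not a library method: the class name LineEndingBytes and the boolean windows parameter are hypothetical, and the code assumes an ASCII-compatible encoding such as UTF-8, where the bytes 0x0D and 0x0A never occur inside a multi-byte character. Lone carriage returns are left untouched.

```java
import java.io.ByteArrayOutputStream;

final class LineEndingBytes {
    /** Rewrites every CRLF or lone LF as the target line ending in one pass. */
    static byte[] normalize(byte[] input, boolean windows) {
        // The initial capacity is only a guess; the stream grows as needed.
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length + input.length / 8);
        for (int i = 0; i < input.length; i++) {
            byte b = input[i];
            if (b == '\r' && i + 1 < input.length && input[i + 1] == '\n') {
                continue; // drop the CR of a CRLF pair; the LF is handled on the next iteration
            }
            if (b == '\n') {
                if (windows) {
                    out.write('\r');
                }
                out.write('\n');
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }
}
```

No String is created and no regex is involved, at the cost of one extra output buffer per array.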
Edit: As @saka1029 mentioned in the comments, my approach doesn't work for the Windows OS type. Because of this bug I need to stick with regexes, as follows:
(schemeModel.getOsType() == SystemTypes.LINUX) ? "\\r\\n" : "(?<!\\r)\\n",
(schemeModel.getOsType() == SystemTypes.LINUX) ? "\n" : "\r\n")
This way, in the Windows case, only occurrences of '\n' without a preceding '\r' are found and replaced with '\r\n'. (The regex uses a negative lookbehind so that it matches only the '\n' itself; matching [^\r]\n directly would also consume the last character of the line.) Such a workflow cannot be implemented using conventional replace methods, which invalidates this question.
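A minimal sketch of the two conversions with precompiled patterns, assuming the intent is to match a '\n' only when no '\r' comes directly before it. The class and method names here are hypothetical; the negative lookbehind (?<!\r) is standard java.util.regex syntax.

```java
import java.util.regex.Pattern;

final class NewlineRegex {
    // Matches a '\n' only when the previous character is not '\r' (negative lookbehind).
    private static final Pattern LONE_LF = Pattern.compile("(?<!\\r)\\n");
    // Matches a literal CR LF pair.
    private static final Pattern CRLF = Pattern.compile("\\r\\n");

    /** Turns lone LF into CRLF and leaves existing CRLF pairs alone. */
    static String toWindows(String text) {
        return LONE_LF.matcher(text).replaceAll("\r\n");
    }

    /** Turns CRLF into LF. */
    static String toLinux(String text) {
        return CRLF.matcher(text).replaceAll("\n");
    }
}
```

Compiling the patterns once avoids recompiling the regex on every call, which String.replaceAll would otherwise do.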
If you’re reading text, you should treat it as text, not as bytes. Use a BufferedReader to read the lines one by one, and insert your own newline sequences.
String newline = schemeModel.getOsType() == SystemTypes.LINUX ? "\n" : "\r\n";
OutputStream out = /* ... */;
try (Writer writer = new BufferedWriter(
new OutputStreamWriter(out, StandardCharsets.UTF_8));
BufferedReader reader = new BufferedReader(
new InputStreamReader(
resourceLoader.getResource("classpath:File.txt").getInputStream(),
StandardCharsets.UTF_8))) {
String line;
while ((line = reader.readLine()) != null) {
writer.write(line);
writer.write(newline);
}
}
No byte array needed, and you are using only a small amount of memory—the amount needed to hold the largest line encountered. (I rarely see text with a line longer than one kilobyte, but even one megabyte would be a pretty small memory requirement.)
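To see how the line-by-line approach behaves, here is a self-contained sketch that runs it against in-memory streams. The class LineCopy and its methods are hypothetical helpers written for this illustration. Two consequences of BufferedReader.readLine are worth knowing: it treats a lone '\r' as a line break too, and the loop always writes a newline after the last line, even if the input did not end with one.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class LineCopy {
    /** Copies text line by line, ending every line with the given newline sequence. */
    static void copyLines(InputStream in, OutputStream out, String newline) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        Writer writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.write(newline);
        }
        writer.flush(); // the caller owns the streams and is responsible for closing them
    }

    /** Convenience wrapper for in-memory data, used here only for demonstration. */
    static byte[] convert(byte[] input, String newline) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            copyLines(new ByteArrayInputStream(input), out, newline);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] input = "one\r\ntwo\nthree".getBytes(StandardCharsets.UTF_8);
        String result = new String(convert(input, "\r\n"), StandardCharsets.UTF_8);
        // Make the control characters visible when printing.
        System.out.println(result.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```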
If you are “fixing” zip entries, the OutputStream can be a ZipOutputStream pointing to a new ZipEntry:
String newline = schemeModel.getOsType() == SystemTypes.LINUX ? "\n" : "\r\n";
ZipInputStream oldZip = /* ... */;
ZipOutputStream newZip = /* ... */;
ZipEntry entry;
while ((entry = oldZip.getNextEntry()) != null) {
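// Use a fresh ZipEntry: the old one may carry a size and CRC that no longer match once line endings change.
newZip.putNextEntry(new ZipEntry(entry.getName()));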
// We only want to fix line endings in text files.
if (!entry.getName().matches(".*\\." +
"(?i:txt|x?html?|xml|json|[ch]|cpp|cs|py|java|properties|jsp)")) {
oldZip.transferTo(newZip);
continue;
}
Writer writer = new BufferedWriter(
new OutputStreamWriter(newZip, StandardCharsets.UTF_8));
BufferedReader reader = new BufferedReader(
new InputStreamReader(oldZip, StandardCharsets.UTF_8));
String line;
while ((line = reader.readLine()) != null) {
writer.write(line);
writer.write(newline);
}
writer.flush();
}
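The file-name test in the loop above can be pulled out into a small helper so the pattern is compiled once rather than on every entry. This is a sketch only; the class name TextEntryFilter and the method isTextEntry are hypothetical, and the extension list is the one from the loop.

```java
import java.util.regex.Pattern;

final class TextEntryFilter {
    // Same expression as in the loop: any name ending in one of these extensions, case-insensitively.
    private static final Pattern TEXT_NAME = Pattern.compile(
        ".*\\.(?i:txt|x?html?|xml|json|[ch]|cpp|cs|py|java|properties|jsp)");

    /** Returns true if the entry name looks like a text file whose line endings should be fixed. */
    static boolean isTextEntry(String entryName) {
        return TEXT_NAME.matcher(entryName).matches();
    }
}
```

Inside the loop, the condition would then read if (!TextEntryFilter.isTextEntry(entry.getName())).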
Some notes:
Consider using \n for everything except Windows. That is, schemeModel.getOsType() == SystemTypes.WINDOWS ? "\r\n" : "\n".
Your code calls new String(byteArray), which assumes the bytes of your resource use the default Charset of the system on which your program is running. I suspect this is not what you intended; I have added StandardCharsets.UTF_8 to the construction of the InputStreamReader to address this. If you really meant to read the bytes using the default Charset, you can remove that second constructor argument.
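A small illustration of why the charset matters, assuming the resource is UTF-8 and that the wrong charset in effect happens to be ISO-8859-1 (the class CharsetDemo is hypothetical). The same bytes decode to different text depending on the charset used:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetDemo {
    /** Decodes bytes with an explicit charset instead of relying on the platform default. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "é" followed by a line feed; in UTF-8 this is the three bytes 0xC3 0xA9 0x0A.
        byte[] utf8 = "\u00e9\n".getBytes(StandardCharsets.UTF_8);
        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(right.length()); // 2: 'é' and '\n'
        System.out.println(wrong.length()); // 3: each byte becomes its own character
    }
}
```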